REF: slice before constructing JoinUnit #52542

jbrockmendel · 2023-04-08T18:27:59Z

I'm seeing small speedups in the case from #50652, though not as much as I had hoped for. The main upside is that this significantly simplifies JoinUnit.

topper-123

Nice, looks good.

mroeschke · 2023-04-10T18:25:27Z

Thanks @jbrockmendel

glemaitre · 2023-04-13T09:56:12Z

It looks like this PR introduced a change in behaviour (for the better):

import pandas as pd

arr1 = [1.0, '1', 'Allen, Miss. Elisabeth Walton', 'female', 29.0, 0.0, 0.0, '24160', 211.3375, 'B5', 'S', '2', None, 'St Louis, MO']
arr2 = [
    [1.0, '1', 'Allison, Master. Hudson Trevor', 'male', 0.9167, 1.0, 2.0, '113781', 151.55, 'C22 C26', 'S', '11', None, 'Montreal, PQ / Chesterville, ON'],
    [1.0, '0', 'Allison, Miss. Helen Loraine', 'female', 2.0, 1.0, 2.0, '113781', 151.55, 'C22 C26', 'S', None, None, 'Montreal, PQ / Chesterville, ON'],
    [1.0, '0', 'Allison, Mr. Hudson Joshua Creighton', 'male', 30.0, 1.0, 2.0, '113781', 151.55, 'C22 C26', 'S', None, 135.0, 'Montreal, PQ / Chesterville, ON']
]

pd.concat([pd.DataFrame([arr1]), pd.DataFrame(arr2)], ignore_index=True)

The previous code would output:

    0  1                                     2       3        4    5    6       7         8        9  10    11     12                               13
0  1.0  1         Allen, Miss. Elisabeth Walton  female  29.0000  0.0  0.0   24160  211.3375       B5  S     2   None                     St Louis, MO
1  1.0  1        Allison, Master. Hudson Trevor    male   0.9167  1.0  2.0  113781  151.5500  C22 C26  S    11    NaN  Montreal, PQ / Chesterville, ON
2  1.0  0          Allison, Miss. Helen Loraine  female   2.0000  1.0  2.0  113781  151.5500  C22 C26  S  None    NaN  Montreal, PQ / Chesterville, ON
3  1.0  0  Allison, Mr. Hudson Joshua Creighton    male  30.0000  1.0  2.0  113781  151.5500  C22 C26  S  None  135.0  Montreal, PQ / Chesterville, ON

while main outputs:

    0  1                                     2       3        4    5    6       7         8        9  10    11     12                               13
0  1.0  1         Allen, Miss. Elisabeth Walton  female  29.0000  0.0  0.0   24160  211.3375       B5  S     2    NaN                     St Louis, MO
1  1.0  1        Allison, Master. Hudson Trevor    male   0.9167  1.0  2.0  113781  151.5500  C22 C26  S    11    NaN  Montreal, PQ / Chesterville, ON
2  1.0  0          Allison, Miss. Helen Loraine  female   2.0000  1.0  2.0  113781  151.5500  C22 C26  S  None    NaN  Montreal, PQ / Chesterville, ON
3  1.0  0  Allison, Mr. Hudson Joshua Creighton    male  30.0000  1.0  2.0  113781  151.5500  C22 C26  S  None  135.0  Montreal, PQ / Chesterville, ON

Looking at column 12, the dtype has changed from object to float64 (the None in row 0 is now NaN). For our specific use case this is an improvement, because the resulting cast is more appropriate.

However, it introduces an inconsistency between DataFrame and Series:

In [6]: pd.concat([pd.DataFrame([[None]]), pd.DataFrame([[None], [None], [1.0]])], ignore_index=True)
Out[6]: 
     0
0  NaN
1  NaN
2  NaN
3  1.0

In [7]: pd.concat([pd.Series([None]), pd.Series([None, None, 1.0])], ignore_index=True)
Out[7]: 
0    None
1     NaN
2     NaN
3     1.0
dtype: object

I don't know to what extent this behaviour could be considered a breaking change.
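To check which behaviour a given pandas installation shows, the two concatenations above can be compared directly on their dtypes. This is only a minimal sketch: the helper name `concat_dtypes` is hypothetical, and since the resulting dtypes depend on the pandas version (that is the point of the comparison), no expected output is given.

```python
import pandas as pd


def concat_dtypes():
    """Return (DataFrame column dtype, Series dtype) for the two concat calls.

    Hypothetical helper for illustration; it only re-runs the two
    snippets from the comment above and reports the resulting dtypes.
    """
    frame_result = pd.concat(
        [pd.DataFrame([[None]]), pd.DataFrame([[None], [None], [1.0]])],
        ignore_index=True,
    )
    series_result = pd.concat(
        [pd.Series([None]), pd.Series([None, None, 1.0])],
        ignore_index=True,
    )
    # Column 0 is the only column of the concatenated frame.
    return frame_result[0].dtype, series_result.dtype


frame_dtype, series_dtype = concat_dtypes()
print(f"DataFrame column dtype: {frame_dtype}")
print(f"Series dtype:           {series_dtype}")
print(f"consistent:             {frame_dtype == series_dtype}")
```

If the last line prints `False`, the installed version shows the DataFrame/Series inconsistency described above.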

mroeschke · 2023-04-17T17:30:17Z

@jbrockmendel it might be good to have a follow-up with a unit test and a whatsnew entry about this change

PERF: slice before constructing JoinUnit

61899ff

topper-123 approved these changes Apr 9, 2023

topper-123 added the Refactor (Internal refactoring of code) and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Apr 9, 2023

topper-123 added this to the 2.1 milestone Apr 9, 2023

mroeschke approved these changes Apr 10, 2023

mroeschke merged commit 0cb4d1b into pandas-dev:main Apr 10, 2023

jbrockmendel deleted the perf-concat-indexers branch April 10, 2023 18:54

glemaitre mentioned this pull request Apr 12, 2023

⚠️ CI failed on Linux_Nightly.pylatest_pip_scipy_dev ⚠️ scikit-learn/scikit-learn#26154

Closed
