Skip to content

REF: slice before constructing JoinUnit #52542

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 10, 2023

Conversation

jbrockmendel
Copy link
Member

I'm seeing small speedups in the case from #50652, not as much as I had hoped for. The main upside is significantly simplifying JoinUnit.

Copy link
Contributor

@topper-123 topper-123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, looks good.

@topper-123 topper-123 added Refactor Internal refactoring of code Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Apr 9, 2023
@topper-123 topper-123 added this to the 2.1 milestone Apr 9, 2023
@mroeschke mroeschke merged commit 0cb4d1b into pandas-dev:main Apr 10, 2023
@mroeschke
Copy link
Member

Thanks @jbrockmendel

@glemaitre
Copy link
Contributor

It looks like this PR made a change in behaviour (for the best):

import pandas as pd

arr1 = [1.0, '1', 'Allen, Miss. Elisabeth Walton', 'female', 29.0, 0.0, 0.0, '24160', 211.3375, 'B5', 'S', '2', None, 'St Louis, MO']
arr2 = [
    [1.0, '1', 'Allison, Master. Hudson Trevor', 'male', 0.9167, 1.0, 2.0, '113781', 151.55, 'C22 C26', 'S', '11', None, 'Montreal, PQ / Chesterville, ON'],
    [1.0, '0', 'Allison, Miss. Helen Loraine', 'female', 2.0, 1.0, 2.0, '113781', 151.55, 'C22 C26', 'S', None, None, 'Montreal, PQ / Chesterville, ON'],
    [1.0, '0', 'Allison, Mr. Hudson Joshua Creighton', 'male', 30.0, 1.0, 2.0, '113781', 151.55, 'C22 C26', 'S', None, 135.0, 'Montreal, PQ / Chesterville, ON']

pd.concat([pd.DataFrame([arr1]), pd.DataFrame(arr2)], ignore_index=True)
]

The previous code would output:

    0  1                                     2       3        4    5    6       7         8        9  10    11     12                               13
0  1.0  1         Allen, Miss. Elisabeth Walton  female  29.0000  0.0  0.0   24160  211.3375       B5  S     2   None                     St Louis, MO
1  1.0  1        Allison, Master. Hudson Trevor    male   0.9167  1.0  2.0  113781  151.5500  C22 C26  S    11    NaN  Montreal, PQ / Chesterville, ON
2  1.0  0          Allison, Miss. Helen Loraine  female   2.0000  1.0  2.0  113781  151.5500  C22 C26  S  None    NaN  Montreal, PQ / Chesterville, ON
3  1.0  0  Allison, Mr. Hudson Joshua Creighton    male  30.0000  1.0  2.0  113781  151.5500  C22 C26  S  None  135.0  Montreal, PQ / Chesterville, ON

while main output:

    0  1                                     2       3        4    5    6       7         8        9  10    11     12                               13
0  1.0  1         Allen, Miss. Elisabeth Walton  female  29.0000  0.0  0.0   24160  211.3375       B5  S     2    NaN                     St Louis, MO
1  1.0  1        Allison, Master. Hudson Trevor    male   0.9167  1.0  2.0  113781  151.5500  C22 C26  S    11    NaN  Montreal, PQ / Chesterville, ON
2  1.0  0          Allison, Miss. Helen Loraine  female   2.0000  1.0  2.0  113781  151.5500  C22 C26  S  None    NaN  Montreal, PQ / Chesterville, ON
3  1.0  0  Allison, Mr. Hudson Joshua Creighton    male  30.0000  1.0  2.0  113781  151.5500  C22 C26  S  None  135.0  Montreal, PQ / Chesterville, ON

Looking at column 12, the dtype has changed from object to float64. Somehow, in our specific use case, it is better because we have a more appropriate casting.

However, it introduces an inconsistency between DataFrame and Series:

In [6]: pd.concat([pd.DataFrame([[None]]), pd.DataFrame([[None], [None], [1.0]])], ignore_index=True)
Out[6]: 
     0
0  NaN
1  NaN
2  NaN
3  1.0

In [7]: pd.concat([pd.Series([None]), pd.Series([None, None, 1.0])], ignore_index=True)
Out[7]: 
0    None
1     NaN
2     NaN
3     1.0
dtype: object

I don't know until which point some of the behaviour could be considered as breaking change here.

@mroeschke
Copy link
Member

@jbrockmendel might be good to have a follow up with a unit test and whatsnew entry about this change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Refactor Internal refactoring of code Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants