Skip to content

ENH: Add support for ea dtypes in merge #49876

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Nov 29, 2022

Conversation

phofl
Copy link
Member

@phofl phofl commented Nov 23, 2022

@phofl phofl added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Nov 23, 2022
dtype = find_common_type([lk.dtype, rk.dtype])
if isinstance(dtype, ExtensionDtype):
cls = dtype.construct_array_type()
if not isinstance(lk, ExtensionArray):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of all of the if checks can we set up another function or class to dispatch to that handles this more gracefully? I think ideally could also eliminate the need for the factorizer dict

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but would prefer to do this as a follow up.

Getting rid of the dict is probably not easy if doable at all

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not for the dict at least all these if checks I think should be wrapped in a separate function. We are losing a good bit of abstraction here compared to the previous code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep this makes sense, but as I said would prefer doing this as a follow up.

@mroeschke
Copy link
Member

ASV failures might be real

2022-11-24T01:09:56.6735358Z [ 30.77%] ··· join_merge.MergeEA.time_merge                          2/10 failed
2022-11-24T01:09:56.6736230Z [ 30.77%] ··· ========= ==========
2022-11-24T01:09:56.6738310Z                 dtype             
2022-11-24T01:09:56.6738758Z               --------- ----------
2022-11-24T01:09:56.6739260Z                 Int64    12.8±0ms 
2022-11-24T01:09:56.6740223Z                 Int32    12.5±0ms 
2022-11-24T01:09:56.6740600Z                 Int16    11.4±0ms 
2022-11-24T01:09:56.6740850Z                  Int8     failed  
2022-11-24T01:09:56.6741171Z                 UInt64   11.9±0ms 
2022-11-24T01:09:56.6741481Z                 UInt32   12.8±0ms 
2022-11-24T01:09:56.6741774Z                 UInt16   11.6±0ms 
2022-11-24T01:09:56.6742021Z                 UInt8     failed  
2022-11-24T01:09:56.6742334Z                Float64   12.6±0ms 
2022-11-24T01:09:56.6742655Z                Float32   13.1±0ms 
2022-11-24T01:09:56.6742899Z               ========= ==========

@phofl
Copy link
Member Author

phofl commented Nov 28, 2022

Oh interesting, I’ll rerun them locally to check, they were passing at some point

@phofl
Copy link
Member Author

phofl commented Nov 28, 2022

overflow error, removed them since a useful test is not possible

@phofl
Copy link
Member Author

phofl commented Nov 28, 2022

       before           after         ratio
     [14dc069f]       [38d558c6]
-      5.32±0.2ms      2.44±0.04ms     0.46  join_merge.MergeEA.time_merge('UInt64')
-      5.23±0.3ms      2.26±0.06ms     0.43  join_merge.MergeEA.time_merge('Float64')
-     5.38±0.05ms      2.25±0.02ms     0.42  join_merge.MergeEA.time_merge('UInt32')
-      5.32±0.3ms      2.19±0.03ms     0.41  join_merge.MergeEA.time_merge('Float32')
-      5.55±0.5ms      2.28±0.08ms     0.41  join_merge.MergeEA.time_merge('Int64')
-     5.39±0.08ms      2.15±0.01ms     0.40  join_merge.MergeEA.time_merge('UInt16')
-      5.51±0.5ms      2.07±0.02ms     0.38  join_merge.MergeEA.time_merge('Int32')
-      5.70±0.5ms      2.06±0.04ms     0.36  join_merge.MergeEA.time_merge('Int16')

@mroeschke mroeschke added this to the 2.0 milestone Nov 29, 2022
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM merge when ready @WillAyd

@WillAyd WillAyd merged commit c09ac01 into pandas-dev:main Nov 29, 2022
@WillAyd
Copy link
Member

WillAyd commented Nov 29, 2022

Thanks @phofl

@phofl
Copy link
Member Author

phofl commented Nov 29, 2022

Thx, will open a pr to refactor tomorrow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement NA - MaskedArrays Related to pd.NA and nullable extension arrays Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: unnecessary casting in merge
3 participants