-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Speed up StringDtype arrow implementation for merge #54510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/reshape/merge.py
Outdated
@@ -76,6 +76,7 @@ | |||
na_value_for_dtype, | |||
) | |||
|
|||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we avoid this style of import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that shouldn't have gotten in
pandas/core/reshape/merge.py
Outdated
@@ -2407,13 +2408,20 @@ def _factorize_keys( | |||
or is_string_dtype(lk.dtype) | |||
and not sort | |||
) | |||
or is_string_dtype(lk.dtype) | |||
and lk.dtype.storage == "pyarrow" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick can you put parens around the these so i dont have to think about or/and ordering
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added for both
pandas/core/reshape/merge.py
Outdated
isinstance(lk.dtype, ArrowDtype) | ||
and is_string_dtype(lk.dtype) | ||
or isinstance(lk.dtype, pd.StringDtype) | ||
and lk.dtype.storage == "pyarrow" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same parens request
pandas/core/reshape/merge.py
Outdated
@@ -2407,13 +2407,16 @@ def _factorize_keys( | |||
or is_string_dtype(lk.dtype) | |||
and not sort | |||
) | |||
or (is_string_dtype(lk.dtype) and lk.dtype.storage == "pyarrow") # type: ignore[attr-defined] # noqa: E501 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would using isinstance(lk.dtype, StringDtype)
fix the mypy complaint? (and possibly be more robust to object dtype cases?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You still have to check the storage (python strings...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes but doing the isinstace check instead of is_string_dtype then makes the mypy complaint go away? (and excludes object dtype and is just better in general)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah got you, done
Can you add a whatsnew? @jbrockmendel @phofl |
There already a whatsnew from the previous pr, this should be ready |
pandas/core/reshape/merge.py
Outdated
@@ -2407,13 +2408,16 @@ def _factorize_keys( | |||
or is_string_dtype(lk.dtype) | |||
and not sort | |||
) | |||
or (isinstance(lk.dtype, StringDtype) and lk.dtype.storage == "pyarrow") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the compound if statement here is non-obvious. can it be de-compounded by moving the pyarrow/string case up above this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah not a problem.
Good to merge then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have tests that hit the affected cases? if so then yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, test_merge_string_dtype
for example
…entation for merge) (#54689) Backport PR #54510: Speed up StringDtype arrow implementation for merge Co-authored-by: Patrick Hoefler <[email protected]>
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.sorry @jbrockmendel :) I promise I'll work on an EA implementation after that string dtype with NumPy semantics