Skip to content

Speed up StringDtype arrow implementation for merge #54510

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Aug 22, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Aug 12, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

sorry @jbrockmendel :) I promise I'll work on an EA implementation after that string dtype with NumPy semantics

@phofl phofl added this to the 2.1 milestone Aug 12, 2023
@phofl phofl added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 12, 2023
@@ -76,6 +76,7 @@
na_value_for_dtype,
)

import pandas as pd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid this style of import

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that shouldn't have gotten in

@@ -2407,13 +2408,20 @@ def _factorize_keys(
or is_string_dtype(lk.dtype)
and not sort
)
or is_string_dtype(lk.dtype)
and lk.dtype.storage == "pyarrow"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick can you put parens around the these so i dont have to think about or/and ordering

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for both

isinstance(lk.dtype, ArrowDtype)
and is_string_dtype(lk.dtype)
or isinstance(lk.dtype, pd.StringDtype)
and lk.dtype.storage == "pyarrow"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same parens request

@phofl phofl changed the title Speed up StringDtype arrow implementation Speed up StringDtype arrow implementation for merge Aug 12, 2023
@@ -2407,13 +2407,16 @@ def _factorize_keys(
or is_string_dtype(lk.dtype)
and not sort
)
or (is_string_dtype(lk.dtype) and lk.dtype.storage == "pyarrow") # type: ignore[attr-defined] # noqa: E501
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would using isinstance(lk.dtype, StringDtype) fix the mypy complaint? (and possibly be more robust to object dtype cases?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still have to check the storage (python strings...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but doing the isinstace check instead of is_string_dtype then makes the mypy complaint go away? (and excludes object dtype and is just better in general)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah got you, done

@lithomas1
Copy link
Member

Can you add a whatsnew?

@jbrockmendel @phofl
How close is to being merged?
Should I move this off the milestone or is this fine to keep?

@phofl
Copy link
Member Author

phofl commented Aug 21, 2023

There already a whatsnew from the previous pr, this should be ready

cc @jbrockmendel

@@ -2407,13 +2408,16 @@ def _factorize_keys(
or is_string_dtype(lk.dtype)
and not sort
)
or (isinstance(lk.dtype, StringDtype) and lk.dtype.storage == "pyarrow")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the compound if statement here is non-obvious. can it be de-compounded by moving the pyarrow/string case up above this one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah not a problem.

Good to merge then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have tests that hit the affected cases? if so then yes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, test_merge_string_dtype for example

@phofl phofl merged commit 1845699 into pandas-dev:main Aug 22, 2023
@phofl phofl deleted the stringdtype_merge branch August 22, 2023 15:05
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 22, 2023
mroeschke pushed a commit that referenced this pull request Aug 22, 2023
…entation for merge) (#54689)

Backport PR #54510: Speed up StringDtype arrow implementation for merge

Co-authored-by: Patrick Hoefler <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants