Speed up StringDtype arrow implementation for merge #54510

phofl · 2023-08-12T07:35:54Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

sorry @jbrockmendel :) I promise I'll work on an EA implementation after that string dtype with NumPy semantics

jbrockmendel · 2023-08-12T07:41:26Z

pandas/core/reshape/merge.py

@@ -76,6 +76,7 @@
    na_value_for_dtype,
 )

+import pandas as pd


can we avoid this style of import

Sorry that shouldn't have gotten in

jbrockmendel · 2023-08-12T07:42:07Z

pandas/core/reshape/merge.py

@@ -2407,13 +2408,20 @@ def _factorize_keys(
                or is_string_dtype(lk.dtype)
                and not sort
            )
+            or is_string_dtype(lk.dtype)
+            and lk.dtype.storage == "pyarrow"


nitpick: can you put parens around these so I don't have to think about or/and ordering

Added for both
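
For context on the parens request: Python's `and` binds more tightly than `or`, so an unparenthesized `a or b and c` is evaluated as `a or (b and c)`. A minimal sketch with placeholder booleans (the variable names are illustrative and do not come from merge.py):

```python
# Placeholder flags standing in for the dtype checks in the diff (hypothetical names).
is_arrow = True
is_string = False
is_pyarrow_storage = False

# `and` binds tighter than `or`, so these two expressions are equivalent.
implicit = is_arrow or is_string and is_pyarrow_storage
explicit = is_arrow or (is_string and is_pyarrow_storage)

# Grouping the other way gives a different result for the same inputs.
regrouped = (is_arrow or is_string) and is_pyarrow_storage

print(implicit, explicit, regrouped)
```

The explicit parentheses do not change behavior; they only make the grouping visible to the reader.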

jbrockmendel · 2023-08-12T07:42:21Z

pandas/core/reshape/merge.py

+            isinstance(lk.dtype, ArrowDtype)
+            and is_string_dtype(lk.dtype)
+            or isinstance(lk.dtype, pd.StringDtype)
+            and lk.dtype.storage == "pyarrow"


same parens request

jbrockmendel · 2023-08-16T16:46:15Z

pandas/core/reshape/merge.py

@@ -2407,13 +2407,16 @@ def _factorize_keys(
                or is_string_dtype(lk.dtype)
                and not sort
            )
+            or (is_string_dtype(lk.dtype) and lk.dtype.storage == "pyarrow")  # type: ignore[attr-defined]  # noqa: E501


would using isinstance(lk.dtype, StringDtype) fix the mypy complaint? (and possibly be more robust to object dtype cases?)

You still have to check the storage (python strings...)

yes, but doing the isinstance check instead of is_string_dtype then makes the mypy complaint go away? (and excludes object dtype and is just better in general)

ah got you, done
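
A rough illustration of the point in this thread, assuming pandas and NumPy are installed (the variable names are made up for the example and this is not the code from the PR): `is_string_dtype` also accepts plain object dtype, which has no `storage` attribute, while the `isinstance` check narrows to `StringDtype`. The storage comparison is still needed to tell python-backed strings from pyarrow-backed ones.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_string_dtype

object_dtype = np.dtype(object)
python_backed = pd.StringDtype("python")

# is_string_dtype accepts plain object dtype, which has no `storage` attribute.
object_passes_is_string = is_string_dtype(object_dtype)
object_has_storage = hasattr(object_dtype, "storage")

# isinstance narrows to StringDtype, so `.storage` is known to exist.
object_is_string_dtype = isinstance(object_dtype, pd.StringDtype)
python_is_string_dtype = isinstance(python_backed, pd.StringDtype)

# The storage check is still required: python-backed strings are not the pyarrow case.
python_is_pyarrow = python_is_string_dtype and python_backed.storage == "pyarrow"

print(object_passes_is_string, object_has_storage, object_is_string_dtype, python_is_pyarrow)
```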

lithomas1 · 2023-08-21T17:41:16Z

Can you add a whatsnew?

@jbrockmendel @phofl
How close is this to being merged?
Should I move this off the milestone or is this fine to keep?

phofl · 2023-08-21T17:56:31Z

There's already a whatsnew from the previous PR, so this should be ready

cc @jbrockmendel

jbrockmendel · 2023-08-21T19:58:33Z

pandas/core/reshape/merge.py

@@ -2407,13 +2408,16 @@ def _factorize_keys(
                or is_string_dtype(lk.dtype)
                and not sort
            )
+            or (isinstance(lk.dtype, StringDtype) and lk.dtype.storage == "pyarrow")


the compound if statement here is non-obvious. can it be de-compounded by moving the pyarrow/string case up above this one?

Yeah not a problem.
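
As a generic sketch of what "de-compounding" can look like (the function names, parameters, and return labels are hypothetical and do not reflect the actual structure of `_factorize_keys`): a case that was appended to a longer `or` chain is moved into its own branch ahead of the existing condition, without changing the outcome.

```python
from itertools import product


def route_compound(is_pyarrow_string: bool, existing_case: bool) -> str:
    # One condition covers two unrelated cases, so the reader has to untangle it.
    if existing_case or is_pyarrow_string:
        return "fast path"
    return "fallback"


def route_split(is_pyarrow_string: bool, existing_case: bool) -> str:
    # The pyarrow/string case is checked first, in its own branch.
    if is_pyarrow_string:
        return "fast path"
    # The pre-existing condition stays as it was.
    if existing_case:
        return "fast path"
    return "fallback"


# Both versions agree for every combination of inputs.
same_everywhere = all(
    route_compound(a, b) == route_split(a, b)
    for a, b in product([True, False], repeat=2)
)
print(same_everywhere)
```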

Good to merge then?

do we have tests that hit the affected cases? if so then yes.

Yep, test_merge_string_dtype for example

…entation for merge) (#54689) Backport PR #54510: Speed up StringDtype arrow implementation for merge Co-authored-by: Patrick Hoefler <[email protected]>

Speed up StringDtype arrow implementation

2b96945

phofl added this to the 2.1 milestone Aug 12, 2023

phofl added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 12, 2023

jbrockmendel reviewed Aug 12, 2023

phofl added 2 commits August 12, 2023 10:30

Fixups

c05918e

Fixups

97cbdbb

phofl changed the title ~~Speed up StringDtype arrow implementation~~ Speed up StringDtype arrow implementation for merge Aug 12, 2023

jbrockmendel reviewed Aug 16, 2023

Update

f592c48

jbrockmendel reviewed Aug 21, 2023

Update

b0b9962

phofl merged commit 1845699 into pandas-dev:main Aug 22, 2023

phofl deleted the stringdtype_merge branch August 22, 2023 15:05

meeseeksmachine mentioned this pull request Aug 22, 2023

Backport PR #54510 on branch 2.1.x (Speed up StringDtype arrow implementation for merge) #54689

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 22, 2023

Backport PR pandas-dev#54510: Speed up StringDtype arrow implementati…

1dae2d7

…on for merge
