-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: groupby with string dtype #43634
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
debnathshoham
commented
Sep 18, 2021
•
edited by simonjayhawkins
Loading
edited by simonjayhawkins
- closes PERF: groupby performance regression in 1.2.x #41596
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
does this also fix the case mentioned in #41596 (comment)? if so can you post the timings for that too for df_str.groupby('a')[cols[1:]].last() on master and with this PR |
Hi @simonjayhawkins - no this PR doesn't address the performance slowdown in the Also note, that I have changed the size of the |
That's okay, but we can't close that issue. will need to open a new one specifically for that case, (regression between 1.2.5 and 1.3) or change the OP to read "xref" instead of "closes" out of curiosity, did you identify which commit caused the slowdown between 1.1.5 and 1.2? |
Updated the top comment. |
So why do we need to special case an specific EA dtype? (Maybe this was done before). In general EA handling should be more generic. |
I think the current implementation of I can see a TODO for a more generic approach. |
yes, it appears that there is intent to make |
pandas/core/groupby/ops.py
Outdated
@@ -348,6 +349,9 @@ def _ea_wrap_cython_operation( | |||
elif isinstance(values.dtype, FloatingDtype): | |||
# FloatingArray | |||
npvalues = values.to_numpy(values.dtype.numpy_dtype, na_value=np.nan) | |||
elif isinstance(values.dtype, StringDtype) and self.how in ["last", "first"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there other functions (e.g. .sum) as well (or just have all string functions hit this path)? do we have sufficient asv's to cover this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, other functions do hit this path.
asv coverage is little spotty, i was not able to find anything with groupby and StringDtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, other functions do hit this path.
ok which ones? what if you remove the self.how entirely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't really like the self.how here, this is way too specific.
can you merge master to fix doc build #43688 |
changed back to "closes" xref #41596 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add asvs that hit this case
in the asvs |
also merge master as the deprecation warnings on asv's are silenced now |
seems still some CI / checks are failing, also make sure you merge |
thanks @debnathshoham very nice! |
This comment has been minimized.
This comment has been minimized.
@meeseeksdev backport 1.3.x |
This comment has been minimized.
This comment has been minimized.
@debnathshoham if you wouldn't mind following the procedure above to backport |