-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: transform and filter misbehave when grouping on categorical data #9994
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
names = result.columns | ||
result = obj.merge(result, how='outer', left_index=True, right_index=True).iloc[:,-result.shape[1]:] | ||
result.columns = names | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section was almost unreachable (see the conditions above). In fact, it's no longer reached by test case that motivated it. And it's still not locked down enough: it can only work correctly when the initial groupby
was on the index, which transform
doesn't know. I have an idea on how to make something like this work in the general case, but I'll wait until this PR goes through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably ; pls run a perf test to validate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is about a 25% slowdown in some cases:
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
groupby_transform_series2 | 124.4840 | 124.6049 | 0.9990 |
groupby_transform | 152.5413 | 151.6960 | 1.0056 |
groupby_transform_series | 20.0484 | 19.6927 | 1.0181 |
groupby_transform_ufunc | 112.1867 | 110.1057 | 1.0189 |
groupby_transform_multi_key4 | 156.6887 | 138.9330 | 1.1278 |
groupby_transform_multi_key3 | 918.0430 | 758.8653 | 1.2098 |
groupby_transform_multi_key2 | 57.8043 | 46.5036 | 1.2430 |
groupby_transform_multi_key1 | 87.1673 | 69.4257 | 1.2555 |
but that's because of the rest of the change, not this particular piece. (Because anything with a MultiIndex already takes the slow path). The groupby_transform_ufunc
test is the relevant one, and there's no change there.
This piece of the change also fixes GH #9700. I'll add a test for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok if u can profile and see if u can fix would be gr8
That small change fixed most of the performance difference:
|
@@ -3126,8 +3128,8 @@ def filter(self, func, dropna=True, *args, **kwargs): | |||
# interpret the result of the filter | |||
if (isinstance(res, (bool, np.bool_)) or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change to is_bool_dtype
lgtm. pls squash, and ping when green. |
Conditional is cleaned up, squashed, and tests are green. |
BUG: transform and filter misbehave when grouping on categorical data
@evanpw thanks! |
Fixes GH #9921
closes #9700