Skip to content

BUG: transform and filter misbehave when grouping on categorical data #9994

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 29, 2015

Conversation

evanpw
Copy link
Contributor

@evanpw evanpw commented Apr 26, 2015

Fixes GH #9921
closes #9700

names = result.columns
result = obj.merge(result, how='outer', left_index=True, right_index=True).iloc[:,-result.shape[1]:]
result.columns = names
return result
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section was almost unreachable (see the conditions above). In fact, it's no longer reached by test case that motivated it. And it's still not locked down enough: it can only work correctly when the initial groupby was on the index, which transform doesn't know. I have an idea on how to make something like this work in the general case, but I'll wait until this PR goes through.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably ; pls run a perf test to validate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is about a 25% slowdown in some cases:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_transform_series2                    | 124.4840 | 124.6049 |   0.9990 |
groupby_transform                            | 152.5413 | 151.6960 |   1.0056 |
groupby_transform_series                     |  20.0484 |  19.6927 |   1.0181 |
groupby_transform_ufunc                      | 112.1867 | 110.1057 |   1.0189 |
groupby_transform_multi_key4                 | 156.6887 | 138.9330 |   1.1278 |
groupby_transform_multi_key3                 | 918.0430 | 758.8653 |   1.2098 |
groupby_transform_multi_key2                 |  57.8043 |  46.5036 |   1.2430 |
groupby_transform_multi_key1                 |  87.1673 |  69.4257 |   1.2555 |

but that's because of the rest of the change, not this particular piece. (Because anything with a MultiIndex already takes the slow path). The groupby_transform_ufunc test is the relevant one, and there's no change there.

This piece of the change also fixes GH #9700. I'll add a test for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok if u can profile and see if u can fix would be gr8

@evanpw
Copy link
Contributor Author

evanpw commented Apr 27, 2015

That small change fixed most of the performance difference:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_transform_series2                    | 124.1800 | 124.5937 |   0.9967 |
groupby_transform_series                     |  19.8309 |  19.8836 |   0.9974 |
groupby_transform                            | 151.5836 | 151.8770 |   0.9981 |
groupby_transform_ufunc                      | 112.3510 | 110.2444 |   1.0191 |
groupby_transform_multi_key4                 | 143.6467 | 137.7186 |   1.0430 |
groupby_transform_multi_key3                 | 802.0917 | 753.2163 |   1.0649 |
groupby_transform_multi_key2                 |  49.4637 |  46.3487 |   1.0672 |
groupby_transform_multi_key1                 |  73.4100 |  68.7597 |   1.0676 |

@jreback jreback added Bug Categorical Categorical Data Type labels Apr 28, 2015
@jreback jreback added this to the 0.16.1 milestone Apr 28, 2015
@@ -3126,8 +3128,8 @@ def filter(self, func, dropna=True, *args, **kwargs):
# interpret the result of the filter
if (isinstance(res, (bool, np.bool_)) or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to is_bool_dtype

@jreback
Copy link
Contributor

jreback commented Apr 28, 2015

lgtm. pls squash, and ping when green.

@evanpw
Copy link
Contributor Author

evanpw commented Apr 28, 2015

Conditional is cleaned up, squashed, and tests are green.

jreback added a commit that referenced this pull request Apr 29, 2015
BUG: transform and filter misbehave when grouping on categorical data
@jreback jreback merged commit c9d1ef9 into pandas-dev:master Apr 29, 2015
@jreback
Copy link
Contributor

jreback commented Apr 29, 2015

@evanpw thanks!

@evanpw evanpw deleted the issue_9921 branch June 10, 2015 13:41
@evanpw evanpw restored the issue_9921 branch September 19, 2015 00:34
@evanpw evanpw deleted the issue_9921 branch September 19, 2015 00:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: groupby.transform confused about index
2 participants