-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: GroupBy aggregation of DataFrame with MultiIndex columns breaks with custom function #32040
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c4ea34e
to
4556c63
Compare
pandas/core/groupby/generic.py
Outdated
@@ -968,7 +968,8 @@ def aggregate(self, func=None, *args, **kwargs): | |||
result = self._aggregate_frame(func) | |||
else: | |||
result.columns = Index( | |||
result.columns.levels[0], name=self._selected_obj.columns.name | |||
[i[:-1] if len(i) > 2 else i[0] for i in result.columns], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are you actaully trying to do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Currently, when we get here, if we have
>>> result.columns
MultiIndex([('B', '<lambda>'),
('C', '<lambda>'),
('D', '<lambda>')],
)
then the current line of code on master
ignores the last level:
>>> Index(result.columns.levels[0], name=self._selected_obj.columns.name)
Index(['B', 'C', 'D'], dtype='object')
However, if (as in the new test I've added) we have
>>> result.columns
MultiIndex([(1, 3, '<lambda>'),
(1, 4, '<lambda>'),
(2, 3, '<lambda>'),
(2, 4, '<lambda>')],
)
then the current line of code on master
will only take the first level:
>>> Index(result.columns.levels[0], name=self._selected_obj.columns.name)
Int64Index([1, 2], dtype='int64')
However, what we need is
MultiIndex([(1, 3),
(1, 4),
(2, 3),
(2, 4)],
)
So, that's what I'm trying to do here - if there's only two levels, return the first one as an Index (as is currently done on master
), and if there's more levels, return all but the last one as a MultiIndex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe misreading but this entire expression seems strange. Why would master arbitrarily select the first column index level from the caller?
I think there's a more general fix to be had here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would master arbitrarily select the first column index level from the caller?
I think the idea is to select everything except for the last level, which is the one containing the name of the function e.g. '<lambda>'
I think there's a more general fix to be had here
Seems plausible :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd @jreback thanks for your reviews - have found a cleaner way to write this:
result.columns = result.columns.rename(
[self._selected_obj.columns.name] * result.columns.nlevels
).droplevel(-1)
Before I update the tests according the tests, could you let me know if this seems like the correct patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you not just copy the columns and drop level thereafter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd not sure I understand what you mean - as in,
result.columns.droplevel(-1).rename(self._selected_obj.columns.name)
?
The problem is that then, if we have
(Pdb) result.columns
MultiIndex([('B', '<lambda>'),
('C', '<lambda>'),
('D', '<lambda>')],
)
then
(Pdb) result.columns.droplevel(-1)
Index(['B', 'C', 'D'], dtype='object')
but, if we have
(Pdb) result.columns
MultiIndex([(1, 3, '<lambda>'),
(1, 4, '<lambda>'),
(2, 3, '<lambda>'),
(2, 4, '<lambda>')],
)
then
(Pdb) result.columns.droplevel(-1)
MultiIndex([(1, 3),
(1, 4),
(2, 3),
(2, 4)],
)
In the first case, renaming can be done with e.g. a string:
result.columns.droplevel(-1).rename(self._selected_obj.columns.name)
but in the second case, it needs to be done with a list of strings:
result.columns.droplevel(-1).rename([self._selected_obj.columns.name]*2)
pandas/core/groupby/generic.py
Outdated
@@ -968,7 +968,8 @@ def aggregate(self, func=None, *args, **kwargs): | |||
result = self._aggregate_frame(func) | |||
else: | |||
result.columns = Index( | |||
result.columns.levels[0], name=self._selected_obj.columns.name | |||
[i[:-1] if len(i) > 2 else i[0] for i in result.columns], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe misreading but this entire expression seems strange. Why would master arbitrarily select the first column index level from the caller?
I think there's a more general fix to be had here
bef96e4
to
52adefd
Compare
afc94b7
to
093bcd5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grp = df.groupby(np.r_[np.zeros(2), np.ones(2)]) | ||
result = grp.agg(func) | ||
expected_keys = [(1, 3), (1, 4), (2, 3), (2, 4)] | ||
expected = pd.DataFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than parametrize this test can you just construct the expectation from literal values? The parametrized values themselves don't differ enough to be useful but make it tougher to understand what is going on here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd comment plus a questions, looks ok otherwise.
pls merge master as well. |
@WillAyd @jreback thanks for your reviews.
Sure, done. Simplified test a bit, but it still reproduces the error:
Full output
Always. E.g. if our columns are multiindices: if we run
then when we get here,
E.g. if our columns are not multiindices: if we run
then when we get here,
Furthermore, I tried adding in
and all the tests in |
@MarcoGorelli there's a merge conflict in 1.0.2.rst if you could update. |
@TomAugspurger sure, done (I presume the CI failure is unrelated?) |
thanks @MarcoGorelli |
@meeseeksdev backport 1.0.x |
…ith MultiIndex columns breaks with custom function
Something went wrong ... Please have a look at my logs. |
…ndex columns breaks with custom function (#32648) Co-authored-by: Marco Gorelli <[email protected]>
…with custom function (pandas-dev#32040)
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff