-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Incorrect index shape when using a user-defined function for aggregating a grouped series with object-typed index. #40835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
DriesSchaumont
commented
Apr 8, 2021
•
edited
Loading
edited
- closes BUG: "groupby" failed for dataframe with object-type index #40014
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pr! Some quick comments, plus can you please add a whatsnew?
Done. Added whatsnew targetting 1.2.4, but let me know if this is not the case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comment
pandas/core/groupby/ops.py
Outdated
# (complex internals): Preempt TypeError in _aggregate_series_fast | ||
# (object index dtype): _aggregate_series_fast raises TypeError | ||
# when applying func because the group indiced become malformatted: | ||
# correct indices are in group.index._index_data, but not in | ||
# group.index._data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On master, replacing the lamba in the linked issue with lambda x: 1
does not raise and gives the correct answer. It seems to me that just having an object dtype is not root cause here, and we will be hurting performance for this special case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I will keep looking for the root cause and try to provide a fix. I will mark this PR as a draft in the meantime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach I have updated this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Can you just check on my example in the issue and report back? Do you understand why the "> 0" was causing it to fail? It's still not clear to me from the diff in this PR.
I think what happens here is that x.gt(0)
calls _cmp_method
and constructs a new Series
through _construct_result
. Note that the index length check happens when a new Series
is constructed. x.sum()
on the other hand does not construct a new series. The return type is numpy.int64
. So it seems that using any function that constructs a new Series
will fail, while the other will not. But the index state in the user-defined function is always corrupt.
A good test would be to use
import pandas.core.common as com
def foo(x):
com.require_length_match(x.values, x.index)
return x
52379a8
to
86db870
Compare
I am opening this PR for review again. I have added another test, not sure on which one to keep. Open to any suggestions. |
doc/source/whatsnew/v1.3.0.rst
Outdated
@@ -781,6 +781,7 @@ Groupby/resample/rolling | |||
- :class:`core.window.ewm.ExponentialMovingWindow` now raises a ``NotImplementedError`` when specifying ``times`` with ``adjust=False`` due to an incorrect calculation (:issue:`40098`) | |||
- Bug in :meth:`Series.asfreq` and :meth:`DataFrame.asfreq` dropping rows when the index is not sorted (:issue:`39805`) | |||
- Bug in aggregation functions for :class:`DataFrame` not respecting ``numeric_only`` argument when ``level`` keyword was given (:issue:`40660`) | |||
- Bug in groupby aggregation when using an object-typed :class:`Index` (issue:`40014`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you be more specific here? as a user i am not sure what i am to do with this statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a 'for example df.groupby(....)......` or similar, in particular this is only for a user supplied function (e.g. via .apply right)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusted, hope this is more clear?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Can you just check on my example in the issue and report back? Do you understand why the "> 0" was causing it to fail? It's still not clear to me from the diff in this PR.
I addressed your question here: #40014 (comment) |
lgtm. cc @jbrockmendel if any comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@rhshadrach over to you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @DriesSchaumont, very nice! |
…regating a grouped series with object-typed index. (pandas-dev#40835)
…regating a grouped series with object-typed index. (pandas-dev#40835)
…regating a grouped series with object-typed index. (pandas-dev#40835)