Skip to content

BUG: Incorrect index shape when using a user-defined function for aggregating a grouped series with object-typed index. #40835

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
Apr 15, 2021

Conversation

DriesSchaumont
Copy link
Member

@DriesSchaumont DriesSchaumont commented Apr 8, 2021

Copy link
Member

@mzeitlin11 mzeitlin11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pr! Some quick comments, plus can you please add a whatsnew?

@mzeitlin11 mzeitlin11 added Bug Groupby Index Related to the Index class or subclasses labels Apr 8, 2021
@DriesSchaumont
Copy link
Member Author

Thanks for the pr! Some quick comments, plus can you please add a whatsnew?

Done. Added whatsnew targetting 1.2.4, but let me know if this is not the case.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment

@jreback jreback added this to the 1.3 milestone Apr 9, 2021
Comment on lines 742 to 746
# (complex internals): Preempt TypeError in _aggregate_series_fast
# (object index dtype): _aggregate_series_fast raises TypeError
# when applying func because the group indiced become malformatted:
# correct indices are in group.index._index_data, but not in
# group.index._data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On master, replacing the lamba in the linked issue with lambda x: 1 does not raise and gives the correct answer. It seems to me that just having an object dtype is not root cause here, and we will be hurting performance for this special case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I will keep looking for the root cause and try to provide a fix. I will mark this PR as a draft in the meantime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach I have updated this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Can you just check on my example in the issue and report back? Do you understand why the "> 0" was causing it to fail? It's still not clear to me from the diff in this PR.

I think what happens here is that x.gt(0) calls _cmp_method and constructs a new Series through _construct_result. Note that the index length check happens when a new Series is constructed. x.sum() on the other hand does not construct a new series. The return type is numpy.int64. So it seems that using any function that constructs a new Series will fail, while the other will not. But the index state in the user-defined function is always corrupt.
A good test would be to use

import pandas.core.common as com

def foo(x):
    com.require_length_match(x.values, x.index)
    return x

@DriesSchaumont DriesSchaumont marked this pull request as draft April 12, 2021 09:23
@DriesSchaumont
Copy link
Member Author

I am opening this PR for review again. I have added another test, not sure on which one to keep. Open to any suggestions.

@DriesSchaumont DriesSchaumont marked this pull request as ready for review April 13, 2021 10:46
@@ -781,6 +781,7 @@ Groupby/resample/rolling
- :class:`core.window.ewm.ExponentialMovingWindow` now raises a ``NotImplementedError`` when specifying ``times`` with ``adjust=False`` due to an incorrect calculation (:issue:`40098`)
- Bug in :meth:`Series.asfreq` and :meth:`DataFrame.asfreq` dropping rows when the index is not sorted (:issue:`39805`)
- Bug in aggregation functions for :class:`DataFrame` not respecting ``numeric_only`` argument when ``level`` keyword was given (:issue:`40660`)
- Bug in groupby aggregation when using an object-typed :class:`Index` (issue:`40014`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you be more specific here? as a user i am not sure what i am to do with this statement

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a 'for example df.groupby(....)......` or similar, in particular this is only for a user supplied function (e.g. via .apply right)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted, hope this is more clear?

@DriesSchaumont DriesSchaumont changed the title BUG: Do not use fast path to aggregate series with boolean-type index. BUG: Incorrect index shape when using a user-defined function for aggregating a grouped series with object-typed index. Apr 13, 2021
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Can you just check on my example in the issue and report back? Do you understand why the "> 0" was causing it to fail? It's still not clear to me from the diff in this PR.

@DriesSchaumont
Copy link
Member Author

This looks great! Can you just check on my example in the issue and report back? Do you understand why the "> 0" was causing it to fail? It's still not clear to me from the diff in this PR.

I addressed your question here: #40014 (comment)

@jreback
Copy link
Contributor

jreback commented Apr 14, 2021

lgtm. cc @jbrockmendel if any comments

Copy link
Member

@jbrockmendel jbrockmendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jreback
Copy link
Contributor

jreback commented Apr 14, 2021

@rhshadrach over to you

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach merged commit a25198c into pandas-dev:master Apr 15, 2021
@rhshadrach
Copy link
Member

Thanks @DriesSchaumont, very nice!

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request Apr 21, 2021
yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: "groupby" failed for dataframe with object-type index
5 participants