Skip to content

BUG: Incorrect index shape when using a user-defined function for aggregating a grouped series with object-typed index. #40835

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 28 commits into from
Apr 15, 2021
Merged
Show file tree
Hide file tree
Changes from 15 commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
5ded786
Add failing test.
DriesSchaumont Apr 2, 2021
30cd16e
Merge remote-tracking branch 'upstream/master' into fix-40014
DriesSchaumont Apr 7, 2021
e97b413
Merge remote-tracking branch 'upstream/master' into fix-40014
DriesSchaumont Apr 8, 2021
c5e1cf7
Merge remote-tracking branch 'upstream/master' into fix-40014
DriesSchaumont Apr 8, 2021
9ccb324
Merge remote-tracking branch 'upstream/master' into fix-40014
DriesSchaumont Apr 8, 2021
04d61b5
Dont use fastpath for series whith object index.
DriesSchaumont Apr 8, 2021
d4d9fb5
Add comment.
DriesSchaumont Apr 9, 2021
307bbe5
Add whatsnew
DriesSchaumont Apr 9, 2021
ace5f81
Merge remote-tracking branch 'upstream/master' into fix-40014
DriesSchaumont Apr 9, 2021
bb7f09e
Move test.
DriesSchaumont Apr 9, 2021
4f31837
Check test expected result.
DriesSchaumont Apr 9, 2021
db9b29b
Move whatsnew
DriesSchaumont Apr 9, 2021
fa4291a
Merge remote-tracking branch 'upstream/master' into fix-40014
DriesSchaumont Apr 9, 2021
b0dfbcd
Adjustment for review
DriesSchaumont Apr 9, 2021
4ddcc12
Merge remote-tracking branch 'upstream/master' into fix-40014
DriesSchaumont Apr 9, 2021
a21be6f
New solution.
DriesSchaumont Apr 12, 2021
7a4a793
Merge remote-tracking branch 'upstream/master' into fix-40014
DriesSchaumont Apr 12, 2021
4bf3f20
Fix len error.
DriesSchaumont Apr 12, 2021
0c16e6c
Fix index caching.
DriesSchaumont Apr 12, 2021
f457dff
Merge remote-tracking branch 'upstream/master' into fix-40014
DriesSchaumont Apr 12, 2021
3eb7b79
Fix index caching (2)
DriesSchaumont Apr 13, 2021
d608b5b
Merge remote-tracking branch 'upstream/master' into fix-40014
DriesSchaumont Apr 13, 2021
86db870
Styling and remove stray print statement
DriesSchaumont Apr 13, 2021
ded8433
Adjustments for review
DriesSchaumont Apr 13, 2021
24a1344
Merge remote-tracking branch 'upstream/master' into fix-40014
DriesSchaumont Apr 13, 2021
5cca8d2
Merge remote-tracking branch 'upstream/master' into fix-40014
DriesSchaumont Apr 13, 2021
337edbd
Merge remote-tracking branch 'upstream/master' into fix-40014
DriesSchaumont Apr 14, 2021
b89eee0
Add issue number, change almost_equal to assert_equal
DriesSchaumont Apr 14, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -780,6 +780,7 @@ Groupby/resample/rolling
- :class:`core.window.ewm.ExponentialMovingWindow` now raises a ``NotImplementedError`` when specifying ``times`` with ``adjust=False`` due to an incorrect calculation (:issue:`40098`)
- Bug in :meth:`Series.asfreq` and :meth:`DataFrame.asfreq` dropping rows when the index is not sorted (:issue:`39805`)
- Bug in aggregation functions for :class:`DataFrame` not respecting ``numeric_only`` argument when ``level`` keyword was given (:issue:`40660`)
- Bug in groupby aggregation when using an object-typed :class:`Index` (issue:`40014`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you be more specific here? as a user i am not sure what i am to do with this statement

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a 'for example df.groupby(....)......` or similar, in particular this is only for a user supplied function (e.g. via .apply right)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted, hope this is more clear?

- Bug in :class:`core.window.RollingGroupby` where ``as_index=False`` argument in ``groupby`` was ignored (:issue:`39433`)

Reshaping
Expand Down
8 changes: 6 additions & 2 deletions pandas/core/groupby/ops.py
Original file line number Diff line number Diff line change
Expand Up @@ -738,8 +738,12 @@ def agg_series(self, obj: Series, func: F):
# TODO: can we get a performant workaround for EAs backed by ndarray?
return self._aggregate_series_pure_python(obj, func)

elif obj.index._has_complex_internals:
# Preempt TypeError in _aggregate_series_fast
elif obj.index._has_complex_internals or obj.index.dtype == "object":
# (complex internals): Preempt TypeError in _aggregate_series_fast
# (object index dtype): _aggregate_series_fast raises TypeError
# when applying func because the group indiced become malformatted:
# correct indices are in group.index._index_data, but not in
# group.index._data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On master, replacing the lamba in the linked issue with lambda x: 1 does not raise and gives the correct answer. It seems to me that just having an object dtype is not root cause here, and we will be hurting performance for this special case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I will keep looking for the root cause and try to provide a fix. I will mark this PR as a draft in the meantime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach I have updated this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Can you just check on my example in the issue and report back? Do you understand why the "> 0" was causing it to fail? It's still not clear to me from the diff in this PR.

I think what happens here is that x.gt(0) calls _cmp_method and constructs a new Series through _construct_result. Note that the index length check happens when a new Series is constructed. x.sum() on the other hand does not construct a new series. The return type is numpy.int64. So it seems that using any function that constructs a new Series will fail, while the other will not. But the index state in the user-defined function is always corrupt.
A good test would be to use

import pandas.core.common as com

def foo(x):
    com.require_length_match(x.values, x.index)
    return x

return self._aggregate_series_pure_python(obj, func)

try:
Expand Down
13 changes: 13 additions & 0 deletions pandas/tests/groupby/aggregate/test_aggregate.py
Original file line number Diff line number Diff line change
Expand Up @@ -1221,3 +1221,16 @@ def test_aggregate_numeric_object_dtype():
{"key": ["A", "B"], "col1": ["a", "c"], "col2": [0, 2]}
).set_index("key")
tm.assert_frame_equal(result, expected)


def test_groupby_index_object_dtype():
# GH 40014
df = DataFrame({"c0": ["x", "x", "x"], "c1": ["x", "x", "y"], "p": [0, 1, 2]})
df.index = df.index.astype("O")
grouped = df.groupby(["c0", "c1"])
res = grouped.p.agg(lambda x: all(x > 0))
expected_index = MultiIndex.from_tuples(
[("x", "x"), ("x", "y")], names=("c0", "c1")
)
expected = Series([False, True], index=expected_index, name="p")
tm.assert_series_equal(res, expected)