-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Groupby.apply wasn't allowing for functions which return lists #31456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1d1904f
to
ed45390
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think also need to run some perf checks to make sure this isn't super expensive.
pandas/_libs/reduction.pyx
Outdated
@@ -499,7 +499,9 @@ def apply_frame_axis0(object frame, object f, object names, | |||
# `piece` might not have an index, could be e.g. an int | |||
pass | |||
|
|||
if not is_scalar(piece): | |||
if isinstance(piece, list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not the correct patch,
instead lets change line 507 (in revised code) to be
piece = deepcopy(piece)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not the correct patch,
instead lets change line 507 (in revised code) to be
piece = deepcopy(piece)
This causes some failures, e.g.
$ pytest pandas/tests/groupby/test_groupby.py::test_no_mutate_but_looks_like
which doesn't fail with
if isinstance(piece, list):
piece = deepcopy(piece)
elif not is_scalar(piece):
# Need to copy data to avoid appending references
if hasattr(piece, "copy"):
piece = piece.copy(deep='all')
else:
piece = copy(piece)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcoGorelli what is the difference that this causes in that test? Would it return one of the results completely as a list instead of exploding to a series?
Returning a list from a groupby op is thorny to begin with so I think should streamline logic as much as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd here's the difference:
>>> df = DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})
>>> result1 = df.groupby("key", group_keys=True).apply(lambda x: x[:].key)
>>> result2 = df.groupby("key", group_keys=True).apply(lambda x: x.key)
>>> result1
key
1 0 1
1 1
2 1
2 3 2
4 2
5 2
3 6 3
7 3
8 3
>>> result2
key 0 1 2
key
1 1 1 1
2 2 2 2
3 3 3 3
Returning a list from a groupby op is thorny to begin with so I think should streamline logic as much as possible
OK, I'll think of a workaround (but I realise that you'd like to get a v1.0.1 release out very soon so if you think this'll take me too long please do feel free to take over)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm somewhat confused here. If I use this, the test case passes:
pd._testing.assert_series_equal(deepcopy(piece), piece.copy(deep="all"))
piece = piece.copy(deep="all")
but if I use this, it doesn't!
pd._testing.assert_series_equal(deepcopy(piece), piece.copy(deep="all"))
piece = deepcopy(piece)
will look into it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsaxton see here - I share your puzzlement :) Will look into Series.copy
tomorrow
7aa36d9
to
6369f76
Compare
NOTEThe following refers to a commit with line 507 (in revised code) changed to:
(so, without the
Sure, here's the output from a run I did:
EDITseems like there's some test failures though |
Performance results with current commit:
|
6369f76
to
8ede846
Compare
pandas/_libs/reduction.pyx
Outdated
@@ -499,7 +499,9 @@ def apply_frame_axis0(object frame, object f, object names, | |||
# `piece` might not have an index, could be e.g. an int | |||
pass | |||
|
|||
if not is_scalar(piece): | |||
if isinstance(piece, list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to special case lists specifically here? How about special casing only those times when copy
is known to have the deep
keyword for piece.copy(deep="all")
and letting everything else use a deepcopy
? (eg., something like df.groupby("a").apply(lambda x: set(x.index))
would still raise here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about special casing only those times when copy is known to have the deep keyword
How - by using inspect
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had been thinking checking specifically for Series, DataFrame, or Index and trying to deepcopy
everything else, but there may be a "prettier" way of approaching it. I remember getting the exact same test failure you're seeing now and am still puzzled by it.
8ede846
to
a198cc8
Compare
pandas/tests/groupby/test_apply.py
Outdated
def test_apply_function_returns_list(): | ||
# GH 31441 | ||
df = pd.DataFrame(["A", "A", "B", "B"], columns=["groups"]) | ||
result = df.groupby("groups").apply(lambda x: x.index.to_list()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to add tests for other return types as well (e.g., set or tuple)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good suggestion, thanks!
@pytest.mark.parametrize( | ||
"function, expected_values", | ||
[ | ||
(lambda x: x.index.to_list(), [[0, 1], [2, 3]]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to skip the conversion to list in the remaining test cases
thanks @MarcoGorelli |
…functions which return lists
…which return lists (#31541) Co-authored-by: Marco Gorelli <[email protected]>
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff