[BUG] Fixed behavior of DataFrameGroupBy.apply to respect _group_selection_context #29131
Conversation
Stylistic comments; need to review in more detail later
Does this have any backwards compatibility impacts on the API?
pandas/tests/groupby/test_apply.py (outdated)

@@ -363,11 +363,19 @@ def f(group):
    tm.assert_frame_equal(result.loc[key], f(group))


@pytest.mark.parametrize("as_index", [False, True])
def test_apply_chunk_view():
Do you even need this in the test? Looks like the test itself just removes the True case.
Yes, there are impacts. The change to this section guarantees that the grouper column from …
This is a great change. The only thing we might need to be careful of is that users may have also relied on this buggy behavior in the past (to their credit, we did internally!)
To guard against that, can you add a section to the whatsnew for backwards-incompatible API changes that spells out the changes in more detail? I think it would be particularly worthwhile to point out the test case that previously worked but now raises an AttributeError.
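For reference, a minimal sketch of the kind of case being discussed, assuming the branch behavior described in this thread (the grouping column is no longer part of the frames passed to apply); the test name and data are made up and this has not been verified against any released pandas:

import pandas as pd
import pytest


def test_grouper_column_not_in_group_frames():
    # Hypothetical test: previously `g.a` worked inside the applied function
    # because the grouping column was included in each group frame; under the
    # behavior described in this PR it would raise instead.
    df = pd.DataFrame({"a": [1, 1, 2], "b": [4, 5, 6]})
    with pytest.raises(AttributeError):
        df.groupby("a").apply(lambda g: g.a)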
pandas/tests/groupby/test_groupby.py (outdated)

        return result1, result2

    if as_index:
        with pytest.raises(AttributeError):
Can you add a match= argument here? You should see that fairly often elsewhere in the code.
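For context, pytest.raises takes a match= argument that is checked against the exception message with re.search; a minimal sketch (the message fragment here is just an illustration):

import pytest


def test_raises_with_match():
    # match= verifies the message, not just the exception type
    with pytest.raises(AttributeError, match="has no attribute 'a'"):
        object().a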
Also this might be something that (while not previously correct) users would rely on; see top message on how best to handle
I took a stab at the whatsnew. I followed the pattern of another breaking change, but I have never done it before so I'm sure it will need tweaks.
will look soon
lgtm @jreback
Your change is OK, but I have many comments on the test changes.
Hmm, there is still something weird going on here. On this branch I now see the following behavior:

>>> df = pd.DataFrame({"a": [1, 1, 2, 2, 3, 3], "b": [1, 2, 3, 4, 5, 6]})
>>> df.groupby("a", as_index=False).shift()
     b
0  NaN
1  1.0
2  NaN
3  3.0
4  NaN
5  5.0
>>> df.groupby("a", as_index=False).apply(lambda x: x.shift())
     a    b
0  NaN  NaN
1  1.0  1.0
2  NaN  NaN
3  2.0  3.0
4  NaN  NaN
5  3.0  5.0

I would expect both of those outputs to be the same. Came across this looking at #13519.

We generally might need to step back and think about how the … That might be OK, but at the same time I don't think we want people to start doing …
@christopherzimmerman can you rebase?
@WillAyd I tracked down the issue with …
OK thanks for identifying that. I think OK as a follow up PR
Can you merge master?
lgtm @jreback
I am not fully convinced this is a "bug fix". The whole … Basically, that you can think of … which is no longer true (as the …).

For reducing functions I agree that you typically don't want the group column to be included in the grouped dataframe on which to apply the function. But in general, for custom functions that are not necessarily reducing functions, this might not necessarily be true.
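To make that distinction concrete, a small sketch (illustrative only; whether the "a" column is visible inside g is exactly the behavior under discussion in this PR):

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, 4]})

# Reducing function: one value per group; the group column is constant within
# each group, so excluding it from g is usually what you want here.
reduced = df.groupby("a").apply(lambda g: g["b"].sum())

# Non-reducing custom function: returns a frame shaped like the input; such a
# function may legitimately want to see the group column as part of g.
transformed = df.groupby("a").apply(lambda g: g.assign(frac=g["b"] / g["b"].sum()))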
Can you merge master and we'll take another look (and address @jorisvandenbossche's comments as well)?
@christopherzimmerman looks like CI is red and there are some merge conflicts; any chance you can fix those up?
@WillAyd I merged and fixed the failing tests. It seems like tests keep getting written that rely on the old behavior of apply.
@christopherzimmerman there is a lot here. Is it possible to do a precursor PR which adds the fixtures and locks down the current behavior? Then this PR will make it more obvious what exactly is changing.
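As an illustration of what such a precursor PR might add, a hedged sketch of an as_index fixture (the name and placement are hypothetical; pandas keeps shared fixtures in conftest.py files):

import pytest


@pytest.fixture(params=[True, False])
def as_index(request):
    # Tests that accept an `as_index` argument run once per value, locking
    # down current behavior for both groupby(..., as_index=...) modes.
    return request.param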
@@ -398,6 +398,81 @@ keywords.

    df.rename(index={0: 1}, columns={0: 2})


.. _whatsnew_1000.api_breaking.GroupBy.apply:
need to move to 1.1
.. code-block:: ipython

    In [1]: df = pd.DataFrame({"a": [1, 1, 2, 2, 3, 3], "b": [1, 2, 3, 4, 5, 6]})
show df here
       ...: columns=['a', 'b'])

    In [6]: df.iloc[2, 0] = 5
show df
*Current Behavior*

.. ipython:: python
Break this up into 2 or more examples; it's just too hard to follow like this. Meaning:

change 1:
  before
  new
change 2:
  before
  new
@@ -93,9 +93,16 @@ def f(x):
        return x.drop_duplicates("person_name").iloc[0]

    result = g.apply(f)
    expected = x.iloc[[0, 1]].copy()
so if the tests change a lot like this, make a new test
def test_category_as_grouper_keys(as_index):
    # Accessing a key that is not in the dataframe
wait this raises now?
    expected.set_index(keys, inplace=True, drop=False)
    # GH 28549
    # No longer need to reset/set index here
    expected = df.groupby(keys).agg(fname)
Can you try not to use groupby here and just construct the expected result? The comment is also not very useful, as a future reader won't have context. Can you break this out into a new test?
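A hedged sketch of what constructing the expected frame by hand could look like; the data and aggregation here are illustrative, not taken from the actual test:

import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [1.0, 2.0, 3.0]})
result = df.groupby("A", as_index=False).agg("sum")

# Spell out the expected frame explicitly instead of deriving it from another
# groupby call, so a future reader can see the intended output at a glance.
expected = pd.DataFrame({"A": [1, 2], "B": [3.0, 3.0]})
pd.testing.assert_frame_equal(result, expected)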
@@ -340,10 +350,10 @@ def test_cython_api2():
     tm.assert_frame_equal(result, expected)

     # GH 13994
-    result = df.groupby("A").cumsum(axis=1)
+    result = df.groupby("A", as_index=False).cumsum(axis=1)
Can you test as_index=True as well? This seems like a big API change here.
     # GH 8467
     # first shows mutation indicator
     # second does not, but should yield the same results
     df = DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})

-    result1 = df.groupby("key", group_keys=True).apply(lambda x: x[:].key)
-    result2 = df.groupby("key", group_keys=True).apply(lambda x: x.key)
+    result1 = df.groupby("key", group_keys=True, as_index=False).apply(
what happens with as_index=True?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following methods now also correctly output values for unobserved categories when called through ``groupby(..., observed=False)`` (:issue:`17605`)

- :meth:`SeriesGroupBy.count`
where are these tested? can you do this as a separate change?
@jreback I'll look into making a PR that is a bit less of an overhaul than this one. Is the behavior that this PR introduces the desired behavior, with …? I understand what @jorisvandenbossche is saying about this making some major changes, and I just want to make sure it's correct before spending too much more time on it.
@christopherzimmerman I think the behavior changes in this PR are correct, meaning that these are some long-standing bugs. However, it would be better to do the non-controversial / easy ones first; the reason is we might need to deprecate some behavior rather than just breaking it. So cleanly separating these changes is key. Thanks for working on this!
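For what it's worth, a sketch of the deprecate-then-break pattern mentioned above, assuming the usual pandas approach of emitting a FutureWarning for a release cycle before changing behavior; the helper name and message are hypothetical:

import warnings


def _warn_group_column_selection_change():
    # Hypothetical helper: warn for one release cycle before actually
    # excluding the grouping column from the frames passed to apply.
    warnings.warn(
        "Accessing the grouping column inside DataFrameGroupBy.apply is "
        "deprecated and will raise in a future version.",
        FutureWarning,
        stacklevel=3,
    )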
@@ -88,10 +88,6 @@ def test_intercept_builtin_sum():
    tm.assert_series_equal(result2, expected)


# @pytest.mark.parametrize("f", [max, min, sum])
# def test_builtins_apply(f):
this is the kind of thing that can be done in a separate PR
@christopherzimmerman happy to have a smaller-scope PR that just fixes the problem.
@@ -1246,6 +1253,16 @@ def test_get_nonexistent_category():
    # Accessing a Category that is not in the dataframe
    df = pd.DataFrame({"var": ["a", "a", "b", "b"], "val": range(4)})
    with pytest.raises(KeyError, match="'vau'"):
        df.groupby("var").apply(
            lambda rows: pd.DataFrame({"val": [rows.iloc[-1]["vau"]]})
Nitpick: if it's the .apply that raises, let's do the df.groupby outside of the pytest.raises block, e.g. gb = df.groupby("var").
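A hedged sketch of the suggested restructuring, using the same data and match pattern as the diff above:

import pandas as pd
import pytest


def test_get_nonexistent_category():
    # Accessing a Category that is not in the dataframe
    df = pd.DataFrame({"var": ["a", "a", "b", "b"], "val": range(4)})
    gb = df.groupby("var")  # build the groupby outside the raises block
    with pytest.raises(KeyError, match="'vau'"):
        gb.apply(lambda rows: pd.DataFrame({"val": [rows.iloc[-1]["vau"]]}))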
@christopherzimmerman if you can push a more limited scope, I think we would like to fix these bugs.
Closing this as stale. As indicated above, we would welcome this type of change, but we need to do it incrementally to avoid breaking the world.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This issue needs to be addressed before #28541 can be merged