
[BUG] Fixed behavior of DataFrameGroupBy.apply to respect _group_selection_context #29131


Closed · wants to merge 19 commits

Conversation

@christopherzimmerman (Contributor) commented on Oct 21, 2019:

This issue needs to be addressed before #28541 can be merged

@WillAyd (Member) left a comment:

Stylistic comments; need to review in more detail later

Does this have any backwards compatibility impacts on the API?

@@ -363,11 +363,19 @@ def f(group):
tm.assert_frame_equal(result.loc[key], f(group))


def test_apply_chunk_view():
@pytest.mark.parametrize("as_index", [False, True])

Review comment from a Member:

Do you even need this in the test? Looks like the test itself just remove the True case

@christopherzimmerman (Contributor, PR author) replied:

Yes, there are impacts. The change to this section guarantees that the grouper column from apply will never be in the output unless as_index is set to False. Previously, inclusion depended on whether or not the reduction operation would fail on the grouper column.
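
To illustrate the described behavior, here is a minimal sketch (hedged: the exact output depends on the pandas version and this branch; results are not copied from a run):

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, 4]})

# With as_index=True (the default), the grouper column "a" becomes the index
# and, per this PR's description, is no longer included in the apply output.
df.groupby("a").apply(lambda g: g.sum())

# With as_index=False, the grouper column is kept as a regular column.
df.groupby("a", as_index=False).apply(lambda g: g.sum())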

@WillAyd (Member) left a comment:

This is a great change. The only thing we might need to be careful of is that users may have also relied on this buggy behavior in the past (to their credit, we did internally!)

To guard against that, can you add a section to the whatsnew for backwards-incompatible API changes that spells out the changes in more detail? I think it would be particularly worthwhile to point out the test case that previously worked but now raises an AttributeError.

return result1, result2

if as_index:
with pytest.raises(AttributeError):

Review comment from a Member:

Can you add a match= argument here? You should see that fairly often elsewhere in the code.
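
For reference, a minimal sketch of the suggested pattern (the message text here is only illustrative, not the one asserted in this test):

import pytest

def test_raises_with_match():
    # match= checks the exception message (as a regex) in addition to the type.
    with pytest.raises(AttributeError, match="has no attribute"):
        raise AttributeError("'DataFrame' object has no attribute 'foo'")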

Review comment from a Member:

Also, this might be something that (while not previously correct) users would rely on; see the top message on how best to handle it.

Reply from @christopherzimmerman (PR author):

I took a stab at the whatsnew. I followed the pattern of another breaking change, but I have never done it before so I'm sure it will need tweaks.

@jreback (Contributor) left a comment:

will look soon

@WillAyd (Member) left a comment:

lgtm @jreback

@WillAyd added this to the 1.0 milestone on Oct 24, 2019

@jreback (Contributor) left a comment:

Your change is ok, but there are many comments on the test changes.

@WillAyd (Member) commented on Oct 28, 2019:

Hmm there is still something weird going on here. On this branch I now see the following behavior:

>>> df = pd.DataFrame({"a": [1, 1, 2, 2, 3, 3], "b": [1, 2, 3, 4, 5, 6]})
>>> df.groupby("a", as_index=False).shift()
     b
0  NaN
1  1.0
2  NaN
3  3.0
4  NaN
5  5.0
>>> df.groupby("a", as_index=False).apply(lambda x: x.shift())
     a    b
0  NaN  NaN
1  1.0  1.0
2  NaN  NaN
3  2.0  3.0
4  NaN  NaN
5  3.0  5.0

I would expect both of those outputs to be the same. Came across this looking at #13519

We generally might need to step back and think about how the as_index parameter is presented to end users. Right now it seems to work for agg and apply but transform doesn't respect it (to be fair, the documentation states it is only valid for aggregated output).

That might be OK, but at the same time I don't think we want people to start doing apply(lambda x: x.shift()) instead of using .shift() if they do desire output with the grouper column(s) intact.
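
A small sketch of the transform point (a hedged illustration; as_index is documented as relevant only for aggregated output):

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, 4]})

# agg respects as_index=False: "a" comes back as a regular column.
df.groupby("a", as_index=False).agg("sum")

# transform always returns output aligned to the original index,
# so as_index has no visible effect here.
df.groupby("a", as_index=False).transform("sum")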

@jbrockmendel (Member) commented:

@christopherzimmerman can you rebase

@christopherzimmerman (PR author) replied:

@WillAyd I tracked down the issue with shift that you mentioned. Is that worth a separate PR, or should the behavior stay as it is now?

@WillAyd (Member) left a comment:

OK, thanks for identifying that. I think that's fine as a follow-up PR.

@jreback (Contributor) commented on Nov 20, 2019:

can you merge master

@WillAyd (Member) left a comment:

lgtm @jreback

@jorisvandenbossche (Member) commented:

I am not fully convinced this is a "bug fix".

The whole as_index question (when the group key is set as the index and when not, or whether it is included in the group df or not) is a complete mess, for sure.
But I think possibly the biggest change in this PR is in the dataframe that is passed to the applied function. While the whatsnew note mostly speaks about including the group key in the result, users can currently write functions that assume the group key is part of the dataframe on which the function operates.

Basically, that you can think of df.groupby('a').apply(f) as doing something like

for name, group in df.groupby('a'):
    res = f(group)
    ....

which is no longer true (as the group passed to f changed).

For reducing functions I agree that you typically don't want the group column to be included in the grouped dataframe on which the function is applied. But in general, for custom functions that are not necessarily reducing functions, this is not necessarily true.
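
A hedged illustration of that concern (hypothetical user code, not taken from this PR): a custom applied function that reads the grouping column from the group it receives would break if that column is no longer passed.

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [10, 20, 30, 40]})

def label_rows(group):
    # Assumes the grouping column "a" is still present in the frame passed to
    # the function; this raises KeyError if apply stops including that column.
    return group.assign(label=group["a"].astype(str) + "-" + group["b"].astype(str))

df.groupby("a", group_keys=False).apply(label_rows)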

@jreback removed this from the 1.0 milestone on Jan 1, 2020

@jreback (Contributor) commented on Jan 1, 2020:

can you merge master and we'll take another look (and address @jorisvandenbossche's comments as well)

@WillAyd (Member) commented on Feb 2, 2020:

@christopherzimmerman looks like CI is red and there are some merge conflicts; any chance you can fix those up?

@christopherzimmerman (PR author) replied:

@WillAyd I merged and fixed the failing tests. It seems like tests keep getting written that rely on the old behavior of apply.

@jreback (Contributor) left a comment:

@christopherzimmerman there is a lot here. Is it possible to do a precursor PR which adds the fixtures and locks down the current behavior? Then it will be more obvious in this change PR what exactly is changing.
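
For context, a hedged sketch of what such a precursor PR could look like (fixture and test names here are hypothetical): parametrize over as_index and assert today's behavior so the later change shows up as an explicit diff.

import pandas as pd
import pytest

@pytest.fixture(params=[True, False])
def as_index(request):
    # Hypothetical fixture: run each test with both as_index settings.
    return request.param

def test_apply_locks_down_current_behavior(as_index):
    # Records the current output type/shape so a behavior change is visible.
    df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, 4]})
    result = df.groupby("a", as_index=as_index).apply(lambda g: g.sum())
    assert isinstance(result, pd.DataFrame)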

@@ -398,6 +398,81 @@ keywords.

df.rename(index={0: 1}, columns={0: 2})


.. _whatsnew_1000.api_breaking.GroupBy.apply:

Review comment from a Contributor:

need to move to 1.1

.. code-block:: ipython

In [1]: df = pd.DataFrame({"a": [1, 1, 2, 2, 3, 3], "b": [1, 2, 3, 4, 5, 6]})

Review comment from a Contributor:

show df here

...: columns=['a', 'b'])

In [6]: df.iloc[2, 0] = 5

Review comment from a Contributor:

show df


*Current Behavior*

.. ipython:: python

Review comment from a Contributor:

break this up into 2 or more examples, it's just too hard to follow like this.

meaning:

- change 1: before, then new
- change 2: before, then new

@@ -93,9 +93,16 @@ def f(x):
return x.drop_duplicates("person_name").iloc[0]

result = g.apply(f)
expected = x.iloc[[0, 1]].copy()

Review comment from a Contributor:

so if the tests change a lot like this, make a new test



def test_category_as_grouper_keys(as_index):
# Accessing a key that is not in the dataframe

Review comment from a Contributor:

wait this raises now?

expected.set_index(keys, inplace=True, drop=False)
# GH 28549
# No longer need to reset/set index here
expected = df.groupby(keys).agg(fname)

Review comment from a Contributor:

can you try not to use groupby here and just construct the expected result.

the comment is also not very useful, as a future reader won't have context.

can you break this out into a new test?

@@ -340,10 +350,10 @@ def test_cython_api2():
tm.assert_frame_equal(result, expected)

# GH 13994
result = df.groupby("A").cumsum(axis=1)
result = df.groupby("A", as_index=False).cumsum(axis=1)

Review comment from a Contributor:

can you test as_index=True as well? This seems like a big API change here.

# GH 8467
# first show's mutation indicator
# second does not, but should yield the same results
df = DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})

result1 = df.groupby("key", group_keys=True).apply(lambda x: x[:].key)
result2 = df.groupby("key", group_keys=True).apply(lambda x: x.key)
result1 = df.groupby("key", group_keys=True, as_index=False).apply(

Review comment from a Contributor:

what happens with as_index=True?

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following methods now also correctly output values for unobserved categories when called through ``groupby(..., observed=False)`` (:issue:`17605`)

- :meth:`SeriesGroupBy.count`

Review comment from a Contributor:

where are these tested? can you do this as a separate change?

@jreback added the API - Consistency (Internal Consistency of API/Behavior) label on Feb 9, 2020
@jreback requested a review from jbrockmendel on February 9, 2020

@christopherzimmerman (PR author) commented:

@jreback I'll look into making a PR that is a bit less of an overhaul than this one. Is the behavior that this PR introduces the desired behavior, i.e. apply output not containing the grouper columns unless specified? If so, then I'll also need to follow up with PRs to handle shift, resample, and several others so that they all respect the context as well.

I understand what @jorisvandenbossche is saying about this making some major changes, and I just want to make sure it's correct before spending too much more time on it.

@jreback (Contributor) commented on Feb 10, 2020:

@christopherzimmerman I think the behavior changes in this PR are correct, meaning that these are some long-standing bugs. However, it would be better to do the non-controversial / easy ones first; the reason is that we might need to deprecate some behavior rather than just breaking it. So cleanly separating these changes is key.

thanks for working on this!
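
For context, a hedged sketch of the kind of deprecation path being suggested (a hypothetical helper, not actual pandas code): warn when the old behavior is hit and only change it in a later release.

import warnings

def _warn_grouper_in_apply_output(includes_grouper: bool) -> None:
    # Hypothetical internal helper: warn users who rely on the grouper column
    # appearing in DataFrameGroupBy.apply output, instead of breaking it outright.
    if includes_grouper:
        warnings.warn(
            "Including the grouper column in DataFrameGroupBy.apply output "
            "is deprecated and will change in a future version.",
            FutureWarning,
            stacklevel=3,
        )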

@@ -88,10 +88,6 @@ def test_intercept_builtin_sum():
tm.assert_series_equal(result2, expected)


# @pytest.mark.parametrize("f", [max, min, sum])
# def test_builtins_apply(f):

Review comment from a Member:

this is the kind of thing that can be done in a separate PR

@jreback (Contributor) commented on Mar 15, 2020:

@christopherzimmerman happy to have a smaller-scope PR that just fixes the problem.

@@ -1246,6 +1253,16 @@ def test_get_nonexistent_category():
# Accessing a Category that is not in the dataframe
df = pd.DataFrame({"var": ["a", "a", "b", "b"], "val": range(4)})
with pytest.raises(KeyError, match="'vau'"):
df.groupby("var").apply(
lambda rows: pd.DataFrame({"val": [rows.iloc[-1]["vau"]]})

Review comment from a Member:

nitpick: if it's the .apply that raises, let's do the df.groupby outside of the pytest.raises block, e.g. gb = df.groupby("var")
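
A minimal sketch of the suggested restructuring, based on the diff above (illustrative, not necessarily the exact final test):

import pandas as pd
import pytest

def test_get_nonexistent_category():
    # Build the groupby outside the raises block so only the apply call
    # is asserted to raise.
    df = pd.DataFrame({"var": ["a", "a", "b", "b"], "val": range(4)})
    gb = df.groupby("var")
    with pytest.raises(KeyError, match="'vau'"):
        gb.apply(lambda rows: pd.DataFrame({"val": [rows.iloc[-1]["vau"]]}))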

@jreback (Contributor) commented on Apr 10, 2020:

@christopherzimmerman if you can push a more limited-scope change, I think we would like to fix these bugs.

@jreback (Contributor) commented on Jun 14, 2020:

Closing this as stale. As indicated above, we would welcome this type of change, but we need to do it incrementally to avoid breaking the world.

@jreback closed this on Jun 14, 2020
Labels: API - Consistency (Internal Consistency of API/Behavior), Groupby
Successfully merging this pull request may close these issues.

Many groupby tests depend on a bug in DataFrameGroupBy.apply
5 participants