
[BUG] Fixed behavior of DataFrameGroupBy.apply to respect _group_selection_context #29131


Closed · wants to merge 19 commits

Conversation

@christopherzimmerman (Contributor) commented on Oct 21, 2019:

This issue needs to be addressed before #28541 can be merged

@WillAyd (Member) left a comment:

Stylistic comments; need to review in more detail later

Does this have any backwards compatibility impacts on the API?

@@ -363,11 +363,19 @@ def f(group):
tm.assert_frame_equal(result.loc[key], f(group))


def test_apply_chunk_view():
@pytest.mark.parametrize("as_index", [False, True])

Review comment from a Member:

Do you even need this in the test? Looks like the test itself just remove the True case

@christopherzimmerman (Contributor, PR author) replied:

Yes, there are impacts. The change to this section guarantees that the grouper column from apply will never be in the output unless as_index is set to False. Previously, inclusion depended on whether or not the reduction operation would fail on the grouper column.
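
To illustrate the described behavior, here is a minimal sketch (hedged: the exact output depends on the pandas version and this branch; results are not copied from a run):

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, 4]})

# With as_index=True (the default), the grouper column "a" becomes the index
# and, per this PR's description, is no longer included in the apply output.
df.groupby("a").apply(lambda g: g.sum())

# With as_index=False, the grouper column is kept as a regular column.
df.groupby("a", as_index=False).apply(lambda g: g.sum())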

@WillAyd (Member) left a comment:

This is a great change. The only thing we might need to be careful of is that users may have also relied on this buggy behavior in the past (to their credit, we did internally!)

To guard against that, can you add a section to the whatsnew for backwards-incompatible API changes that spells out the changes in more detail? I think it would be particularly worthwhile to point out the test case that previously worked but now raises an AttributeError.

return result1, result2

if as_index:
with pytest.raises(AttributeError):

Review comment from a Member:

Can you add a match= argument here? You should see that fairly often elsewhere in the code.
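
For reference, a minimal sketch of the suggested pattern (the message text here is only illustrative, not the one asserted in this test):

import pytest

def test_raises_with_match():
    # match= checks the exception message (as a regex) in addition to the type.
    with pytest.raises(AttributeError, match="has no attribute"):
        raise AttributeError("'DataFrame' object has no attribute 'foo'")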

Review comment from a Member:

Also, this might be something that (while not previously correct) users would rely on; see the top message on how best to handle it.

Reply from @christopherzimmerman (PR author):

I took a stab at the whatsnew. I followed the pattern of another breaking change, but I have never done it before so I'm sure it will need tweaks.

@jreback (Contributor) left a comment:

will look soon

@WillAyd (Member) left a comment:

lgtm @jreback

@WillAyd added this to the 1.0 milestone on Oct 24, 2019

@jreback (Contributor) left a comment:

Your change is ok, but there are many comments on the test changes.

@WillAyd (Member) commented on Oct 28, 2019:

Hmm there is still something weird going on here. On this branch I now see the following behavior:

>>> df = pd.DataFrame({"a": [1, 1, 2, 2, 3, 3], "b": [1, 2, 3, 4, 5, 6]})
>>> df.groupby("a", as_index=False).shift()
     b
0  NaN
1  1.0
2  NaN
3  3.0
4  NaN
5  5.0
>>> df.groupby("a", as_index=False).apply(lambda x: x.shift())
     a    b
0  NaN  NaN
1  1.0  1.0
2  NaN  NaN
3  2.0  3.0
4  NaN  NaN
5  3.0  5.0

I would expect both of those outputs to be the same. Came across this looking at #13519

We generally might need to step back and think about how the as_index parameter is presented to end users. Right now it seems to work for agg and apply but transform doesn't respect it (to be fair, the documentation states it is only valid for aggregated output).

That might be OK, but at the same time I don't think we want people to start doing apply(lambda x: x.shift()) instead of using .shift() if they do desire output with the grouper column(s) intact.
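
A small sketch of the transform point (a hedged illustration; as_index is documented as relevant only for aggregated output):

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, 4]})

# agg respects as_index=False: "a" comes back as a regular column.
df.groupby("a", as_index=False).agg("sum")

# transform always returns output aligned to the original index,
# so as_index has no visible effect here.
df.groupby("a", as_index=False).transform("sum")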

@jbrockmendel (Member) commented:

@christopherzimmerman can you rebase

@christopherzimmerman (PR author) replied:

@WillAyd I tracked down the issue with shift that you mentioned. Is that worth a separate PR, or should the behavior stay as it is now?

@WillAyd (Member) left a comment:

OK, thanks for identifying that. I think that's fine as a follow-up PR.

@jreback (Contributor) commented on Nov 20, 2019:

can you merge master

@WillAyd (Member) left a comment:

lgtm @jreback

@jorisvandenbossche (Member) commented:

I am not fully convinced this is a "bug fix".

The whole as_index question (when the group key is set as the index and when not, or whether it is included in the group df or not) is a complete mess, for sure.
But I think possibly the biggest change in this PR is in the dataframe that is passed to the applied function. While the whatsnew note mostly speaks about including the group key in the result, users can currently write functions that assume the group key is part of the dataframe on which the function operates.

Basically, that you can think of df.groupby('a').apply(f) as doing something like

for name, group in df.groupby('a'):
    res = f(group)
    ....

which is no longer true (as the group passed to f changed).

For reducing functions I agree that you typically don't want the group column to be included in the grouped dataframe on which the function is applied. But in general, for custom functions that are not necessarily reducing functions, this is not necessarily true.
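
A hedged illustration of that concern (hypothetical user code, not taken from this PR): a custom applied function that reads the grouping column from the group it receives would break if that column is no longer passed.

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [10, 20, 30, 40]})

def label_rows(group):
    # Assumes the grouping column "a" is still present in the frame passed to
    # the function; this raises KeyError if apply stops including that column.
    return group.assign(label=group["a"].astype(str) + "-" + group["b"].astype(str))

df.groupby("a", group_keys=False).apply(label_rows)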

@jreback removed this from the 1.0 milestone on Jan 1, 2020

@jreback (Contributor) commented on Jan 1, 2020:

can you merge master and we'll take another look (and address @jorisvandenbossche's comments as well)

@WillAyd (Member) commented on Feb 2, 2020:

@christopherzimmerman looks like CI is red and there are some merge conflicts; any chance you can fix those up?

@christopherzimmerman (PR author) replied:

@WillAyd I merged and fixed the failing tests. It seems like tests keep getting written that rely on the old behavior of apply.

@jreback (Contributor) left a comment:

@christopherzimmerman there is a lot here. Is it possible to do a precursor PR which adds the fixtures and locks down the current behavior? Then it will be more obvious in this change PR what exactly is changing.
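
For context, a hedged sketch of what such a precursor PR could look like (fixture and test names here are hypothetical): parametrize over as_index and assert today's behavior so the later change shows up as an explicit diff.

import pandas as pd
import pytest

@pytest.fixture(params=[True, False])
def as_index(request):
    # Hypothetical fixture: run each test with both as_index settings.
    return request.param

def test_apply_locks_down_current_behavior(as_index):
    # Records the current output type/shape so a behavior change is visible.
    df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, 4]})
    result = df.groupby("a", as_index=as_index).apply(lambda g: g.sum())
    assert isinstance(result, pd.DataFrame)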

@@ -398,6 +398,81 @@ keywords.

df.rename(index={0: 1}, columns={0: 2})


.. _whatsnew_1000.api_breaking.GroupBy.apply:

Review comment from a Contributor:

need to move to 1.1

.. code-block:: ipython

In [1]: df = pd.DataFrame({"a": [1, 1, 2, 2, 3, 3], "b": [1, 2, 3, 4, 5, 6]})

Review comment from a Contributor:

show df here

...: columns=['a', 'b'])

In [6]: df.iloc[2, 0] = 5

Review comment from a Contributor:

show df


*Current Behavior*

.. ipython:: python

Review comment from a Contributor:

break this up into 2 or more examples, it's just too hard to follow like this.

meaning:

- change 1: before, then new
- change 2: before, then new

@@ -93,9 +93,16 @@ def f(x):
return x.drop_duplicates("person_name").iloc[0]

result = g.apply(f)
expected = x.iloc[[0, 1]].copy()

Review comment from a Contributor:

so if the tests change a lot like this, make a new test



def test_category_as_grouper_keys(as_index):
# Accessing a key that is not in the dataframe

Review comment from a Contributor:

wait this raises now?

expected.set_index(keys, inplace=True, drop=False)
# GH 28549
# No longer need to reset/set index here
expected = df.groupby(keys).agg(fname)

Review comment from a Contributor:

can you try not to use groupby here and just construct the expected result.

the comment is also not very useful, as a future reader won't have context.

can you break this out into a new test?

@@ -340,10 +350,10 @@ def test_cython_api2():
tm.assert_frame_equal(result, expected)

# GH 13994
result = df.groupby("A").cumsum(axis=1)
result = df.groupby("A", as_index=False).cumsum(axis=1)

Review comment from a Contributor:

can you test as_index=True as well? This seems like a big API change here.

# GH 8467
# first show's mutation indicator
# second does not, but should yield the same results
df = DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})

result1 = df.groupby("key", group_keys=True).apply(lambda x: x[:].key)
result2 = df.groupby("key", group_keys=True).apply(lambda x: x.key)
result1 = df.groupby("key", group_keys=True, as_index=False).apply(

Review comment from a Contributor:

what happens with as_index=True?

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following methods now also correctly output values for unobserved categories when called through ``groupby(..., observed=False)`` (:issue:`17605`)

- :meth:`SeriesGroupBy.count`

Review comment from a Contributor:

where are these tested? can you do this as a separate change?

@jreback added the API - Consistency (Internal Consistency of API/Behavior) label on Feb 9, 2020
@jreback requested a review from jbrockmendel on February 9, 2020

@christopherzimmerman (PR author) commented:

@jreback I'll look into making a PR that is a bit less of an overhaul than this one. Is the behavior that this PR introduces the desired behavior, i.e. apply output not containing the grouper columns unless specified? If so, then I'll also need to follow up with PRs to handle shift, resample, and several others so that they all respect the context as well.

I understand what @jorisvandenbossche is saying about this making some major changes, and I just want to make sure it's correct before spending too much more time on it.

@jreback (Contributor) commented on Feb 10, 2020:

@christopherzimmerman I think the behavior changes in this PR are correct, meaning that these are some long-standing bugs. However, it would be better to do the non-controversial / easy ones first; the reason is that we might need to deprecate some behavior rather than just breaking it. So cleanly separating these changes is key.

thanks for working on this!
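
For context, a hedged sketch of the kind of deprecation path being suggested (a hypothetical helper, not actual pandas code): warn when the old behavior is hit and only change it in a later release.

import warnings

def _warn_grouper_in_apply_output(includes_grouper: bool) -> None:
    # Hypothetical internal helper: warn users who rely on the grouper column
    # appearing in DataFrameGroupBy.apply output, instead of breaking it outright.
    if includes_grouper:
        warnings.warn(
            "Including the grouper column in DataFrameGroupBy.apply output "
            "is deprecated and will change in a future version.",
            FutureWarning,
            stacklevel=3,
        )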

@@ -88,10 +88,6 @@ def test_intercept_builtin_sum():
tm.assert_series_equal(result2, expected)


# @pytest.mark.parametrize("f", [max, min, sum])
# def test_builtins_apply(f):

Review comment from a Member:

this is the kind of thing that can be done in a separate PR

@jreback (Contributor) commented on Mar 15, 2020:

@christopherzimmerman happy to have a smaller-scope PR that just fixes the problem.

@@ -1246,6 +1253,16 @@ def test_get_nonexistent_category():
# Accessing a Category that is not in the dataframe
df = pd.DataFrame({"var": ["a", "a", "b", "b"], "val": range(4)})
with pytest.raises(KeyError, match="'vau'"):
df.groupby("var").apply(
lambda rows: pd.DataFrame({"val": [rows.iloc[-1]["vau"]]})

Review comment from a Member:

nitpick: if it's the .apply that raises, let's do the df.groupby outside of the pytest.raises block, e.g. gb = df.groupby("var")
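
A minimal sketch of the suggested restructuring, based on the diff above (illustrative, not necessarily the exact final test):

import pandas as pd
import pytest

def test_get_nonexistent_category():
    # Build the groupby outside the raises block so only the apply call
    # is asserted to raise.
    df = pd.DataFrame({"var": ["a", "a", "b", "b"], "val": range(4)})
    gb = df.groupby("var")
    with pytest.raises(KeyError, match="'vau'"):
        gb.apply(lambda rows: pd.DataFrame({"val": [rows.iloc[-1]["vau"]]}))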

@jreback (Contributor) commented on Apr 10, 2020:

@christopherzimmerman if you can push a more limited-scope change, I think we would like to fix these bugs.

@jreback (Contributor) commented on Jun 14, 2020:

Closing this as stale. As indicated above, we would welcome this type of change, but we need to do it incrementally to avoid breaking the world.

@jreback closed this on Jun 14, 2020
Labels: API - Consistency (Internal Consistency of API/Behavior), Groupby
Successfully merging this pull request may close these issues.

Many groupby tests depend on a bug in DataFrameGroupBy.apply
5 participants