BUG: Bug fix for GH48567 #48772

JanVHII · 2022-09-25T13:35:07Z

Groupby for Series fails if an entry of the index of the Series is equal to the name of the index. Analyzing the bug, I think the following small change would solve the issue:

```diff
diff --git a/pandas/core/groupby/grouper.py b/pandas/core/groupby/grouper.py
index 5fc713d84e..9281040a79 100644
--- a/pandas/core/groupby/grouper.py
+++ b/pandas/core/groupby/grouper.py
@@ -875,7 +875,7 @@ def get_grouper(
             exclusions.add(gpr.name)
 
         elif is_in_axis(gpr):  # df.groupby('name')
-            if gpr in obj:
+            if not isinstance(obj, Series) and gpr in obj:
                 if validate:
                     obj._check_label_or_level_ambiguity(gpr, axis=axis)
                 in_axis, name, gpr = True, gpr, obj[gpr]
```

From my point of view, "gpr in obj" does not make sense for a Series at this point. I think the if-statement was written for DataFrames, to check whether gpr is one of the columns. Skipping the if-statement for a Series gives the correct results for the examples above. The Series and groupby test suites run without errors after the change. I added a test at the end of pandas/tests/groupby/test_groupby.py and an entry in the BUG section of doc/source/whatsnew/v1.6.0.rst.
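To make the reasoning concrete, here is a minimal sketch with made-up data (the values and labels are assumptions, not taken from the issue). It shows that membership on a Series tests index labels while membership on a DataFrame tests columns, which is why the unguarded check misfires when an index entry equals the index name. The last call assumes a pandas version that includes this fix.

```python
import pandas as pd

# Hypothetical data: the index is named "blah" and also contains the label "blah".
ser = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.Index(["foo", "blah", "foo", "bar"], name="blah"),
    name="values",
)

# Membership on a Series tests the index labels, not column names,
# so this is True only because "blah" happens to be an index entry.
label_hit = "blah" in ser

# On a DataFrame the same expression tests the columns instead.
frame = ser.to_frame()
column_hit = "values" in frame
index_label_in_frame = "blah" in frame

# Grouping by the index level explicitly never touches the ambiguous branch.
by_level = ser.groupby(level="blah").sum()

# Assumption: running on a pandas release that contains this fix.
# Grouping by the index name should then resolve to the index level as well.
by_name = ser.groupby("blah").sum()
```

On a release without the fix, `ser.groupby(level="blah")` is a workaround that avoids the by-name lookup entirely.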

closes #48567 (BUG: Groupby for Series fails if an entry of the index of the Series is equal to the name of the index)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

mroeschke · 2022-09-26T18:00:06Z

pandas/tests/groupby/test_groupby.py

+                       index=Index(['bar', 'baz', 'blah', 'foo'], name='blah'))
+    tm.assert_series_equal(result1, expected1)
+
+    series2 = Series(data=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='values',


Could you use pytest.mark.parametrize to simplify these similar tests?

Also, it looks like there are formatting issues here: https://github.com/pandas-dev/pandas/actions/runs/3122348100/jobs/5064232051

I will try to simplify the tests using pytest.mark.parametrize.

I am not able to use pre-commit (there is a known bug with posix_prefix, posix_local and virtualenv) on my system. I am still searching for a solution. Is there another way to check formatting?

pre-commit runs black and flake8, so if you're able to run those manually the result should be comparable.

Used pytest.mark.parametrize to simplify similar tests. Corrected formatting issues.
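For readers unfamiliar with the pattern, a parametrized version of such a test might look roughly like the sketch below. The test name and all data are hypothetical illustrations, not the cases added in this PR, and the test assumes a pandas version that includes the fix.

```python
import pandas as pd
import pytest
from pandas.testing import assert_series_equal


@pytest.mark.parametrize(
    "values, labels, expected_values",
    [
        # Hypothetical case 1: "blah" appears once among the index labels.
        ([1.0, 2.0, 3.0, 4.0], ["foo", "bar", "foo", "blah"], [2.0, 4.0, 4.0]),
        # Hypothetical case 2: "blah" appears twice among the index labels.
        (
            [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
            ["blah", "blah", "bar", "bar", "foo", "foo"],
            [7.0, 3.0, 11.0],
        ),
    ],
)
def test_groupby_index_name_matches_index_entry(values, labels, expected_values):
    # The index name equals one of the index entries.
    ser = pd.Series(values, index=pd.Index(labels, name="blah"), name="values")
    result = ser.groupby("blah").sum()
    # Group keys come back sorted, so the expected index is in sorted order.
    expected = pd.Series(
        expected_values,
        index=pd.Index(["bar", "blah", "foo"], name="blah"),
        name="values",
    )
    assert_series_equal(result, expected)
```

Each tuple in the list becomes its own test case, so near-identical test bodies collapse into one function.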

mroeschke · 2022-09-27T22:43:33Z

pandas/core/groupby/grouper.py

@@ -875,7 +875,7 @@ def is_in_obj(gpr) -> bool:
            exclusions.add(gpr.name)

        elif is_in_axis(gpr):  # df.groupby('name')
-            if gpr in obj:
+            if not isinstance(obj, Series) and gpr in obj:


Maybe this should be gpr in group_axis?

I am not 100% familiar with this code path but there may be a case where this can apply to the Series index?

For a Series grouped by its index, the preceding check, "is_in_axis(gpr)", evaluates to True. The check "gpr in obj", however, is meant only for DataFrames, to test whether the group key gpr is one of the columns. For a Series (with gpr referring to the Series index), the elif branch that follows is the correct one. The new test cases show that the result is correct when the group key is the Series index. I also checked that it works for a Series with a MultiIndex.
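The MultiIndex case mentioned above can be sketched as follows. The data is hypothetical, and the by-name call assumes a pandas version that includes this fix; the comparison against an explicit level lookup is just one way to check the result.

```python
import pandas as pd

# Hypothetical MultiIndex whose first level is named "blah"
# and also contains the entry "blah".
index = pd.MultiIndex.from_tuples(
    [("blah", "x"), ("foo", "x"), ("blah", "y"), ("bar", "y")],
    names=["blah", "other"],
)
ser = pd.Series([1.0, 2.0, 3.0, 4.0], index=index, name="values")

# Explicit level lookup: unambiguous reference result.
by_level = ser.groupby(level="blah").sum()

# Assumption: with the fix, the name resolves to the index level too.
by_name = ser.groupby("blah").sum()
```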

Agreed - this looks good to me. For a Series, we only need to consider the case axis=0. So when you have an "in-axis grouper" for a Series, I think there are conceptually only two options: the index, or the values themselves. I believe grouping by the values via the in-axis path, e.g.

    ser = pd.Series([1, 2, 1, 2], index=[1, 1, 2, 2], name="b")
    ser.index.name = "a"
    ser.groupby("b").sum()

is not supported at all (though ser.groupby(ser).sum() is supported, via a different code path) and is outside the scope here. That leaves only the check for whether the provided grouper matches the index name, which is the elif block here.
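The three cases in this comment can be run side by side. The sketch below reuses the Series from the comment; the assumption that the unsupported by-name lookup surfaces as a `KeyError` is mine and is marked in the code.

```python
import pandas as pd

ser = pd.Series([1, 2, 1, 2], index=[1, 1, 2, 2], name="b")
ser.index.name = "a"

# Grouping by the index name: the level-reference (elif) path.
by_index = ser.groupby("a").sum()

# Grouping by the values: pass the Series itself as the grouper.
by_values = ser.groupby(ser).sum()

# Grouping by the Series name is not supported.
# Assumption: the failure is raised as a KeyError.
try:
    ser.groupby("b").sum()
    name_lookup_failed = False
except KeyError:
    name_lookup_failed = True
```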

rhshadrach

Looks great! Small request.

pandas/core/groupby/grouper.py

Changed the isinstance check to obj.ndim != 1 as requested.
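As a rough illustration of the revised condition: `ndim` is 1 for a Series and 2 for a DataFrame, so the check distinguishes the two without an `isinstance` test. The helper name below is hypothetical and only mirrors the condition described here; it is not the code in grouper.py.

```python
import pandas as pd


def column_lookup_applies(obj, key):
    """Hypothetical helper mirroring the condition ``obj.ndim != 1 and key in obj``."""
    return obj.ndim != 1 and key in obj


# Made-up data: a Series whose index name equals an index entry,
# and a DataFrame that really has a "blah" column.
ser = pd.Series(
    [1.0, 2.0], index=pd.Index(["blah", "foo"], name="blah"), name="values"
)
frame = pd.DataFrame({"blah": ["x", "y"], "values": [1.0, 2.0]})

series_takes_column_path = column_lookup_applies(ser, "blah")
frame_takes_column_path = column_lookup_applies(frame, "blah")
```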

rhshadrach

lgtm

mroeschke · 2022-09-30T16:20:12Z

Thanks @JanVHII

mroeschke reviewed Sep 26, 2022

mroeschke added Bug Groupby labels Sep 26, 2022

BUG: Bug fix for GH48567

85bd0d8

Used pytest.mark.parameterize to simplify similar tests. Corrected formatting issues.

mroeschke reviewed Sep 27, 2022

rhshadrach requested changes Sep 30, 2022

BUG: Bug fix for GH48567

41ac027

Changed the isinstance check to obj.ndim != 1 as requested.

rhshadrach approved these changes Sep 30, 2022

mroeschke added this to the 1.6 milestone Sep 30, 2022

mroeschke approved these changes Sep 30, 2022

mroeschke merged commit d76b9f2 into pandas-dev:main Sep 30, 2022

mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
