BUG: Bug fix for GH48567 #48772

Merged
merged 3 commits into from
Sep 30, 2022

Conversation

JanVHII
Contributor

@JanVHII JanVHII commented Sep 25, 2022

Groupby for Series fails if an entry of the index of the Series is equal to the name of the index. Analyzing the bug, I think the following small change would solve the issue:

diff --git a/pandas/core/groupby/grouper.py b/pandas/core/groupby/grouper.py
index 5fc713d84e..9281040a79 100644
--- a/pandas/core/groupby/grouper.py
+++ b/pandas/core/groupby/grouper.py
@@ -875,7 +875,7 @@ def get_grouper(
             exclusions.add(gpr.name)

         elif is_in_axis(gpr):  # df.groupby('name')
-            if gpr in obj:
+            if not isinstance(obj, Series) and gpr in obj:
                 if validate:
                     obj._check_label_or_level_ambiguity(gpr, axis=axis)
                 in_axis, name, gpr = True, gpr, obj[gpr]

From my point of view, "gpr in obj" does not make sense for a Series at this point. I think the if-statement was written for DataFrames, to check whether gpr is one of the columns. Skipping the if-statement for a Series gives the correct results for the examples above. The test suite for Series and groupby runs without errors after the change. I added a test at the end of pandas/tests/groupby/test_groupby.py and an entry in the BUG section of doc/source/whatsnew/v1.6.0.rst.
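To make the reasoning above concrete, here is a minimal sketch of what the membership test sees. The data is hypothetical and not taken from the issue; it only assumes that the index is named "blah" and also contains the label "blah".

```python
import pandas as pd

# Hypothetical data: the index is named "blah" and also contains the label "blah".
ser = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.Index(["foo", "bar", "baz", "blah"], name="blah"),
    name="values",
)
df = ser.to_frame()

# For a Series, `in` tests membership against the index labels.
in_series = "blah" in ser

# For a DataFrame, `in` tests membership against the column labels.
in_frame = "blah" in df

# Indexing the Series with the key yields a single value, not group keys.
picked = ser["blah"]

print(in_series, in_frame, picked)
```

So for a Series the original check can succeed simply because a row label happens to equal the index name, and `obj[gpr]` then returns one element instead of something to group by.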

index=Index(['bar', 'baz', 'blah', 'foo'], name='blah'))
tm.assert_series_equal(result1, expected1)

series2 = Series(data=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='values',
Member

Could you use pytest.mark.parametrize to simplify these similar tests?

Also, it looks like there are formatting issues here: https://github.com/pandas-dev/pandas/actions/runs/3122348100/jobs/5064232051

Contributor Author

I will try to simplify the tests using pytest.mark.parametrize.

I am not able to use pre-commit (there is a known bug with posix_prefix, posix_local and virtualenv) on my system. I am still searching for a solution. Is there another way to check formatting?

Member

pre-commit runs black and flake8 so if you're able to run those manually it should be comparable

Used pytest.mark.parametrize to simplify similar tests. Corrected formatting issues.
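For readers unfamiliar with the decorator mentioned above, a parametrized test along these lines might look as follows. This is only a sketch: the test name, the data, and the comparison against `groupby(level=...)` are assumptions for illustration, not the test that was merged, and it presumes a pandas version that includes this fix.

```python
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "labels",
    [
        ["foo", "foo", "bar", "blah"],  # the index contains its own name
        ["foo", "foo", "bar", "baz"],   # the index does not contain its own name
    ],
)
def test_series_groupby_by_index_name(labels):
    # Hypothetical test: grouping by the index name should behave the same
    # as grouping by the level explicitly, whatever the labels are.
    ser = pd.Series(
        [1.0, 2.0, 3.0, 4.0],
        index=pd.Index(labels, name="blah"),
        name="values",
    )
    result = ser.groupby("blah").sum()
    expected = ser.groupby(level="blah").sum()
    pd.testing.assert_series_equal(result, expected)
```

Each entry in the list becomes its own test case, so near-identical test bodies collapse into one function.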
@@ -875,7 +875,7 @@ def is_in_obj(gpr) -> bool:
             exclusions.add(gpr.name)

         elif is_in_axis(gpr):  # df.groupby('name')
-            if gpr in obj:
+            if not isinstance(obj, Series) and gpr in obj:
Member

Maybe this should be gpr in group_axis?

I am not 100% familiar with this code path but there may be a case where this can apply to the Series index?

Contributor Author

For a Series index, the check one line earlier, "is_in_axis(gpr)", evaluates to true. But the if-statement "gpr in obj" is only meant for DataFrames, to check whether the group key gpr is one of the columns. For a Series (with gpr referring to the Series index), the corresponding else branch is the correct one. The new test cases show that the result is correct if the group key is the Series index. I checked that it also works for a Series with a MultiIndex.
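The MultiIndex case mentioned above can be sketched like this. The data and level names are hypothetical, and the snippet assumes a pandas version that includes this fix.

```python
import pandas as pd

# Hypothetical MultiIndex whose first level is named "blah" and also
# contains the label "blah".
index = pd.MultiIndex.from_arrays(
    [["blah", "blah", "foo"], ["x", "y", "x"]],
    names=["blah", "k"],
)
ser = pd.Series([1.0, 2.0, 3.0], index=index, name="values")

# Grouping by the level name as a plain key ...
by_name = ser.groupby("blah").sum()

# ... is expected to match grouping by the level explicitly.
by_level = ser.groupby(level="blah").sum()

pd.testing.assert_series_equal(by_name, by_level)
print(by_name)
```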

Member

Agreed - this looks good to me. For a Series, we only need to consider the case axis=0. So when you have an "in-axis grouper" for a Series, I think there are conceptually only two options: the index, or the values themselves. I believe grouping by the values via the in-axis path, e.g.

ser = pd.Series([1, 2, 1, 2], index=[1, 1, 2, 2], name="b")
ser.index.name = "a"
ser.groupby("b").sum()

is not supported at all (though ser.groupby(ser).sum() is supported - but that's a different code path) and is outside the scope here. So that only leaves checking if the provided grouper matches the index name, which is the elif block here.
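The three cases described in this comment can be checked side by side. This sketch reuses the example Series from the comment; the variable names are only illustrative.

```python
import pandas as pd

ser = pd.Series([1, 2, 1, 2], index=[1, 1, 2, 2], name="b")
ser.index.name = "a"

# Grouping by the index name resolves to the index level "a".
by_index_name = ser.groupby("a").sum()

# Passing the Series itself groups by its values (a different code path).
by_values = ser.groupby(ser).sum()

# Passing the Series name as a string is not resolved to the values.
try:
    ser.groupby("b").sum()
    name_lookup_failed = False
except KeyError:
    name_lookup_failed = True

print(by_index_name.tolist(), by_values.tolist(), name_lookup_failed)
```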

Member

@rhshadrach rhshadrach left a comment

Looks great! Small request.

Changed the isinstance check to obj.ndim != 1 as requested.
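As a rough illustration of the final condition, the helper below mirrors the `obj.ndim != 1 and gpr in obj` check outside of pandas internals. The function name `is_column_key` and the data are hypothetical; the real check lives inline in `get_grouper`.

```python
import pandas as pd


def is_column_key(obj, gpr) -> bool:
    # Hypothetical stand-in for the merged condition: only objects with
    # more than one dimension (DataFrames) can have gpr as a column label.
    return obj.ndim != 1 and gpr in obj


ser = pd.Series([1.0, 2.0], index=pd.Index(["foo", "blah"], name="blah"))
df = pd.DataFrame({"blah": ["x", "y"], "values": [1.0, 2.0]})

series_result = is_column_key(ser, "blah")  # Series: ndim is 1, so skipped
frame_result = is_column_key(df, "blah")    # DataFrame: "blah" is a column

print(series_result, frame_result)
```

Checking `ndim` rather than using `isinstance` gives the same outcome for Series and DataFrame without needing the `Series` class at that point in the code.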
Member

@rhshadrach rhshadrach left a comment

lgtm

@mroeschke mroeschke added this to the 1.6 milestone Sep 30, 2022
@mroeschke mroeschke merged commit d76b9f2 into pandas-dev:main Sep 30, 2022
@mroeschke
Member

Thanks @JanVHII

@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* BUG: Bug fix for GH48567

* BUG: Bug fix for GH48567

Used pytest.mark.parametrize to simplify similar tests. Corrected formatting issues.

* BUG: Bug fix for GH48567

Changed the isinstance check to obj.ndim != 1 as requested.
Successfully merging this pull request may close these issues.

BUG: Groupby for Series fails if an entry of the index of the Series is equal to the name of the index
3 participants