BUG: Bug fix for GH48567 #48772
Groupby for a Series fails if an entry of the Series index is equal to the name of the index. Analyzing the bug, I think the following small change would solve the issue:

diff --git a/pandas/core/groupby/grouper.py b/pandas/core/groupby/grouper.py
index 5fc713d..9281040 100644
--- a/pandas/core/groupby/grouper.py
+++ b/pandas/core/groupby/grouper.py
@@ -875,7 +875,7 @@ def get_grouper(
                 exclusions.add(gpr.name)

         elif is_in_axis(gpr):  # df.groupby('name')
-            if gpr in obj:
+            if not isinstance(obj, Series) and gpr in obj:
                 if validate:
                     obj._check_label_or_level_ambiguity(gpr, axis=axis)
                 in_axis, name, gpr = True, gpr, obj[gpr]

From my point of view, "gpr in obj" does not make sense for a Series at this point. I think the if-statement was written for DataFrames, to check whether gpr is one of the columns. Skipping the if-statement for a Series gives the correct results for the examples in GH48567. The test suite for Series and groupby runs without errors after the change. I added a test at the end of pandas/tests/groupby/test_groupby.py and an entry in the BUG section of doc/source/whatsnew/v1.6.0.rst.
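For reference, a minimal reproduction of the case described above might look like the sketch below. The data and the expected result index follow the test excerpt in this PR; the input index labels are an assumption chosen to match that expected index, and the variable names are illustrative. It assumes a pandas version that already contains this fix.

```python
import pandas as pd

# The index is named "blah" and also contains the label "blah".
# (Input labels are assumed; only the expected result index appears in the PR excerpt.)
ser = pd.Series(
    data=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    name="values",
    index=pd.Index(["foo", "foo", "bar", "baz", "blah", "blah"], name="blah"),
)

# Group by the index name; with the fix, "blah" is resolved as the index level
# rather than as a label lookup into the Series.
result = ser.groupby("blah").sum()
print(result)
```

With the fix applied, the result should be indexed by the unique labels bar, baz, blah and foo, with the index name preserved.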
pandas/tests/groupby/test_groupby.py
Outdated
        index=Index(['bar', 'baz', 'blah', 'foo'], name='blah'))
    tm.assert_series_equal(result1, expected1)

    series2 = Series(data=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name='values',
Could you use pytest.mark.parametrize to simplify these similar tests?
Also, it looks like there are formatting issues here: https://github.com/pandas-dev/pandas/actions/runs/3122348100/jobs/5064232051
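A parametrized version of such tests could be sketched roughly as follows. This is not the test that was merged: the test name, the CASES list, and the second case are hypothetical, and the input labels of the first case are assumed to match the expected index shown in the excerpt above. It also assumes a pandas version that includes the fix.

```python
import pandas as pd
import pytest

# Hypothetical cases:
# (values, index labels, index name, expected labels, expected sums)
CASES = [
    (
        [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        ["foo", "foo", "bar", "baz", "blah", "blah"],
        "blah",
        ["bar", "baz", "blah", "foo"],
        [3.0, 4.0, 11.0, 3.0],
    ),
    (
        [1.0, 2.0, 3.0],
        ["x", "key", "key"],
        "key",
        ["key", "x"],
        [5.0, 1.0],
    ),
]


@pytest.mark.parametrize("values, labels, name, exp_labels, exp_sums", CASES)
def test_groupby_index_name_matches_label(values, labels, name, exp_labels, exp_sums):
    # Build a Series whose index name also occurs as an index label.
    ser = pd.Series(values, name="values", index=pd.Index(labels, name=name))
    result = ser.groupby(name).sum()
    expected = pd.Series(
        exp_sums, name="values", index=pd.Index(exp_labels, name=name)
    )
    pd.testing.assert_series_equal(result, expected)
```

Each tuple in CASES becomes one test invocation, so adding a further scenario only requires another tuple rather than another copy of the test body.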
I will try to simplify the tests using pytest.mark.parametrize.
I am not able to use pre-commit (there is a known bug with posix_prefix, posix_local and virtualenv) on my system. I am still searching for a solution. Is there another way to check formatting?
pre-commit runs black and flake8, so if you're able to run those manually, the result should be comparable.
Used pytest.mark.parametrize to simplify similar tests. Corrected formatting issues.
pandas/core/groupby/grouper.py
Outdated
@@ -875,7 +875,7 @@ def is_in_obj(gpr) -> bool:
                 exclusions.add(gpr.name)

         elif is_in_axis(gpr):  # df.groupby('name')
-            if gpr in obj:
+            if not isinstance(obj, Series) and gpr in obj:
Maybe this should be gpr in group_axis?
I am not 100% familiar with this code path but there may be a case where this can apply to the Series index?
For a Series index, the if-statement one line before, "is_in_axis(gpr)", evaluates to true. But the if-statement "gpr in obj" exists only for DataFrames, to check whether the group key gpr is one of the columns. For a Series (with gpr referring to the Series index), the corresponding else branch is the correct one. The new test cases show that the result is correct if the group key is the Series index. I checked that it also works for a Series that has a MultiIndex.
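The difference in what "in" means for the two object types can be illustrated with a small sketch. The data below is illustrative, not taken from the pandas test suite; it only uses public pandas behaviour (membership on a DataFrame tests column labels, membership on a Series tests index labels).

```python
import pandas as pd

ser = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    name="values",
    index=pd.Index(["foo", "foo", "bar", "baz", "blah", "blah"], name="blah"),
)
df = ser.reset_index()  # DataFrame with columns "blah" and "values"

# DataFrame membership tests the column labels.
in_df = "blah" in df

# Series membership tests the index labels, not the index name and not
# the Series name. It is True here only because a row is labelled "blah".
in_ser = "blah" in ser
name_in_ser = "values" in ser

# Indexing the Series with that key selects the rows labelled "blah",
# which is shorter than the Series and so cannot serve as a grouping key.
selected = ser["blah"]
print(in_df, in_ser, name_in_ser, len(selected), len(ser))
```

So for a Series, "gpr in obj" followed by obj[gpr] picks out a subset of rows rather than a column of grouping values, which is consistent with the failure described in this PR.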
Agreed - this looks good to me. For a Series, we only need to consider the case axis=0. So when you have an "in-axis grouper" for a Series, I think there are conceptually only two options: the index, or the values themselves. I believe grouping by the values via the in-axis path, e.g.
ser = pd.Series([1, 2, 1, 2], index=[1, 1, 2, 2], name="b")
ser.index.name = "a"
ser.groupby("b").sum()
is not supported at all (though ser.groupby(ser).sum() is supported, but that's a different code path) and is outside the scope here. So that only leaves checking whether the provided grouper matches the index name, which is the elif block here.
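The three cases mentioned in this comment can be put side by side in a short sketch. It reuses the Series from the comment; the variable names are illustrative, and the KeyError for the Series name is what I would expect from the code path discussed here rather than something stated in the thread.

```python
import pandas as pd

ser = pd.Series([1, 2, 1, 2], index=[1, 1, 2, 2], name="b")
ser.index.name = "a"

# Grouping by the index name resolves to the index level.
by_index_name = ser.groupby("a").sum()

# Grouping by the Series itself groups by its values (a different code path).
by_values = ser.groupby(ser).sum()

# Grouping by the Series name as a string key is not supported.
try:
    ser.groupby("b").sum()
    raised = False
except KeyError:
    raised = True

print(by_index_name.tolist(), by_values.tolist(), raised)
```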
Looks great! Small request.
Changed the isinstance check to obj.ndim != 1 as requested.
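As a small illustration of why the two checks are interchangeable for the objects handled here, the sketch below compares them on a Series and a DataFrame. The helper names are hypothetical, and the motivation for the requested change is not stated in the thread.

```python
import pandas as pd

ser = pd.Series([1, 2, 3])
df = pd.DataFrame({"a": [1, 2, 3]})


def not_series_by_type(obj) -> bool:
    # The check from the first version of the patch.
    return not isinstance(obj, pd.Series)


def not_series_by_ndim(obj) -> bool:
    # The check after the requested change: a Series is one-dimensional.
    return obj.ndim != 1


print(ser.ndim, df.ndim)
```

For these two inputs both helpers return the same answer: False for the Series and True for the DataFrame.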
lgtm
Thanks @JanVHII