-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix groupby observed=True when aggregating a column #24412
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Bug would have impacted any groupby function that relied on `observed` if it were `True`
Codecov Report
@@ Coverage Diff @@
## master #24412 +/- ##
=======================================
Coverage 92.3% 92.3%
=======================================
Files 163 163
Lines 51943 51943
=======================================
Hits 47946 47946
Misses 3997 3997
Continue to review full report at Codecov.
|
4 similar comments
Codecov Report
@@ Coverage Diff @@
## master #24412 +/- ##
=======================================
Coverage 92.3% 92.3%
=======================================
Files 163 163
Lines 51943 51943
=======================================
Hits 47946 47946
Misses 3997 3997
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24412 +/- ##
=======================================
Coverage 92.3% 92.3%
=======================================
Files 163 163
Lines 51943 51943
=======================================
Hits 47946 47946
Misses 3997 3997
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24412 +/- ##
=======================================
Coverage 92.3% 92.3%
=======================================
Files 163 163
Lines 51943 51943
=======================================
Hits 47946 47946
Misses 3997 3997
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24412 +/- ##
=======================================
Coverage 92.3% 92.3%
=======================================
Files 163 163
Lines 51943 51943
=======================================
Hits 47946 47946
Misses 3997 3997
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24412 +/- ##
===========================================
- Coverage 92.3% 31.9% -60.41%
===========================================
Files 163 166 +3
Lines 51977 52421 +444
===========================================
- Hits 47979 16723 -31256
- Misses 3998 35698 +31700
Continue to review full report at Codecov.
|
pandas/tests/groupby/test_groupby.py
Outdated
'a': pd.Series([1, 1, 2], dtype='category'), | ||
'b': [1, 2, 2], 'x': [1, 2, 3]}) | ||
|
||
expected = df |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use the observed fixture as well here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't quite understand. Could you elaborate?
Hello @Koustav-Samaddar! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on December 28, 2018 at 03:23 Hours UTC |
'b': [1, 2, 2], 'x': [1, 2, 3]}) | ||
|
||
result = expected.groupby(['a', 'b'], | ||
as_index=False, observed=True)['x'].sum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is as_index
required here? I see the comment in the original issue though based off of code correction and surprised if that's really the culprit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as_index
is required to demonstrate that the bug exists. The code block that requires the correct value of observed
is only enetered if as_index=False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To further clarify, no as_index
a.k.a. as_index=True
would show no difference since the code that is executed in this case does not use observed
.
'a': pd.Series([1, 1, 2], dtype='category'), | ||
'b': [1, 2, 2], 'x': [1, 2, 3]}) | ||
|
||
result = expected.groupby(['a', 'b'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be reading this wrong but do we need the non-categorical column in the grouper to emulate this behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This is because the bug generates a group that doesn't exist due to using some weird Cartesian product of the two columns used in the grouper.
Therefore, while the only valid groupings are (1, 1) (1, 2) (2, 2)
; the bug uses the CP of the two set(1, 2) * set(1, 2)
which results in an additional non-existent (2, 1)
grouping as well which is represented as NaN
in the bugged output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok thanks for the responses. In that case I think you can improve the whatsnew to say that the observed keyword was essentially being ignored when selecting one column as part of the GroupBy. It’s currently too vague IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also used to be the case. But I changed it to be simpler on @mroeschke 's advice (in a previous resolved convo).
I can change it back if needed but would just like a confirmation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarity is always welcome. But generally whatsnew entries should refer to externally facing methods and implications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea the previous entry was arguably too specific mentioning the exact internal method where the problem arose. The current entry is too vague to be useful.
If you can meet somewhere in the middle on that I think it again would be mentioning that the observed
keyword was not being respected, but without mentioning the private internal methods where the implementation was actually changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed it. Hopefully this is a better middle ground between the 2 previous approaches!
|
||
def test_groupby_agg_observed_true_single_column(): | ||
# GH-23970 | ||
expected = pd.DataFrame({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df
and expected
should be constructed as separate objects. I think generally this test case could be made more explicit depending on answer to other questions, but once we can clarify those pieces will want to split this out as two separate variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can create a test case where df
and expected
are different but that'll cause df
to be larger.
I originally had expected
and df
as separate variable names but on @mroeschke 's advice (in a previous resolved convo) I replaced df
with expected
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize over as_index=True/False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was parameterising as_index
, but I hit a roadblock - I'm having difficulty recreating the expected value for as_index=True
and I'm stumped how to go about fixing it.
Any help is greatly appreciated.
import pandas as pd
from pandas.util import testing as tm
expected = pd.DataFrame({
'a': pd.Series([1, 1, 2], dtype='category'),
'b': [1, 2, 2],
'x': [1, 2, 3]
})
result = expected.groupby(['a', 'b'],
as_index=True,
observed=True)['x'].sum()
print(result)
"""
a b
1 1 1
2 2
2 2 3
Name: x, dtype: int64
"""
s_idx = pd.MultiIndex(levels=[[1, 2], [1, 2]],
codes=[[0, 0, 1], [0, 1, 1]],
names=['a', 'b'])
s_val = [ 1, 2, 3 ]
expected = pd.Series(index=s_idx, data=s_val, name='x')
print(expected)
"""
a b
1 1 1
2 2
2 2 3
Name: x, dtype: int64
"""
tm.assert_series_equal(result, expected)
"""
Traceback (most recent call last):
File "myTest.py", line 41, in <module>
tm.assert_series_equal(result, expected)
File "/home/vagrant/pandas/pandas/util/testing.py", line 1373, in assert_series_equal
obj='{obj}.index'.format(obj=obj))
File "/home/vagrant/pandas/pandas/util/testing.py", line 942, in assert_index_equal
check_exact=check_exact, obj=lobj)
File "/home/vagrant/pandas/pandas/util/testing.py", line 915, in assert_index_equal
_check_types(left, right, obj=obj)
File "/home/vagrant/pandas/pandas/util/testing.py", line 891, in _check_types
assert_class_equal(l, r, exact=exact, obj=obj)
File "/home/vagrant/pandas/pandas/util/testing.py", line 996, in assert_class_equal
repr_class(right))
File "/home/vagrant/pandas/pandas/util/testing.py", line 1188, in raise_assert_detail
raise AssertionError(msg)
AssertionError: MultiIndex level [0] are different
MultiIndex level [0] classes are not equivalent
[left]: CategoricalIndex([1, 1, 2], categories=[1, 2], ordered=False, name='a', dtype='category')
[right]: Int64Index([1, 1, 2], dtype='int64', name='a')
"""
I understand that the issue is with not using CategoricalIndex at MultiIndex[0] but I have no idea how to go about doing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use MultiIndex.from_arrays
and pass a list of the Index already of the correct types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot!
doc/source/whatsnew/v0.24.0.rst
Outdated
@@ -1560,6 +1560,7 @@ Groupby/Resample/Rolling | |||
- Bug in :meth:`pandas.core.groupby.GroupBy.rank` with ``method='dense'`` and ``pct=True`` when a group has only one member would raise a ``ZeroDivisionError`` (:issue:`23666`). | |||
- Calling :meth:`pandas.core.groupby.GroupBy.rank` with empty groups and ``pct=True`` was raising a ``ZeroDivisionError`` (:issue:`22519`) | |||
- Bug in :meth:`DataFrame.resample` when resampling ``NaT`` in ``TimeDeltaIndex`` (:issue:`13223`). | |||
- Bug in :meth:`DataFrame.groupby` did not work correctly with ``observed=True`` when aggregating a specified column (:issue:`23970`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is when its a categorical column; also this is only for as_index=False
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any operation over a selected column was effectively ignoring the observed
parameter that was being passed to it. With as_index=True
the observed
value isn't used so the code behaves as expected.
|
||
def test_groupby_agg_observed_true_single_column(): | ||
# GH-23970 | ||
expected = pd.DataFrame({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize over as_index=True/False
|
||
def test_groupby_agg_observed_true_single_column(): | ||
# GH-23970 | ||
expected = pd.DataFrame({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use MultiIndex.from_arrays
and pass a list of the Index already of the correct types
A small note: this does not close #21151. |
result = df.groupby( | ||
['a', 'b'], as_index=as_index, observed=True)['x'].sum() | ||
|
||
if isinstance(result, pd.DataFrame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you an use assert_equal here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Didn't know about that one
minor comment. ping on green. @WillAyd ? |
thanks! |
git diff upstream/master -u -- "*.py" | flake8 --diff