Skip to content

BUG: Fix groupby observed=True when aggregating a column #24412

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Dec 31, 2018
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1560,6 +1560,7 @@ Groupby/Resample/Rolling
- Bug in :meth:`pandas.core.groupby.GroupBy.rank` with ``method='dense'`` and ``pct=True`` when a group has only one member would raise a ``ZeroDivisionError`` (:issue:`23666`).
- Calling :meth:`pandas.core.groupby.GroupBy.rank` with empty groups and ``pct=True`` was raising a ``ZeroDivisionError`` (:issue:`22519`)
- Bug in :meth:`DataFrame.resample` when resampling ``NaT`` in ``TimeDeltaIndex`` (:issue:`13223`).
- Bug in :meth:`DataFrame.groupby` did not work correctly with ``observed=True`` when aggregating a specified column (:issue:`23970`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is when its a categorical column; also this is only for as_index=False, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any operation over a selected column was effectively ignoring the observed parameter that was being passed to it. With as_index=True the observed value isn't used so the code behaves as expected.


Reshaping
^^^^^^^^^
Expand Down
3 changes: 2 additions & 1 deletion pandas/core/groupby/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -1336,7 +1336,8 @@ def _gotitem(self, key, ndim, subset=None):
return DataFrameGroupBy(subset, self.grouper, selection=key,
grouper=self.grouper,
exclusions=self.exclusions,
as_index=self.as_index)
as_index=self.as_index,
observed=self.observed)
elif ndim == 1:
if subset is None:
subset = self.obj[key]
Expand Down
12 changes: 12 additions & 0 deletions pandas/tests/groupby/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -860,3 +860,15 @@ def test_groupby_multiindex_categorical_datetime():
expected = pd.DataFrame(
{'values': [0, 4, 8, 3, 4, 5, 6, np.nan, 2]}, index=idx)
assert_frame_equal(result, expected)


def test_groupby_agg_observed_true_single_column():
# GH-23970
expected = pd.DataFrame({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df and expected should be constructed as separate objects. I think generally this test case could be made more explicit depending on answer to other questions, but once we can clarify those pieces will want to split this out as two separate variables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can create a test case where df and expected are different but that'll cause df to be larger.

I originally had expected and df as separate variable names but on @mroeschke 's advice (in a previous resolved convo) I replaced df with expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize over as_index=True/False

Copy link
Contributor Author

@Koustav-Samaddar Koustav-Samaddar Dec 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was parameterising as_index, but I hit a roadblock - I'm having difficulty recreating the expected value for as_index=True and I'm stumped how to go about fixing it.
Any help is greatly appreciated.

import pandas as pd
from pandas.util import testing as tm

expected = pd.DataFrame({
    'a': pd.Series([1, 1, 2], dtype='category'),
    'b': [1, 2, 2],
    'x': [1, 2, 3]
})

result = expected.groupby(['a', 'b'], 
	as_index=True,
	observed=True)['x'].sum()

print(result)

"""
a  b
1  1    1
   2    2
2  2    3
Name: x, dtype: int64
"""

s_idx = pd.MultiIndex(levels=[[1, 2], [1, 2]],
           codes=[[0, 0, 1], [0, 1, 1]],
           names=['a', 'b'])

s_val = [ 1, 2, 3 ]

expected = pd.Series(index=s_idx, data=s_val, name='x')
print(expected)

"""
a  b
1  1    1
   2    2
2  2    3
Name: x, dtype: int64
"""

tm.assert_series_equal(result, expected)

"""
Traceback (most recent call last):
  File "myTest.py", line 41, in <module>
    tm.assert_series_equal(result, expected)
  File "/home/vagrant/pandas/pandas/util/testing.py", line 1373, in assert_series_equal
    obj='{obj}.index'.format(obj=obj))
  File "/home/vagrant/pandas/pandas/util/testing.py", line 942, in assert_index_equal
    check_exact=check_exact, obj=lobj)
  File "/home/vagrant/pandas/pandas/util/testing.py", line 915, in assert_index_equal
    _check_types(left, right, obj=obj)
  File "/home/vagrant/pandas/pandas/util/testing.py", line 891, in _check_types
    assert_class_equal(l, r, exact=exact, obj=obj)
  File "/home/vagrant/pandas/pandas/util/testing.py", line 996, in assert_class_equal
    repr_class(right))
  File "/home/vagrant/pandas/pandas/util/testing.py", line 1188, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: MultiIndex level [0] are different

MultiIndex level [0] classes are not equivalent
[left]:  CategoricalIndex([1, 1, 2], categories=[1, 2], ordered=False, name='a', dtype='category')
[right]: Int64Index([1, 1, 2], dtype='int64', name='a')
"""

I understand that the issue is with not using CategoricalIndex at MultiIndex[0] but I have no idea how to go about doing it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use MultiIndex.from_arrays and pass a list of the Index already of the correct types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

'a': pd.Series([1, 1, 2], dtype='category'),
'b': [1, 2, 2], 'x': [1, 2, 3]})

result = expected.groupby(['a', 'b'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be reading this wrong but do we need the non-categorical column in the grouper to emulate this behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is because the bug generates a group that doesn't exist due to using some weird Cartesian product of the two columns used in the grouper.

Therefore, while the only valid groupings are (1, 1) (1, 2) (2, 2); the bug uses the CP of the two set(1, 2) * set(1, 2) which results in an additional non-existent (2, 1) grouping as well which is represented as NaN in the bugged output.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks for the responses. In that case I think you can improve the whatsnew to say that the observed keyword was essentially being ignored when selecting one column as part of the GroupBy. It’s currently too vague IMO

Copy link
Contributor Author

@Koustav-Samaddar Koustav-Samaddar Dec 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also used to be the case. But I changed it to be simpler on @mroeschke 's advice (in a previous resolved convo).

I can change it back if needed but would just like a confirmation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarity is always welcome. But generally whatsnew entries should refer to externally facing methods and implications.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea the previous entry was arguably too specific mentioning the exact internal method where the problem arose. The current entry is too vague to be useful.

If you can meet somewhere in the middle on that I think it again would be mentioning that the observed keyword was not being respected, but without mentioning the private internal methods where the implementation was actually changed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed it. Hopefully this is a better middle ground between the 2 previous approaches!

as_index=False, observed=True)['x'].sum()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is as_index required here? I see the comment in the original issue though based off of code correction and surprised if that's really the culprit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as_index is required to demonstrate that the bug exists. The code block that requires the correct value of observed is only enetered if as_index=False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To further clarify, no as_index a.k.a. as_index=True would show no difference since the code that is executed in this case does not use observed.


tm.assert_frame_equal(result, expected)