Skip to content

BUG: Fix groupby observed=True when aggregating a column #24412

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Dec 31, 2018
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1560,6 +1560,7 @@ Groupby/Resample/Rolling
- Bug in :meth:`pandas.core.groupby.GroupBy.rank` with ``method='dense'`` and ``pct=True`` when a group has only one member would raise a ``ZeroDivisionError`` (:issue:`23666`).
- Calling :meth:`pandas.core.groupby.GroupBy.rank` with empty groups and ``pct=True`` was raising a ``ZeroDivisionError`` (:issue:`22519`)
- Bug in :meth:`DataFrame.resample` when resampling ``NaT`` in ``TimeDeltaIndex`` (:issue:`13223`).
- Bug in :meth:`DataFrame.groupby` did not work correctly with ``observed=True`` when aggregating a specified column (:issue:`23970`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is when its a categorical column; also this is only for as_index=False, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any operation over a selected column was effectively ignoring the observed parameter that was being passed to it. With as_index=True the observed value isn't used so the code behaves as expected.


Reshaping
^^^^^^^^^
Expand Down
3 changes: 2 additions & 1 deletion pandas/core/groupby/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -1336,7 +1336,8 @@ def _gotitem(self, key, ndim, subset=None):
return DataFrameGroupBy(subset, self.grouper, selection=key,
grouper=self.grouper,
exclusions=self.exclusions,
as_index=self.as_index)
as_index=self.as_index,
observed=self.observed)
elif ndim == 1:
if subset is None:
subset = self.obj[key]
Expand Down
14 changes: 14 additions & 0 deletions pandas/tests/groupby/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -860,3 +860,17 @@ def test_groupby_multiindex_categorical_datetime():
expected = pd.DataFrame(
{'values': [0, 4, 8, 3, 4, 5, 6, np.nan, 2]}, index=idx)
assert_frame_equal(result, expected)


def test_groupby_agg_observed_true_single_column():
# GH-23970
expected = pd.DataFrame({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df and expected should be constructed as separate objects. I think generally this test case could be made more explicit depending on answer to other questions, but once we can clarify those pieces will want to split this out as two separate variables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can create a test case where df and expected are different but that'll cause df to be larger.

I originally had expected and df as separate variable names but on @mroeschke 's advice (in a previous resolved convo) I replaced df with expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize over as_index=True/False

Copy link
Contributor Author

@Koustav-Samaddar Koustav-Samaddar Dec 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was parameterising as_index, but I hit a roadblock - I'm having difficulty recreating the expected value for as_index=True and I'm stumped how to go about fixing it.
Any help is greatly appreciated.

import pandas as pd
from pandas.util import testing as tm

expected = pd.DataFrame({
    'a': pd.Series([1, 1, 2], dtype='category'),
    'b': [1, 2, 2],
    'x': [1, 2, 3]
})

result = expected.groupby(['a', 'b'], 
	as_index=True,
	observed=True)['x'].sum()

print(result)

"""
a  b
1  1    1
   2    2
2  2    3
Name: x, dtype: int64
"""

s_idx = pd.MultiIndex(levels=[[1, 2], [1, 2]],
           codes=[[0, 0, 1], [0, 1, 1]],
           names=['a', 'b'])

s_val = [ 1, 2, 3 ]

expected = pd.Series(index=s_idx, data=s_val, name='x')
print(expected)

"""
a  b
1  1    1
   2    2
2  2    3
Name: x, dtype: int64
"""

tm.assert_series_equal(result, expected)

"""
Traceback (most recent call last):
  File "myTest.py", line 41, in <module>
    tm.assert_series_equal(result, expected)
  File "/home/vagrant/pandas/pandas/util/testing.py", line 1373, in assert_series_equal
    obj='{obj}.index'.format(obj=obj))
  File "/home/vagrant/pandas/pandas/util/testing.py", line 942, in assert_index_equal
    check_exact=check_exact, obj=lobj)
  File "/home/vagrant/pandas/pandas/util/testing.py", line 915, in assert_index_equal
    _check_types(left, right, obj=obj)
  File "/home/vagrant/pandas/pandas/util/testing.py", line 891, in _check_types
    assert_class_equal(l, r, exact=exact, obj=obj)
  File "/home/vagrant/pandas/pandas/util/testing.py", line 996, in assert_class_equal
    repr_class(right))
  File "/home/vagrant/pandas/pandas/util/testing.py", line 1188, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: MultiIndex level [0] are different

MultiIndex level [0] classes are not equivalent
[left]:  CategoricalIndex([1, 1, 2], categories=[1, 2], ordered=False, name='a', dtype='category')
[right]: Int64Index([1, 1, 2], dtype='int64', name='a')
"""

I understand that the issue is with not using CategoricalIndex at MultiIndex[0] but I have no idea how to go about doing it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use MultiIndex.from_arrays and pass a list of the Index already of the correct types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

'a': pd.Series([1, 1, 2], dtype='category'),
'b': [1, 2, 2],
'x': [1, 2, 3]
})

result = expected.groupby(
['a', 'b'], as_index=False, observed=True)['x'].sum()

tm.assert_frame_equal(result, expected)