Skip to content

Fix GroupBy nth Handling with Observed=False #26419

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Aug 20, 2019
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
20 commits
Select commit Hold shift + click to select a range
03ee26b
Added test coverage for observed=False with ops
WillAyd May 15, 2019
ee549ed
Fixed issue with observed=False and nth
WillAyd May 15, 2019
f0a510d
Stubbed whatsnew note
WillAyd May 15, 2019
f671204
Merge remote-tracking branch 'upstream/master' into nth-na-handling
WillAyd May 16, 2019
94dda01
Merge remote-tracking branch 'upstream/master' into nth-na-handling
WillAyd May 17, 2019
e59a991
lint fixup
WillAyd May 17, 2019
3677471
Simplified test
WillAyd May 19, 2019
34c2f06
Merge remote-tracking branch 'upstream/master' into nth-na-handling
WillAyd May 19, 2019
2ca34e3
whatsnew whitespace fix
WillAyd May 19, 2019
d3e5efa
Merge remote-tracking branch 'upstream/master' into nth-na-handling
WillAyd Jun 3, 2019
f9758b8
Merge remote-tracking branch 'upstream/master' into nth-na-handling
WillAyd Jun 4, 2019
ad729c5
Merge remote-tracking branch 'upstream/master' into nth-na-handling
WillAyd Jun 27, 2019
5b7b6bc
Merge remote-tracking branch 'upstream/master' into nth-na-handling
WillAyd Jul 15, 2019
aff7327
Merge remote-tracking branch 'upstream/master' into nth-na-handling
WillAyd Jul 15, 2019
47201fb
blackify
WillAyd Jul 15, 2019
56822cc
Merge remote-tracking branch 'upstream/master' into nth-na-handling
WillAyd Jul 15, 2019
1804e27
Removed doc whitespace
WillAyd Jul 15, 2019
4c2e413
Merge remote-tracking branch 'upstream/master' into nth-na-handling
WillAyd Jul 25, 2019
a837564
moved whatsnew to 0.25.1
WillAyd Jul 25, 2019
308e569
Merge remote-tracking branch 'upstream/master' into WillAyd-nth-na-ha…
TomAugspurger Aug 19, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -443,6 +443,7 @@ Groupby/Resample/Rolling
- Bug in :meth:`pandas.core.groupby.GroupBy.cumsum`, :meth:`pandas.core.groupby.GroupBy.cumprod`, :meth:`pandas.core.groupby.GroupBy.cummin` and :meth:`pandas.core.groupby.GroupBy.cummax` with categorical column having absent categories, would return incorrect result or segfault (:issue:`16771`)
- Bug in :meth:`pandas.core.groupby.GroupBy.nth` where NA values in the grouping would return incorrect results (:issue:`26011`)
- Bug in :meth:`pandas.core.groupby.SeriesGroupBy.transform` where transforming an empty group would raise error (:issue:`26208`)
- Bug in :meth:`pandas.core.groupby.GroupBy.nth` where ``observed=False`` was being ignored for Categorical groupers (:issue:`26385`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.25.1



Reshaping
Expand Down
10 changes: 8 additions & 2 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ class providing the base-class of operations.
from pandas.core.frame import DataFrame
from pandas.core.generic import NDFrame
from pandas.core.groupby import base
from pandas.core.index import Index, MultiIndex
from pandas.core.index import CategoricalIndex, Index, MultiIndex
from pandas.core.series import Series
from pandas.core.sorting import get_group_index_sorter

Expand Down Expand Up @@ -839,6 +839,7 @@ def _cython_transform(self, how, numeric_only=True, **kwargs):
def _cython_agg_general(self, how, alt=None, numeric_only=True,
min_count=-1):
output = {}

for name, obj in self._iterate_slices():
is_numeric = is_numeric_dtype(obj.dtype)
if numeric_only and not is_numeric:
Expand Down Expand Up @@ -1707,7 +1708,12 @@ def nth(self,
if not self.as_index:
return out

out.index = self.grouper.result_index[ids[mask]]
result_index = self.grouper.result_index
out.index = result_index[ids[mask]]

if not self.observed and isinstance(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to get this into _wrap_aggregated_output (which is what wraps most of the other methods) but that method doesn't appear to do any reindexing so would have to mess with that to make it work. Figured this as a one-off was easier though ideally would have everything consolidated...

Copy link
Contributor

@jreback jreback May 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you just need a small change I think in result_index. This is very similar to what we do with all_grouper.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC I think result_index is fine and handles properly; the problem here is specific to nth where the mask on the preceding lines essentially ignores the work that result_index does to properly handle NA categorical data, hence the reindex

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced here; this looks like just a bandaid to me and there is an underlying issue that needs fixing. also what about dropna=True?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropna='all' seems to have a separate set of issues. Here's behavior on master:

>>> grp.nth(0, dropna='all')
cat
a    1.0
b    2.0
c    3.0
Name: ser, dtype: float64

Which is not correct since b and c are never associated with those values.

Also trying to mix that with observed doesn't work:

>>> grp = df.groupby('cat', observed=True)['ser']
>>> grp.nth(0, dropna='all')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/williamayd/clones/pandas/pandas/core/groupby/groupby.py", line 1773, in nth
    result.index = self.grouper.result_index
  File "/Users/williamayd/clones/pandas/pandas/core/generic.py", line 5144, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 67, in pandas._libs.properties.AxisProperty.__set__
    obj._set_axis(self.axis, value)
  File "/Users/williamayd/clones/pandas/pandas/core/series.py", line 381, in _set_axis
    self._data.set_axis(axis, labels)
  File "/Users/williamayd/clones/pandas/pandas/core/internals/managers.py", line 155, in set_axis
    'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 3 elements, new values have 1 elements

So both weird though I'm not sure how we want to handle the combinations of observed and dropna; will open a separate issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #26454

result_index, CategoricalIndex):
out = out.reindex(result_index)

return out.sort_index() if self.sort else out

Expand Down
9 changes: 9 additions & 0 deletions pandas/tests/groupby/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -76,3 +76,12 @@ def three_group():
'D': np.random.randn(11),
'E': np.random.randn(11),
'F': np.random.randn(11)})


AGG_FUNCS = ['sum', 'prod', 'min', 'max', 'mean', 'median', 'var', 'first',
'last', 'nth'] # TODO: ohlc?


@pytest.fixture(params=AGG_FUNCS)
def agg_func(request):
return request.param
29 changes: 29 additions & 0 deletions pandas/tests/groupby/test_categorical.py
Original file line number Diff line number Diff line change
Expand Up @@ -433,6 +433,35 @@ def test_observed_groups_with_nan(observed):
tm.assert_dict_equal(result, expected)


def test_observed_ops(agg_func):
cat = pd.Categorical(['a', np.nan, np.nan], categories=['a', 'b', 'c'])
ser = pd.Series([1., 2., 3.])
df = pd.DataFrame({'cat': cat, 'ser': ser})

grp = df.groupby('cat', observed=False)['ser']
func = getattr(grp, agg_func)

if agg_func == 'nth': # Need an argument
result = func(0)
else:
result = func()

if agg_func == 'sum': # TODO: maybe a bug?
expected_vals = [1., 0., 0.]
elif agg_func == 'prod': # TODO: Definitely seems like a bug
expected_vals = [1., 1., 1.]
elif agg_func == 'var':
expected_vals = [np.nan, np.nan, np.nan]
else:
expected_vals = [1., np.nan, np.nan]

index = pd.Categorical(['a', 'b', 'c'], categories=['a', 'b', 'c'])
expected = pd.Series(expected_vals, index=index, name='ser')
expected.index.name = 'cat'

tm.assert_series_equal(result, expected)


def test_dataframe_categorical_with_nan(observed):
# GH 21151
s1 = pd.Categorical([np.nan, 'a', np.nan, 'a'],
Expand Down