Commit b265349

BUG: Fix groupby sorting on ordered Categoricals (GH25871) (#1)
* BUG: Fix groupby on ordered Categoricals (GH25871)

  As documented in pandas-dev#25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified. Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order.

  The fix is a modified subset of pandas-dev#25173, which fixes a related case, but has not been merged yet.

* BUG: Fix groupby on ordered Categoricals (GH25871)

* new test

* Fix groupby on ordered Categoricals (GH25871)

  Testing all combinations of:
  - ordered vs. unordered grouping column
  - 'observed' True vs. False
  - 'sort' True vs. False

  In all cases, result group ordering must be correct. The test is built such that the result index labels are equal to aggregation results if all goes well (except for the one unobserved category)
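The scenario described in the commit message can be sketched as follows. This is a minimal illustration rather than code from the commit: it reuses the data from the new test, and it assumes a pandas version that already contains the fix, so group labels and aggregated values should line up in category order.

```python
import pandas as pd

# Ordered Categorical with one unobserved category ('AWOL'),
# using the same data as the test added in this commit.
raw = ["d", "a", "b", "a", "d", "b"]
cat = pd.Categorical(raw, categories=["a", "b", "AWOL", "d"], ordered=True)
df = pd.DataFrame({"cat": cat, "val": raw})

# observed=True drops 'AWOL'; the default sort=True orders the groups.
result = df.groupby("cat", observed=True)["val"].agg("first")

labels = list(result.index)
values = list(result)
print(labels)
print(values)
```

Because the `val` column repeats the category labels, a mismatch between `labels` and `values` would indicate that group labels and aggregation results were ordered differently, which is the symptom reported in GH25871.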
1 parent 923ac2b commit b265349

3 files changed, +25 -1 lines changed

doc/source/whatsnew/v0.25.0.rst

+1 -1
@@ -347,7 +347,7 @@ Groupby/Resample/Rolling
 - Bug in :func:`pandas.core.groupby.GroupBy.agg` when applying a aggregation function to timezone aware data (:issue:`23683`)
 - Bug in :func:`pandas.core.groupby.GroupBy.first` and :func:`pandas.core.groupby.GroupBy.last` where timezone information would be dropped (:issue:`21603`)
 - Ensured that ordering of outputs in ``groupby`` aggregation functions is consistent across all versions of Python (:issue:`25692`)
-
+- Ensured that result group order is correct when grouping on an ordered Categorical and specifying ``observed=True`` (:issue:`25871`)

 Reshaping
 ^^^^^^^^^

pandas/core/groupby/grouper.py

+2
@@ -302,6 +302,8 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,
 if observed:
     codes = algorithms.unique1d(self.grouper.codes)
     codes = codes[codes != -1]
+    if sort or self.grouper.ordered:
+        codes = np.sort(codes)
 else:
     codes = np.arange(len(categories))
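The effect of the two added lines can be sketched outside pandas internals. The helper name `observed_codes` is hypothetical, and the public `pd.unique` stands in for the internal `algorithms.unique1d` on the assumption that both return values in order of first appearance. It is meant as an illustration of the code-selection step only, not the library's implementation.

```python
import numpy as np
import pandas as pd


def observed_codes(cat, sort):
    """Hypothetical helper mirroring the patched branch for observed=True."""
    # Unique category codes in order of first appearance.
    codes = pd.unique(cat.codes)
    # -1 marks missing values, which do not form a group.
    codes = codes[codes != -1]
    # The patch: restore category order when sorting is requested
    # or when the Categorical itself is ordered.
    if sort or cat.ordered:
        codes = np.sort(codes)
    return codes


raw = ["d", "a", None, "b", "a", "d", "b"]
cats = ["a", "b", "AWOL", "d"]
ordered_cat = pd.Categorical(raw, categories=cats, ordered=True)
unordered_cat = pd.Categorical(raw, categories=cats, ordered=False)

print(observed_codes(ordered_cat, sort=False))
print(observed_codes(unordered_cat, sort=False))
print(observed_codes(unordered_cat, sort=True))
```

In this sketch, only the unordered Categorical with `sort=False` keeps the first-appearance order; the other two cases fall back to category order.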

pandas/tests/groupby/test_categorical.py

+22
@@ -453,6 +453,28 @@ def test_dataframe_categorical_with_nan(observed):
     tm.assert_frame_equal(result, expected)


+@pytest.mark.parametrize("ordered", [True, False])
+@pytest.mark.parametrize("observed", [True, False])
+@pytest.mark.parametrize("sort", [True, False])
+def test_dataframe_categorical_ordered_observed_sort(ordered, observed, sort):
+    # GH 25871: Fix groupby sorting on ordered Categoricals
+    # Build a dataframe with a Categorical having one unobserved category ('AWOL'), and a Series with identical values
+    cat = pd.Categorical(['d', 'a', 'b', 'a', 'd', 'b'], categories=['a', 'b', 'AWOL', 'd'], ordered=ordered)
+    val = pd.Series (['d', 'a', 'b', 'a', 'd', 'b'])
+    df = pd.DataFrame({'cat': cat, 'val': val})
+
+    # aggregate on the Categorical
+    result = df.groupby('cat', observed=observed, sort=sort)['val'].agg('first')
+
+    # If ordering is correct, we expect index labels equal to aggregation results,
+    # except for 'observed=False', when index contains 'AWOL' and aggregation None
+    label = pd.Series(result.index.array, dtype='object')
+    aggr = pd.Series(result.array)
+    if not observed:
+        aggr[aggr.isna()] = 'AWOL'
+    tm.assert_equal(label, aggr)
+
+
 def test_datetime():
     # GH9049: ensure backward compatibility
     levels = pd.date_range('2014-01-01', periods=4)
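The check performed by the new test can also be tried without pytest. The sketch below is a standalone approximation, not the test itself: it loops over the same eight parameter combinations with `itertools.product`, compares plain Python lists instead of calling `tm.assert_equal`, and assumes a pandas version that includes the fix.

```python
from itertools import product

import pandas as pd

raw = ["d", "a", "b", "a", "d", "b"]
checked = 0

for ordered, observed, sort in product([True, False], repeat=3):
    cat = pd.Categorical(raw, categories=["a", "b", "AWOL", "d"], ordered=ordered)
    df = pd.DataFrame({"cat": cat, "val": raw})
    result = df.groupby("cat", observed=observed, sort=sort)["val"].agg("first")

    # Index labels should equal the aggregated values; the unobserved
    # category has no value, so its missing entry is mapped to 'AWOL'.
    labels = [str(label) for label in result.index]
    aggr = ["AWOL" if pd.isna(value) else value for value in result]
    assert labels == aggr, (ordered, observed, sort)
    checked += 1

print(checked)
```

Mapping the missing value to 'AWOL' plays the same role as the `aggr[aggr.isna()] = 'AWOL'` line in the test, so the comparison works for both `observed=True` and `observed=False`.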
