groupby aggregation on ordered Categorial with 'observed=True' breaks order #25871

kpflugshaupt · 2019-03-25T21:05:26Z

Code Sample:

import pandas as pd

# Create a DataFrame with an ordered categorical column, one category not present
df = pd.DataFrame(
            dict(cat = pd.Series([3, 1, 2, 1, 3, 2], 
                                 dtype=pd.CategoricalDtype(
                                              categories=[1, 2, 3, 4], 
                                              ordered=True)
                                ), 
                 val = pd.Series([1.5, 0.5, 1.0, 0.5, 1.5, 1.0])
            )
)

Including unobserved categories gives correct groups:

# Sum 'val' grouped by 'cat', including unobserved categories
df.groupby('cat', observed=False)['val'].agg('sum')

cat
1    1.0
2    2.0
3    3.0
4    0.0
Name: val, dtype: float64

Excluding unobserved categories changes the order, groups are wrong:

# Sum 'val' grouped by 'cat', excluding unobserved categories
df.groupby('cat', observed=True)['val'].agg('sum')

cat
3    1.0
1    2.0
2    3.0
Name: val, dtype: float64

Problem description:

The sample code shows that grouping with an ordered factor does not respect the factor's order when 'observed=True' is set. Instead, group labels are in order of first occurrence in the Categorical, as if it were unordered. The aggregation results, however, are in the Categorical's order.
Thus, the result is wrong.

Related issues: #25167
There, the Categorical was unordered, and the sort=True parameter did not work as expected in combination with observed=True. In my case, sort makes no difference:

df.groupby('cat', observed=True, sort=True)['val'].agg('sum')
df.groupby('cat', observed=True, sort=False)['val'].agg('sum')

both give the same, wrong result as shown above.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: 0.11.3
IPython: 7.1.1
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None

The text was updated successfully, but these errors were encountered:

WillAyd · 2019-03-26T04:08:20Z

Unless I am overlooking something this looks to be a duplicate of the issue you've linked to. As such closing this and let's stick to that one

kpflugshaupt · 2019-03-26T05:27:34Z

Fair enough.
The scenario is not identical, though (ordered Categorical vs. ‚sort=True‘).
Looking at the code in grouper.py, it certainly seems to stem from the same place, but the ‚ordered‘ attribute and the ‚sort‘ parameter are separate concepts there, as well.
Maybe make this another unit Test?

WillAyd · 2019-03-26T05:37:10Z

A good test should be able to parametrize all of those parameters - if you'd like to take a look at the other issue and post a PR that would certainly be welcome!

kpflugshaupt · 2019-03-26T07:15:06Z

Right, I will try. Paid work keeps intruding, though...

kpflugshaupt · 2019-03-26T11:59:49Z

I have digged further, and found that this issue is not a duplicate of #25167:

The fix for #25167 does not fix this one completely

The PR #25173 fixes #25167. So, I applied the two code changes from that PR to my local installation (files grouper.py and categorical.py in pandas/core/groupby/).
This fixes my test case, but only if the 'sort' parameter is set to True. If sort=False, the bug persists (same result as shown above):

df.groupby('cat', observed=True, sort=False)['val'].agg('sum')

still returns

cat
3    1.0
1    2.0
2    3.0
Name: val, dtype: float64

after installing PR #25173.

If I modify it a bit, PR #25173 also fixes my problem

The relevant change from PR #25173 that fixes my sort=True case are these two new lines in grouper.py (inserted after line 302):

                    if sort:
                        codes = np.sort(codes)

If I extend the condition and also check for the grouper being ordered, the sort=False case also works:

                    if sort or self.grouper.ordered:
                        codes = np.sort(codes)

(This implies that grouping by an ordered Categorical and setting sort=True should result in the same ordering of codes. Also, that grouping by an ordered Categorical always sorts.)

With this modification, my test case seems OK both for sort=Trueand sort=False.

How to proceed

If my modification is OK (as in, breaks no tests and fits into Things), I propose to

either reopen BUG: fix sorting with observed=True in groupby #25173 and add my modification
or reopen this issue, and fix it as shown above

kpflugshaupt · 2019-03-26T12:44:49Z

Thanks, @WillAyd. Will you do the change, or is this expected from me?
When I clone the repository, I somehow do not see the changed sources of PR #25173, even though it was closed 7 days ago. It's probably me doing something wrong, but it makes me wary of trying this myself.

(Besides, as mentioned, Paid Work)

WillAyd · 2019-03-26T13:44:07Z

If you could push a PR I can take a look on the review side. Be sure to check out the contributing guide if you have trouble and you an ask specific development questions on Gitter:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html

(Besides, as mentioned, Paid Work)

Pandas is for all practical purposes maintained entirely by volunteers - any help you can add to that is certainly welcome!

As documented in pandas-dev#25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified. Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order. The fix is a modified subset of pandas-dev#25173, which fixes a related case, but has not been merged yet.

* BUG: Fix groupby on ordered Categoricals (GH25871) As documented in pandas-dev#25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified. Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order. The fix is a modified subset of pandas-dev#25173, which fixes a related case, but has not been merged yet. * BUG: Fix groupby on ordered Categoricals (GH25871) * new test * Fix groupby on ordered Categoricals (GH25871) Testing all combinations of: - ordered vs. unordered grouping column - 'observed' True vs. False - 'sort' True vs. False In all cases, result group ordering must be correct. The test is built such that the result index labels are equal to aggregation results if all goes well (except for the one unobserved category)

* BUG: Fix groupby sorting on ordered Categoricals (GH25871) (#1) * BUG: Fix groupby on ordered Categoricals (GH25871) As documented in pandas-dev#25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified. Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order. The fix is a modified subset of pandas-dev#25173, which fixes a related case, but has not been merged yet. * BUG: Fix groupby on ordered Categoricals (GH25871) * new test * Fix groupby on ordered Categoricals (GH25871) Testing all combinations of: - ordered vs. unordered grouping column - 'observed' True vs. False - 'sort' True vs. False In all cases, result group ordering must be correct. The test is built such that the result index labels are equal to aggregation results if all goes well (except for the one unobserved category) * Revert "BUG: Fix groupby sorting on ordered Categoricals (GH25871) (#1)" This reverts commit b265349.

This test had an adjustment for column order when 'observed=True' is set. This hid the fact that, with that parameter set, the data columns were not actually reordered -- it was just the column group labels (analogous to index labels in pandas-dev#25871), leaving the data columns in place and out of sync. (This was not visible as the data consisted only of ones). I've made the test more sensitive (unsyncing of data columns will be caught now) and removed the special case for 'observed=True'. As there are no unobserved categories in this case, the result should not be influenced by this parameter.

WillAyd closed this as completed Mar 26, 2019

WillAyd added the Duplicate Report Duplicate issue or pull request label Mar 26, 2019

WillAyd reopened this Mar 26, 2019

kpflugshaupt mentioned this issue Mar 26, 2019

BUG: Fix groupby sorting on ordered Categoricals (GH25871) kpflugshaupt/pandas#1

Merged

4 tasks

kpflugshaupt closed this as completed in kpflugshaupt/pandas#1 Mar 28, 2019

kpflugshaupt mentioned this issue Mar 28, 2019

BUG: Fix groupby sorting on ordered Categoricals (GH25871) #25907

Closed

4 tasks

kpflugshaupt mentioned this issue Mar 28, 2019

BUG: Fix groupby sorting on ordered Categoricals (GH25871) #25908

Merged

4 tasks

jreback added this to the 0.25.0 milestone Apr 5, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

groupby aggregation on ordered Categorial with 'observed=True' breaks order #25871

groupby aggregation on ordered Categorial with 'observed=True' breaks order #25871

kpflugshaupt commented Mar 25, 2019

INSTALLED VERSIONS

WillAyd commented Mar 26, 2019

kpflugshaupt commented Mar 26, 2019

WillAyd commented Mar 26, 2019

kpflugshaupt commented Mar 26, 2019

kpflugshaupt commented Mar 26, 2019 •

edited

Loading

kpflugshaupt commented Mar 26, 2019

WillAyd commented Mar 26, 2019

groupby aggregation on ordered Categorial with 'observed=True' breaks order #25871

groupby aggregation on ordered Categorial with 'observed=True' breaks order #25871

Comments

kpflugshaupt commented Mar 25, 2019

Code Sample:

Problem description:

Output of pd.show_versions()

INSTALLED VERSIONS

WillAyd commented Mar 26, 2019

kpflugshaupt commented Mar 26, 2019

WillAyd commented Mar 26, 2019

kpflugshaupt commented Mar 26, 2019

kpflugshaupt commented Mar 26, 2019 • edited Loading

I have digged further, and found that this issue is not a duplicate of #25167:

The fix for #25167 does not fix this one completely

If I modify it a bit, PR #25173 also fixes my problem

How to proceed

kpflugshaupt commented Mar 26, 2019

WillAyd commented Mar 26, 2019

Output of `pd.show_versions()`

kpflugshaupt commented Mar 26, 2019 •

edited

Loading