-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
groupby aggregation on ordered Categorial with 'observed=True' breaks order #25871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Unless I am overlooking something this looks to be a duplicate of the issue you've linked to. As such closing this and let's stick to that one |
Fair enough. |
A good test should be able to parametrize all of those parameters - if you'd like to take a look at the other issue and post a PR that would certainly be welcome! |
Right, I will try. Paid work keeps intruding, though... |
I have digged further, and found that this issue is not a duplicate of #25167:The fix for #25167 does not fix this one completelyThe PR #25173 fixes #25167. So, I applied the two code changes from that PR to my local installation (files grouper.py and categorical.py in pandas/core/groupby/). df.groupby('cat', observed=True, sort=False)['val'].agg('sum') still returns
after installing PR #25173. If I modify it a bit, PR #25173 also fixes my problemThe relevant change from PR #25173 that fixes my if sort:
codes = np.sort(codes) If I extend the condition and also check for the grouper being ordered, the if sort or self.grouper.ordered:
codes = np.sort(codes) (This implies that grouping by an ordered Categorical and setting With this modification, my test case seems OK both for How to proceedIf my modification is OK (as in, breaks no tests and fits into Things), I propose to
|
Thanks, @WillAyd. Will you do the change, or is this expected from me? (Besides, as mentioned, Paid Work) |
If you could push a PR I can take a look on the review side. Be sure to check out the contributing guide if you have trouble and you an ask specific development questions on Gitter: https://pandas.pydata.org/pandas-docs/stable/development/contributing.html
Pandas is for all practical purposes maintained entirely by volunteers - any help you can add to that is certainly welcome! |
As documented in pandas-dev#25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified. Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order. The fix is a modified subset of pandas-dev#25173, which fixes a related case, but has not been merged yet.
* BUG: Fix groupby on ordered Categoricals (GH25871) As documented in pandas-dev#25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified. Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order. The fix is a modified subset of pandas-dev#25173, which fixes a related case, but has not been merged yet. * BUG: Fix groupby on ordered Categoricals (GH25871) * new test * Fix groupby on ordered Categoricals (GH25871) Testing all combinations of: - ordered vs. unordered grouping column - 'observed' True vs. False - 'sort' True vs. False In all cases, result group ordering must be correct. The test is built such that the result index labels are equal to aggregation results if all goes well (except for the one unobserved category)
* BUG: Fix groupby sorting on ordered Categoricals (GH25871) (#1) * BUG: Fix groupby on ordered Categoricals (GH25871) As documented in pandas-dev#25871, groupby() on an ordered Categorical messes up category order when 'observed=True' is specified. Specifically, group labels will be ordered by first occurrence (as for an unordered Categorical), but grouped aggregation results will retain the Categorical's order. The fix is a modified subset of pandas-dev#25173, which fixes a related case, but has not been merged yet. * BUG: Fix groupby on ordered Categoricals (GH25871) * new test * Fix groupby on ordered Categoricals (GH25871) Testing all combinations of: - ordered vs. unordered grouping column - 'observed' True vs. False - 'sort' True vs. False In all cases, result group ordering must be correct. The test is built such that the result index labels are equal to aggregation results if all goes well (except for the one unobserved category) * Revert "BUG: Fix groupby sorting on ordered Categoricals (GH25871) (#1)" This reverts commit b265349.
This test had an adjustment for column order when 'observed=True' is set. This hid the fact that, with that parameter set, the data columns were not actually reordered -- it was just the column group labels (analogous to index labels in pandas-dev#25871), leaving the data columns in place and out of sync. (This was not visible as the data consisted only of ones). I've made the test more sensitive (unsyncing of data columns will be caught now) and removed the special case for 'observed=True'. As there are no unobserved categories in this case, the result should not be influenced by this parameter.
Code Sample:
Including unobserved categories gives correct groups:
Excluding unobserved categories changes the order, groups are wrong:
Problem description:
The sample code shows that grouping with an ordered factor does not respect the factor's order when
'observed=True'
is set. Instead, group labels are in order of first occurrence in the Categorical, as if it were unordered. The aggregation results, however, are in the Categorical's order.Thus, the result is wrong.
Related issues: #25167
There, the Categorical was unordered, and the
sort=True
parameter did not work as expected in combination withobserved=True
. In my case,sort
makes no difference:both give the same, wrong result as shown above.
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: 0.11.3
IPython: 7.1.1
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None
The text was updated successfully, but these errors were encountered: