BUG: sort=False ignored when grouping with a categorical column #8868

Closed
aimboden opened this issue Nov 21, 2014 · 1 comment · Fixed by #9480
Labels: API Design, Categorical, Groupby, Interval

Comments

@aimboden

Hello everyone,

I stumbled upon the following behavior of groupby with a categorical column, which seems at least inconsistent with the way groupby usually operates.

When grouping on a string type column with sort=False, the order of the groups is the order in which the keys first appear in the column.

However, when grouping on a categorical column, the groups always seem to be ordered by the categories, even when sort=False.

import numpy as np
import pandas as pd

d = {'foo': [10, 8, 5, 6, 4, 1, 7], 'bar': [10, 20, 30, 40, 50, 60, 70],
     'baz': ['d', 'c', 'e', 'a', 'a', 'd', 'c']}
df = pd.DataFrame(d)
# Bin 'foo' into four intervals; pd.cut returns a categorical
cat = pd.cut(df['foo'], np.linspace(0, 10, 5))
df['range'] = cat
groups = df.groupby('range', sort=True)
# Expected behaviour
result = groups.agg('mean')

# Why are the categories still sorted in this case?
groups2 = df.groupby('range', sort=False)
result2 = groups2.agg('mean')

# I would expect an output like this one: keep the order in which the groups
# are first encountered
groups3 = df.groupby('baz', sort=False)
result3 = groups3.agg('mean')
result
           bar  foo
range
(0, 2.5]    60  1.0
(2.5, 5]    40  4.5
(5, 7.5]    55  6.5
(7.5, 10]   15  9.0
result2
           bar  foo
range
(0, 2.5]    60  1.0
(2.5, 5]    40  4.5
(5, 7.5]    55  6.5
(7.5, 10]   15  9.0
result3
     bar  foo
baz
d     35  5.5
c     45  7.5
e     30  5.0
a     45  5.0
pd.__version__
Out[110]: '0.15.1'

Setting as_index=False does not change the behavior shown above.
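
For anyone who needs first-appearance order in the meantime, one possible workaround is to group on the string form of the bins instead of the categorical itself, since grouping on a string key with sort=False keeps the order in which keys first appear. This is only a sketch under that assumption, not an official recommendation; the names key and result_unsorted are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [10, 8, 5, 6, 4, 1, 7],
                   'bar': [10, 20, 30, 40, 50, 60, 70]})
df['range'] = pd.cut(df['foo'], np.linspace(0, 10, 5))

# Hypothetical workaround: convert the bins to plain strings so that
# groupby treats them like any other string key.
key = df['range'].astype(str)

# With a string key, sort=False keeps groups in first-appearance order.
result_unsorted = df.groupby(key, sort=False)[['foo', 'bar']].mean()
print(result_unsorted)
```

The cost is that the index of the result holds strings rather than categories, so the bin ordering is lost and has to be restored by hand if it is needed later.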

@jreback
Contributor

jreback commented Nov 21, 2014

Currently this will only work naively; that is, the intervals that are returned are strings, e.g. (7.5, 10]. In 0.16 there is work being done on an Interval/IntervalIndex, which will allow this to actually be sorted in a certain order. See #8707.
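
As a rough way to check which situation a given installation is in, the snippet below inspects the categories that pd.cut produces. It assumes a pandas release in which Interval/IntervalIndex is available (to my knowledge 0.20 or later); on 0.15.x the categories are plain strings and the IntervalIndex check would not apply.

```python
import numpy as np
import pandas as pd

foo = pd.Series([10, 8, 5, 6, 4, 1, 7])
cat = pd.cut(foo, np.linspace(0, 10, 5))

# On releases with Interval support, the categories are real interval
# objects held in an IntervalIndex rather than strings.
categories = cat.cat.categories
print(type(categories).__name__)

# pd.cut returns an ordered categorical, so the bins carry a defined order.
print(cat.cat.ordered)

# Left edges of the bins, in category order.
left_edges = categories.left.tolist()
print(left_edges)
```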

As always, pull requests are welcome to work on these issues (though reporting them makes good test cases too!).
