Description
Code Sample, a copy-pastable example if possible
test = pd.DataFrame(
{
'foo' : ['small', 'large', 'large', 'large', 'medium', 'large', 'large', 'medium'],
'bar' : ['C', 'A', 'A', 'C', 'A', 'C', 'A', 'C']
})
test['foo'] = test['foo'].astype('category').cat.set_categories(['tiny','small', 'medium', 'large'], ordered=True)
test.groupby(['bar', 'foo']).size().unstack()
# output
foo medium large small
bar
A 1.0 3.0 NaN
C 1.0 2.0 1.0
Problem description
I have a dataframe with an ordered category column foo
. I want to group by both columns then take the size of the groups and unstack to get a summary table. If all of the values in my ordered category are in the data, then the result is as expected:
test = pd.DataFrame(
{
'foo' : ['small', 'large', 'large', 'large', 'medium', 'large', 'large', 'medium'],
'bar' : ['C', 'A', 'A', 'C', 'A', 'C', 'A', 'C']
})
test['foo'] = test['foo'].astype('category').cat.set_categories(['small', 'medium', 'large'], ordered=True)
print(test.groupby(['bar', 'foo']).size().unstack())
# output
foo small medium large
bar
A NaN 1.0 3.0
C 1.0 1.0 2.0
My columns appear in the specified order. However, if for some reason I have categories that are listed but don't actually appear in the data (in this case, 'tiny'
) the order seems to be determined by the order that the categories appear in the series before stacking:
test = pd.DataFrame(
{
'foo' : ['small', 'large', 'large', 'large', 'medium', 'large', 'large', 'medium'],
'bar' : ['C', 'A', 'A', 'C', 'A', 'C', 'A', 'C']
})
test['foo'] = test['foo'].astype('category').cat.set_categories(['small', 'medium', 'large'], ordered=True)
print(test.groupby(['bar', 'foo']).size())
print(test.groupby(['bar', 'foo']).size().unstack())
# output
bar foo
A medium 1
large 3
C small 1
medium 1
large 2
dtype: int64
foo medium large small
bar
A 1.0 3.0 NaN
C 1.0 2.0 1.0
I originally encountered this when using pd.cut
to group rows into bins, but an explicitly ordered category I thought made a clearer example. It's also very easy to end up in this situation when filtering a large dataframe.
Expected Output
foo small medium large
bar
A NaN 1.0 3.0
C 1.0 1.0 2.0
Output of pd.show_versions()
[paste the output of pd.show_versions()
here below this line]
INSTALLED VERSIONS
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-64-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: 1.3.0
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.3
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None