Groupby on categoricals includes unused categories #17631

barbester · 2017-09-22T17:11:45Z

In a nutshell:

groupby on a single categorical column with prescribed categories incorrectly returns results for all categories, even those that are not actually present in the DataFrame.
In addition, when aggregating after grouping on categoricals (groupby.sum and the likes), with both prescribed and non-prescribed categories, we get values for all possible combinations of categories, including those not present in the DataFrame.

Problem description and code samples

Case 1: group by a single column

Consider the following code where we define a DataFrame with a categorical column with prescribed categories.

import pandas as pd

labels = pd.Series(list('abcbabcab'))
df = pd.DataFrame({'label1': labels,
                   'x': [0, 1, 1, 1, 1, 0, 1, 1, 1]})
df.label1 = labels.astype('category', categories=list('abcdef'))

When we group by the categorical label1 column and aggregate, we incorrectly get results for all prescribed categories, including those that are not present in the DataFrame (see rows below where the value of x is NaN):

In[6]: df.groupby('label1').sum()
Out[6]:
          x
label1     
a       2.0
b       3.0
c       2.0
d       NaN
e       NaN
f       NaN

The elements in excess are already present in the groupby object:

In[7]: df.groupby('label1').groups
Out[7]:
{'a': Int64Index([0, 4, 7], dtype='int64'),
 'b': Int64Index([1, 3, 5, 8], dtype='int64'),
 'c': Int64Index([2, 6], dtype='int64'),
 'd': Int64Index([], dtype='int64'),
 'e': Int64Index([], dtype='int64'),
 'f': Int64Index([], dtype='int64')}

The above doesn't happen if the categories for the label1 column are not prescribed or if the column is not converted to a categorical at all.

Case 2: group by multiple columns

Consider now the case where we have two categorical columns:

df = pd.DataFrame({'label1': labels,
                   'label2': labels,
                   'x': [0, 1, 1, 1, 1, 0, 1, 1, 1]})
df.label1 = labels.astype('category', categories=list('abcdef'))
df.label2 = labels.astype('category', categories=list('abcdef'))

Contrary to the single column case above, we do get the correct group labels when we group by both categorical columns label1 and label2:

In[11]: df.groupby(['label1', 'label2']).groups
Out[11]:
{('a', 'a'): Int64Index([0, 4, 7], dtype='int64'),
 ('b', 'b'): Int64Index([1, 3, 5, 8], dtype='int64'),
 ('c', 'c'): Int64Index([2, 6], dtype='int64')}

But we still get incorrect results if we aggregate:

In[12]: df.groupby(['label1', 'label2']).sum()
Out[12]:
                 x
label1 label2     
a      a       2.0
       b       NaN
       c       NaN
       d       NaN
       e       NaN
       f       NaN
b      a       NaN
       b       3.0
       c       NaN
       d       NaN
       e       NaN
       f       NaN
c      a       NaN
       b       NaN
       c       2.0
       d       NaN
       e       NaN
       f       NaN
d      a       NaN
       b       NaN
       c       NaN
       d       NaN
       e       NaN
       f       NaN
e      a       NaN
       b       NaN
       c       NaN
       d       NaN
       e       NaN
       f       NaN
f      a       NaN
       b       NaN
       c       NaN
       d       NaN
       e       NaN
       f       NaN

Note: The aggregation shows the same inccorect behaviour also when we don't prescribe the categories:

In[13]: df.label1 = labels.astype('category')
   ...: df.label2 = labels.astype('category')
In[14]: df.groupby(['label1', 'label2']).sum()
Out[14]:
                 x
label1 label2     
a      a       2.0
       b       NaN
       c       NaN
b      a       NaN
       b       3.0
       c       NaN
c      a       NaN
       b       NaN
       c       2.0

Expected Output

For Case 1:

In[6]: df.groupby('label1').sum()
Out[6]:
        x
label1   
a       2
b       3
c       2

In[7]: df.groupby('label1').groups
Out[7]:
{'a': Int64Index([0, 4, 7], dtype='int64'),
 'b': Int64Index([1, 3, 5, 8], dtype='int64'),
 'c': Int64Index([2, 6], dtype='int64')}

For Case 2:

In[19]: df.groupby(['label1', 'label2']).sum()
Out[19]:
               x
label1 label2   
a      a       2
b      b       3
c      c       2

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2017-09-22T20:14:23Z

duplicate of #17594, see expl there.

jreback closed this as completed Sep 22, 2017

jreback added Categorical Categorical Data Type Groupby Duplicate Report Duplicate issue or pull request labels Sep 22, 2017

jreback added this to the No action milestone Sep 22, 2017

KishManani mentioned this issue Jan 7, 2023

[ENH] Improve the handling of CategoricalIndex type when handling panel and hierarchical data. sktime/sktime#4062

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Groupby on categoricals includes unused categories #17631

Groupby on categoricals includes unused categories #17631

barbester commented Sep 22, 2017

INSTALLED VERSIONS

jreback commented Sep 22, 2017

Groupby on categoricals includes unused categories #17631

Groupby on categoricals includes unused categories #17631

Comments

barbester commented Sep 22, 2017

In a nutshell:

Problem description and code samples

Case 1: group by a single column

Case 2: group by multiple columns

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jreback commented Sep 22, 2017

Output of `pd.show_versions()`