
Series groupby does not include zero or nan counts for all categorical labels, unlike DataFrame groupby #17605


Closed
nmusolino opened this issue Sep 20, 2017 · 9 comments · Fixed by #29690

@nmusolino (Contributor) commented Sep 20, 2017

Steps to reproduce

In [1]: import pandas

In [2]: df = pandas.DataFrame({'type': pandas.Categorical(['AAA', 'AAA', 'B', 'C']),
   ...:                        'voltage': pandas.Series([1.5, 1.5, 1.5, 1.5]),
   ...:                        'treatment': pandas.Categorical(['T', 'C', 'T', 'C'])})

In [3]: df.groupby(['treatment', 'type']).count()
Out[3]:
                voltage
treatment type
C         AAA       1.0
          B         NaN
          C         1.0
T         AAA       1.0
          B         1.0
          C         NaN

In [4]: df.groupby(['treatment', 'type'])['voltage'].count()
Out[4]:
treatment  type
C          AAA     1
           C       1
T          AAA     1
           B       1
Name: voltage, dtype: int64

Problem description

When performing a groupby on categorical columns, categories with empty groups should be present in the output. That is, the multi-index of the object returned by count() should contain the Cartesian product of the labels of the first categorical column ("treatment" in the example above) and the second categorical column ("type") by which the grouping was performed.

The behavior in cell [3] above is correct. In cell [4], however, after obtaining a pandas.core.groupby.SeriesGroupBy object, the series returned by its count() method does not have entries for all levels of the "type" categorical.

Expected Output

The output from cell [4] should be equivalent to the following, with length 6, including entries for the index values (C, B) and (T, C).

In [5]: df.groupby(['treatment', 'type']).count().squeeze()
Out[5]:
treatment  type
C          AAA     1.0
           B       NaN
           C       1.0
T          AAA     1.0
           B       1.0
           C       NaN
Name: voltage, dtype: float64

Workaround

Perform column access after calling count():

In [7]: df.groupby(['treatment', 'type']).count()['voltage']
Out[7]:
treatment  type
C          AAA     1.0
           B       NaN
           C       1.0
T          AAA     1.0
           B       1.0
           C       NaN
Name: voltage, dtype: float64

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: 1.5.0
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.3
html5lib: 0.999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None
@nmusolino changed the title from "Series groupby does not included zero or nan counts for categoricals, unlike DataFrame groupby" to "Series groupby does not include zero or nan counts for all categorical labels, unlike DataFrame groupby" on Sep 20, 2017
@louispotok (Contributor) commented:

Worth noting that .size() also drops the zero counts for both Series and DataFrame groupby. I know size and count are different, but I would expect them to make the same indexing decision here.
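For reference, a minimal sketch of that claim, reusing the df from the report above (behaviour as reported at the time; later pandas versions with observed=False may differ):

# Both of these dropped zero-count category combinations at the time,
# unlike DataFrameGroupBy.count() above, which kept them as NaN:
df.groupby(['treatment', 'type']).size()             # DataFrame groupby
df.groupby(['treatment', 'type'])['voltage'].size()  # Series groupby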

@cottrell (Contributor) commented Jan 9, 2018

The desired output is the Series one. You definitely don't want the Cartesian product of your groupby columns. The DataFrame one is wrong, or at least I can't see why you would want that as the default behaviour: it will explode memory.

@cottrell (Contributor) commented Jan 9, 2018

And to be more constructive: I would imagine that in your use case you want some sort of reindex first, using something like this:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.MultiIndex.from_product.html
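For example, a sketch of that approach against the df from the original report (variable names here are illustrative; the point is that the Cartesian expansion becomes an explicit, opt-in step):

import pandas as pd

# Count only the observed groups first ...
counts = df.groupby(['treatment', 'type'])['voltage'].count()

# ... then expand to the full Cartesian product of category levels explicitly:
full_index = pd.MultiIndex.from_product(
    [df['treatment'].cat.categories, df['type'].cat.categories],
    names=['treatment', 'type'])

counts = counts.reindex(full_index)   # unobserved combinations become NaN
# counts = counts.fillna(0).astype(int) if zero counts are preferred over NaN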

@OliverHofkens commented:

This issue was created before the introduction of the observed keyword in groupby(), but it seems to me that with the default of observed=False, all categories should be present in the output, as @nmusolino expects.

At first sight this is true for most aggregations, but count() seems to be an exception:

import pandas as pd

pdf = pd.DataFrame({
    "category_1": pd.Categorical(list("AABBCC"), categories=list("ABCDEF")),
    "category_2": pd.Categorical(list("ABC") * 2, categories=list("ABCDEF")),
    "value": [0.1] * 6
})

pdf.groupby(["category_1", "category_2"])["value"].sum()  # All categories present
pdf.groupby(["category_1", "category_2"])["value"].mean()  # All categories present
pdf.groupby(["category_1", "category_2"])["value"].min()  # All categories present
pdf.groupby(["category_1", "category_2"])["value"].count()  # Only observed present!

So I do think this is a bug that's still present.
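A quick way to see the discrepancy (a sketch; the expected lengths follow from the 6 categories declared per column):

# Under the default observed=False, sum() returns the full 6 x 6 = 36-row
# Cartesian product of categories, while count() drops the unobserved ones:
print(len(pdf.groupby(["category_1", "category_2"])["value"].sum()))    # 36
print(len(pdf.groupby(["category_1", "category_2"])["value"].count()))  # 6 while the bug is present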

@cottrell (Contributor) commented Dec 4, 2019

A prominent warning should be issued in the what's new notes when this change gets pushed, as it will cause memory blow-ups for anyone not using the non-default observed=True on sparsely observed, high-arity categoricals.

@jreback (Contributor) commented Dec 4, 2019

@cottrell would welcome a memory benchmark in asv to see for sure
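A hypothetical sketch of what such an asv benchmark could look like (class and parameter names are illustrative, not taken from the pandas benchmark suite):

import numpy as np
import pandas as pd

class GroupByCategoricalObserved:
    # asv runs each peakmem_* method once per parameter value
    params = [True, False]
    param_names = ['observed']

    def setup(self, observed):
        n_rows, n_cats = 10_000, 100
        rng = np.random.default_rng(42)
        self.df = pd.DataFrame({
            c: pd.Categorical(rng.integers(0, n_cats, n_rows),
                              categories=range(n_cats))
            for c in ['a', 'b', 'c']
        })
        self.df['x'] = 1.0

    def peakmem_count(self, observed):
        # with observed=False the result has n_cats ** 3 = 1,000,000 rows
        self.df.groupby(['a', 'b', 'c'], observed=observed)['x'].count()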

@cottrell (Contributor) commented Dec 7, 2019

@jreback is there a way of doing that, though, without killing the test framework? I don't think it is really a test-worthy case ... I mean simply that if you have 20k rows indexed by three columns with arity 10k x 10k x 10k, you will get a cube ravelled to 1e12 rows with the default settings. Setting observed=True gives fewer than 20k rows.

The new default is fine; it's probably best that folks learn to turn off the Cartesian expansion. But it could hit people when they upgrade old code.

@jreback (Contributor) commented Dec 8, 2019

@cottrell you can show a similar effect in a much smaller df, e.g. I don't see why 10 x 10 x 10 wouldn't work.

@cottrell (Contributor) commented Dec 9, 2019

I don't think we are talking about the same thing.

A reasonable test to block this default change would have been any test that fails due to the explosion of dimensions when observed=False. Such a test would need to run and try to produce an array too large to compute, so if it were runnable with observed=False, it would have been an invalid test.

Now that the new default is in, there is nothing left to block, and this kind of test has no value in the current state. That is probably the right state anyway, since everything must now be explicit in high-arity cases.

Below is the example above run in the two cases, for reference:

[screenshot: output of the example run with observed=False and observed=True]
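(The screenshot itself was not preserved; a hedged reconstruction of the kind of comparison it showed, along the lines of the smaller df suggested above:)

import numpy as np
import pandas as pd

n_rows, n_cats = 20, 10   # stand-ins for the 20k-row, 10k-arity case
rng = np.random.default_rng(0)
df = pd.DataFrame({c: pd.Categorical(rng.integers(0, n_cats, n_rows),
                                     categories=range(n_cats))
                   for c in ['a', 'b', 'c']})
df['x'] = 1.0

len(df.groupby(['a', 'b', 'c'], observed=True)['x'].count())    # at most 20 rows
len(df.groupby(['a', 'b', 'c'], observed=False)['x'].count())   # 10 ** 3 = 1000 rows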
