Groupby inconsistency with categorical values #17032

aplavin · 2017-07-20T06:20:38Z

Code Sample

import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])

# this gives two rows with counts of one, as expected
df.iloc[:2].groupby('s').size()

df['s'] = df['s'].astype('category')
# this gives five rows, two of those having counts of one and others of zero
df.iloc[:2].groupby('s').size()

Problem description

I'm using categorical columns to save memory and possibly improve performance when there are lots of strings, and I would expect grouping by them to behave the same whether the string column is left as-is or converted to category. If keeping empty groups is the expected behavior for a categorical groupby, would it be possible to at least provide a boolean parameter to groupby, such as empty_groups? Or maybe an even simpler solution exists, but I couldn't find it.
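The behavior described above, and one way to discard the empty groups after the fact, can be sketched as follows. This is a minimal illustration, not a recommended fix. It assumes a pandas release that accepts the `observed` keyword on `groupby` (0.23 or later, so not the 0.20.2 reported in this issue); `observed=False` is passed explicitly only so that the sketch keeps the "all categories" result even where the default differs.

```python
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
df['s'] = df['s'].astype('category')

# Assumption: pandas >= 0.23. observed=False asks for one group per category,
# including the categories that do not occur in the two-row slice.
counts = df.iloc[:2].groupby('s', observed=False).size()

# Keep only the groups that actually occur in the slice.
nonempty = counts[counts > 0]

print(counts)
print(nonempty)
```

Filtering on the count only works for `size()` and similar reductions, and the empty groups are still computed before being thrown away.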

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.16-gentoo
machine: x86_64
processor: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

gfyoung · 2017-07-20T06:56:43Z

@aplavin : We've had this issue before with crosstab (xref #16367), and in that discussion, we seemed to be going with keeping all categories in the count. Thus, I would consider the output from your example expected in light of that discussion.

I'm a little wary about adding a parameter to .size() because it doesn't make much sense outside of the context of categorical. We want the interface to be as uniform as possible.

That being said, this method is a little different from crosstab, so perhaps workarounds (I can think of some but none feel appealing) might be feasible.

@jreback @jorisvandenbossche @TomAugspurger

aplavin · 2017-07-20T06:59:27Z

@gfyoung I'm not talking about adding a parameter to .size(), but to .groupby(), so that it applies to all subsequent operations. size was just an example.

As of now, the only way I see to keep only the nonempty groups (as in a non-categorical groupby) is something like df.groupby(...).apply(f).pipe(lambda d: d[~d.isnull().any(axis=1)]), and this would still call f for all the empty groups.

gfyoung · 2017-07-20T07:06:25Z

> I'm not talking about adding a parameter to .size(), but to .groupby()

I see, though outside of categorical, what would this parameter mean? I would still be wary of adding another parameter and functionality just because of how special-cased it is in the context of groupby.

toobaz · 2017-07-20T07:32:24Z

xref #8559 (comment)

Rather than changing some general purpose method(s) such as size() and groupby(), maybe we could decide to solve this with a handy drop_empty() method for categoricals?
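No `drop_empty()` method exists; the closest existing tool is probably the `.cat.remove_unused_categories()` accessor method, applied to the column before grouping. The sketch below assumes that is roughly what such a method would do. It also passes `observed=False` (an assumption: pandas 0.23 or later) so that the result depends only on the categories that remain.

```python
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
df['s'] = df['s'].astype('category')

# Work on a copy of the slice so the original column keeps all five categories.
subset = df.iloc[:2].copy()

# Drop the categories that no longer occur in the slice before grouping.
subset['s'] = subset['s'].cat.remove_unused_categories()

# Assumption: pandas >= 0.23 for the observed keyword.
counts = subset.groupby('s', observed=False).size()
print(counts)
```

This handles a single grouping column; it says nothing yet about combinations of several categorical columns.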

aplavin · 2017-07-20T07:43:08Z

@toobaz if I understand you correctly, then it probably won't work in general. Suppose there is a column a with values 1 and 2, and a column b with values 3 and 4, but the only occurring combinations are 1,3 and 2,4. Then if we group by both of them (as categorical), I would expect to get two groups (1,3 and 2,4), not all four. It was actually very surprising to see that simply changing the data type for performance reasons (in my case) significantly changed the behavior.
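The two-column case can be sketched as below. Every category of a and of b occurs at least once, so there is nothing unused to drop from either column, yet the grouping still covers all four combinations. The `observed` keyword used here is an assumption about the pandas version: it was added in a release later than the 0.20.2 reported in this issue (0.23), and is shown only to contrast the two results.

```python
import pandas as pd

# Only the combinations (1, 3) and (2, 4) occur in the data.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).astype('category')

# Assumption: pandas >= 0.23 for the observed keyword.
# observed=False groups over the full product of the categories of a and b.
full = df.groupby(['a', 'b'], observed=False).size()

# observed=True keeps only the combinations present in the data.
observed = df.groupby(['a', 'b'], observed=True).size()

print(full)
print(observed)
```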

gfyoung · 2017-07-20T08:11:16Z

In light of @toobaz's reference, I think this raises a larger question: how do we want to handle counting of Categorical groups that have no presence?

It might be worthwhile to create one umbrella issue that addresses this API standpoint and close the duplicates, as this issue seems to have crept up on several occasions.

@jreback : what do you think?

toobaz · 2017-07-20T08:18:32Z

> @toobaz if I understand you correctly, then it probably won't work in general

Yes, you understood me correctly... and yes, you're right...

jreback · 2017-07-20T12:32:17Z

See the extensive discussion in #8559.

closing as a duplicate.

gfyoung added the Categorical and Groupby labels Jul 20, 2017

jreback closed this as completed Jul 20, 2017

jreback added the Duplicate Report label Jul 20, 2017

gfyoung added this to the No action milestone Jul 20, 2017
