Groupby inconsistency with categorical values #17032

Closed
aplavin opened this issue Jul 20, 2017 · 8 comments
Labels
Categorical, Duplicate Report, Groupby

Comments

@aplavin

aplavin commented Jul 20, 2017

Code Sample

import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])

# this gives two rows with counts of one, as expected
df.iloc[:2].groupby('s').size()

df['s'] = df['s'].astype('category')
# this gives five rows: two with a count of one and three with a count of zero
df.iloc[:2].groupby('s').size()

Problem description

I'm using categorical columns to save memory and possibly improve performance when having lots of strings, and I would expect grouping by them to behave the same whether the string column is left as-is or converted to category. If keeping empty groups is the expected behavior for categorical groupby, would it be possible to at least provide a boolean parameter to groupby, such as empty_groups? Maybe an even simpler solution exists, but I couldn't find it.
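For readers on newer pandas: to my knowledge, releases from 0.23 onward accept an `observed` keyword on `groupby` that controls exactly this. It is not available in the 0.20.2 install listed in this report, so the following is a sketch for later versions, not a fix for the setup described here.

```python
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
df['s'] = df['s'].astype('category')
head = df.iloc[:2]  # the slice keeps all five declared categories

# observed=False keeps every declared category, including the empty ones
all_groups = head.groupby('s', observed=False).size()

# observed=True keeps only the categories that actually occur in `head`
seen_groups = head.groupby('s', observed=True).size()

print(all_groups.tolist())   # [1, 1, 0, 0, 0]
print(seen_groups.tolist())  # [1, 1]
```

Passing the keyword explicitly is deliberate here, since the default value has differed between pandas releases.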

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.16-gentoo
machine: x86_64
processor: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.2
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung added the Categorical and Groupby labels Jul 20, 2017
@gfyoung
Member

gfyoung commented Jul 20, 2017

@aplavin : We've had this issue before with crosstab (xref #16367), and in that discussion, we seemed to be going with keeping all categories in the count. Thus, I would consider the output from your example expected in light of that discussion.

I'm a little wary about adding a parameter to .size() because it doesn't make much sense outside of the context of categorical. We want the interface to be as uniform as possible.

That being said, this method is a little different from crosstab, so perhaps workarounds (I can think of some but none feel appealing) might be feasible.

@jreback @jorisvandenbossche @TomAugspurger

@aplavin
Author

aplavin commented Jul 20, 2017

@gfyoung I'm not talking about adding a parameter to .size(), but to .groupby(), so that it applies to all further operations. size was just an example.

As of now, the only way I see to keep only the nonempty groups (as in a non-categorical groupby) is something like df.groupby(...).apply(f).pipe(lambda d: d[~d.isnull().any(axis=1)]), and this would still call f for all the empty groups.
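A minimal sketch of one single-column workaround, assuming it is acceptable to copy the subset and rewrite its categories: calling `Series.cat.remove_unused_categories()` before grouping means the empty groups never exist, so nothing is computed for them. The `subset` name is only for illustration.

```python
import pandas as pd

df = pd.DataFrame([{'i': i, 's': str(i)} for i in range(5)])
df['s'] = df['s'].astype('category')

# Copy the slice so that reassigning the column does not touch `df`
subset = df.iloc[:2].copy()

# Drop the categories that no longer occur in the slice
subset['s'] = subset['s'].cat.remove_unused_categories()

# Only the two categories present in `subset` can form groups now
result = subset.groupby('s')['i'].sum()
print(result.index.tolist())  # ['0', '1']
```

This only helps when grouping by a single categorical column; with several categorical keys, the unused combinations of categories that are individually present remain.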

@gfyoung
Member

gfyoung commented Jul 20, 2017

> I'm not talking about adding a parameter to .size(), but to .groupby()

I see, though outside of categorical, what would this parameter mean? I would still be wary of adding another parameter and functionality just because of how special-cased it is in the context of groupby.

@toobaz
Member

toobaz commented Jul 20, 2017

xref #8559 (comment)

Rather than changing some general purpose method(s) such as size() and groupby(), maybe we could decide to solve this with a handy drop_empty() method for categoricals?

@aplavin
Author

aplavin commented Jul 20, 2017

@toobaz if I understand you correctly, then it probably won't work in general. Suppose there is a column a with values 1 and 2, and a column b with values 3 and 4, but the only occurring combinations are 1,3 and 2,4. Then if we group by both of them (as categorical), I would expect to have two groups (1,3 and 2,4), and not all four. Actually, it was very surprising to see that simply changing the data type for performance reasons (in my case) significantly changed the behavior.
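The two-column case can be sketched as follows. This assumes a pandas version that has the `observed` keyword on `groupby` (0.23 or newer, as far as I know), which is passed explicitly so both behaviors are visible side by side.

```python
import pandas as pd

# Only the combinations (1, 3) and (2, 4) occur in the data
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df['a'] = df['a'].astype('category')
df['b'] = df['b'].astype('category')

# observed=False: one group per pair in the product of the categories
full = df.groupby(['a', 'b'], observed=False).size()

# observed=True: only the pairs that actually appear together
seen = df.groupby(['a', 'b'], observed=True).size()

print(len(full))            # 4
print(seen.index.tolist())  # [(1, 3), (2, 4)]
```

Every category of `a` and of `b` is in use here, so removing unused categories per column would not shrink the result; only the pairing is unobserved.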

@gfyoung
Member

gfyoung commented Jul 20, 2017

In light of @toobaz's reference, I think this brings up a larger question: how do we want to handle counting of Categorical groups that have no presence?

It might be worthwhile to create one umbrella issue that addresses this from an API standpoint and close the duplicates, as this issue seems to have cropped up on several occasions.

@jreback : what do you think?

@toobaz
Member

toobaz commented Jul 20, 2017

> @toobaz if I understand you correctly, then it probably won't work in general

Yes, you understood me correctly... and yes, you're right...

@jreback
Contributor

jreback commented Jul 20, 2017

See the extensive discussion in #8559.

Closing as a duplicate.

@jreback closed this as completed Jul 20, 2017
@jreback added the Duplicate Report label Jul 20, 2017
@gfyoung added this to the No action milestone Jul 20, 2017

4 participants