BUG: groupby nunique with Categorical and missing categories gives ValueError #11635

Closed
jorisvandenbossche opened this issue Nov 18, 2015 · 8 comments
Labels
Bug; Categorical (Categorical Data Type); Groupby; Regression (Functionality that used to work in a prior pandas version)

Comments

@jorisvandenbossche
Member

From SO: http://stackoverflow.com/questions/33775560/how-to-group-categorical-values-in-pandas

import pandas as pd

df = pd.DataFrame()
df['A'] = ['C1', 'C1', 'C2', 'C2', 'C3', 'C3']
df['B'] = [1,2,3,4,5,6]

df['A'] = df.loc[:,'A'].astype('category')
df2 = df[0:3]

result = df2.groupby(by='A')['B'].nunique()

This worked in 0.16.2, but is broken in 0.17.0 (it gives ValueError: Wrong number of items passed 2, placement implies 3)

It is related to the fact that not all categories are present in the actual values (due to the slicing).

Probably related to the new nunique implementation in 0.17.0 (#10894, #11079)
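To make the point about the slicing concrete, here is a minimal sketch (not part of the original report) showing that the slice keeps all three categories on its dtype even though only two of them occur in its values. The variable names `present` and `declared` are introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['C1', 'C1', 'C2', 'C2', 'C3', 'C3'],
                   'B': [1, 2, 3, 4, 5, 6]})
df['A'] = df['A'].astype('category')
df2 = df[0:3]

# Values that actually occur in the sliced column
present = df2['A'].unique().tolist()
# Categories still declared on the dtype after slicing
declared = df2['A'].cat.categories.tolist()

print(present)
print(declared)
```

The mismatch between the two lists (two observed groups, three declared categories) is the same 2-versus-3 count that appears in the error message.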

@jorisvandenbossche added the Bug, Groupby, and Regression (Functionality that used to work in a prior pandas version) labels Nov 18, 2015
@jorisvandenbossche added this to the 0.17.1 milestone Nov 18, 2015
@jorisvandenbossche added the Categorical (Categorical Data Type) label Nov 18, 2015
@jreback modified the milestones: Next Major Release, 0.17.1 Nov 18, 2015
@jorisvandenbossche
Member Author

cc @behzadnouri

@behzadnouri
Contributor

xref #10694

@jorisvandenbossche
Member Author

@behzadnouri ah, I already forgot that issue :-) Indeed similar, but not fully related I think. The difference is also that the other issue never worked, while this one worked in 0.16.2 and is a regression due to the new SeriesGroupBy.nunique implementation.

@behzadnouri
Contributor

the problem is that everything needs special casing for categoricals, otherwise it does not work, and this has caused a lot of code bloat. these should be fixed by fixing categoricals, not by patching here and there.

here the problem is that the resulting index from groupby on a categorical is not the same as from groupby on anything else, even though i do not see any use case for that:

https://github.com/pydata/pandas/blame/23ce9807a6990841f13a36087ae4d96d34315cdb/pandas/core/groupby.py#L3571

>>> from pandas import DataFrame, Categorical
>>> df = DataFrame({'B': [1, 2]})
>>> df['A'] = Categorical(list('XY'), categories=list('WXYZ'))
>>> df
   B  A
0  1  X
1  2  Y
>>> df.groupby('A').grouper.result_index  # why are W and Z in the index?
CategoricalIndex([u'W', u'X', u'Y', u'Z'], categories=[u'W', u'X', u'Y', u'Z'], ordered=False, name=u'A', dtype='category')
>>> df.groupby('A').sum()  # what was the gain here?
    B
A
W NaN
X   1
Y   2
Z NaN
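For readers on newer pandas, the behaviour questioned above can be toggled explicitly. The sketch below assumes a release that has the `observed` keyword on `groupby` (added in 0.23, so not available in the 0.17 line discussed here); `all_cats` and `seen_only` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'B': [1, 2]})
df['A'] = pd.Categorical(list('XY'), categories=list('WXYZ'))

# observed=False: one group per declared category, including unused W and Z
all_cats = df.groupby('A', observed=False)['B'].sum()
# observed=True: only categories that occur in the data
seen_only = df.groupby('A', observed=True)['B'].sum()

print(all_cats.index.tolist())
print(seen_only.index.tolist())
```

Only the index is compared here, because the value reported for an empty group differs between pandas versions.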

@jreback
Contributor

jreback commented Nov 18, 2015

@behzadnouri because that's the definition of Categorical. The point is to carry the categories around and not have them disappear just because they are not represented in a particular case.
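When carrying unused categories around is not wanted for a particular computation, one possible workaround is to drop them before grouping. The following is a sketch under that assumption, applied to the example from the top of this issue; it has not been checked against 0.17.0 itself, and `observed=False` assumes a pandas version that accepts that keyword.

```python
import pandas as pd

df = pd.DataFrame({'A': ['C1', 'C1', 'C2', 'C2', 'C3', 'C3'],
                   'B': [1, 2, 3, 4, 5, 6]})
df['A'] = df['A'].astype('category')

# assign() returns a new frame, so df keeps its full category list
df2 = df[0:3]
df2 = df2.assign(A=df2['A'].cat.remove_unused_categories())

# Even when asking for every declared category, C3 no longer appears,
# because it is no longer part of the dtype.
result = df2.groupby('A', observed=False)['B'].nunique()
print(result.to_dict())
```

The trade-off is the one raised in this thread: the category information is lost on `df2`, so this only suits cases where the empty groups are genuinely not needed.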

@toobaz
Member

toobaz commented Feb 22, 2016

I was about to file a new bug about this groupby() behaviour... I agree categories should be carried in the Categorical object, but in ordinary methods, such as groupby(), I think that having them behave differently from other types is unexpected and not particularly useful.

@toobaz
Member

toobaz commented Feb 22, 2016

Oh, OK, I'm not saying anything new: #8559

@jreback
Contributor

jreback commented Jul 20, 2017

covered / dupe by #8559

@jreback closed this as completed Jul 20, 2017
4 participants