groupby categorical column fails with unstack #11558

Closed
mikepqr opened this issue Nov 9, 2015 · 3 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves

@mikepqr

mikepqr commented Nov 9, 2015

Replicating example

In [1]: df = pd.DataFrame([[1,2],[3,4]],columns=pd.CategoricalIndex(list('AB')))

In [2]: df.describe()
AttributeError: 'DataFrame' object has no attribute 'value_counts'

The behaviour in this notebook seems like a bug to me. This is pandas 0.17.0.

In it, g and gcat are the results of two df.groupby(['medium', 'artist']).count().unstack() operations. The only difference is that one of those operations is on df where one of the columns that the groupby operates over has been converted to Categorical.

g and gcat behave very differently. I've tried to pin this down to the exact operation in the split-apply-combine that causes the problem without much luck.

Slicing a column out of g returns a Series as expected, while slicing a column out of gcat returns a DataFrame (see cells 4 and 5).

g.describe() works as expected, but gcat.describe() raises the exception

AttributeError: 'DataFrame' object has no attribute 'value_counts'

and g['painting'] + g['sculpture'] works as expected, but gcat['painting'] + gcat['sculpture'] raises

Exception: Data must be 1-dimensional
@jreback jreback added Bug Prio-medium Indexing Related to indexing on series/frames, not to indexes themselves labels Nov 9, 2015
@jreback jreback added this to the 0.17.1 milestone Nov 9, 2015
@jreback
Contributor

jreback commented Nov 9, 2015

This is actually a tricky bug: when indexing into a frame whose labels contain duplicates (or form a CategoricalIndex), you get a frame back from .iteritems even though the labels may be unique. So there are two paths here that need checking.
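
The duplicate-label path mentioned here can be seen in isolation with plain (non-categorical) columns. The sketch below uses made-up frames, `unique_cols` and `dup_cols`, purely for illustration: a label that matches one column yields a Series, while a label that matches several columns yields a DataFrame.

```python
import pandas as pd

unique_cols = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
dup_cols = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "A"])

single = unique_cols["A"]   # exactly one column is labelled "A"
several = dup_cols["A"]     # two columns are labelled "A"

print(type(single).__name__, single.shape)
print(type(several).__name__, several.shape)
```

The report in this issue is that a CategoricalIndex was sent down the second path even when every label was unique.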

@jreback jreback modified the milestones: Next Major Release, 0.17.1 Nov 13, 2015
@luispedro

This seems related to this incongruency I also ran into:

>>> data = pd.DataFrame([[1,2,3],[3,4,5]], index=['one', 'two'])
>>> print(data.ix['one'].shape)
(3,)
>>> data = pd.DataFrame([[1,2,3],[3,4,5]], index=pd.Categorical(['one', 'two']))
>>> print(data.ix['one'].shape)
(1, 3)

If this dataframe is coming from a groupby, then it's guaranteed to be uniquely indexed, so it's doubly inconsistent.

@jreback
Contributor

jreback commented Jan 6, 2016

This has to do with how we handle unique vs. non-unique indexes. A CategoricalIndex is treated as non-unique by definition (even though it's actually unique in this case).

But this might be a bug.

In [36]: data1 = pd.DataFrame([[1,2,3],[3,4,5]], index=['one', 'two'])

In [37]: data2 = pd.DataFrame([[1,2,3],[3,4,5]], index=pd.Categorical(['one', 'two']))

In [40]: data1.ix['one']
Out[40]: 
0    1
1    2
2    3
Name: one, dtype: int64

In [41]: data2.ix['one']
Out[41]: 
     0  1  2
one  1  2  3
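
Note that `.ix` was deprecated and later removed in pandas 1.0, so the comparison above no longer runs as written on current releases. A rough equivalent using `.loc` is sketched below. It assumes a pandas version that includes the fix referenced in this thread, in which case both lookups should give a one-dimensional row.

```python
import pandas as pd

data1 = pd.DataFrame([[1, 2, 3], [3, 4, 5]], index=["one", "two"])
data2 = pd.DataFrame([[1, 2, 3], [3, 4, 5]], index=pd.Categorical(["one", "two"]))

# Label-based row lookup; .loc stands in for the removed .ix here.
row1 = data1.loc["one"]
row2 = data2.loc["one"]

print(row1.shape, row2.shape)
```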

@jreback jreback modified the milestones: 0.18.1, Next Major Release Mar 12, 2016
jreback pushed a commit that referenced this issue Mar 15, 2016
related to #11558

Author: sinhrks

Closes #12531 from sinhrks/cat_get_loc and squashes the following commits:

2749b62 [sinhrks] BUG: CategoricalIndex.get_loc returns array even if it is unique
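
The behaviour named in that commit title can be checked directly on an index, without a DataFrame. This is a small sketch under the assumption that the installed pandas includes the change. The index contents are made up for illustration.

```python
import numpy as np
import pandas as pd

# All labels distinct: get_loc is expected to return a single integer position.
unique_idx = pd.CategoricalIndex(["one", "two"])
loc = unique_idx.get_loc("one")
print(type(loc).__name__, loc)

# A repeated label: get_loc returns something that covers several positions
# (a slice or a boolean mask, depending on the ordering of the index).
dup_idx = pd.CategoricalIndex(["one", "two", "one"])
dup_loc = dup_idx.get_loc("one")
print(type(dup_loc).__name__)
```

A scalar result for the unique case is what lets row and column lookups return a Series rather than a one-row or one-column DataFrame.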