Skip to content

BUG: pivot table bug with Categorical indexes, #10993 #11371

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 20, 2015

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Oct 19, 2015

closes #10993
replaces #10989

So issue #10993 involves the insertion of a key into a multi-index that has as one of its levels a CategoricalIndex. This causes the semantics to break down because we are inserting an new key.

Existing

In [16]: df = DataFrame({'A' : [1,2], 'B' : [3,4] })

In [17]: df.columns = MultiIndex([pd.CategoricalIndex(list('ab')),[1,2]],[[0,1],[0,1]])

In [3]: df.columns.levels[0]
Out[3]: CategoricalIndex([u'a', u'b'], categories=[u'a', u'b'], ordered=False, dtype='category')

In [18]: df
Out[18]: 
   a  b
   1  2
0  1  3
1  2  4

In [19]: df.columns
Out[19]: 
MultiIndex(levels=[[u'a', u'b'], [1, 2]],
           labels=[[0, 1], [0, 1]])

In [20]: df[('c',3)] = 5
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category

New

In [4]: df[('c',3)] = 5

In [5]: df
Out[5]: 
   a  b  c
   1  2  3
0  1  3  5
1  2  4  5

In [8]: df.columns.levels[0]
Out[8]: CategoricalIndex([u'a', u'b', u'c'], categories=[u'a', u'b', u'c'], ordered=False, dtype='category')

The only issue that was slightly controversial is that .insert will retain the ordered attribute (and new categories go to the end). while .append will always have ordered=False. In theory we could do the same, but .append is used to generally append another CategoricalIndex, so you would have some possiblity of interleaving of the 'ordered' categories (IOW, if self has [1,2,3] and other has [3,2,1,4], then the result will be [1,2,3,4].

We could raise if we have mixed ordering (e.g. self is ordered=False, other is ordered=True).

Of course the user is free to reorder and such, but the default should be intuitive.

Futher note that we can now concat pandas objects with CategoricalIndexes (I don't think was specified before, certainly not tested), e.g.

In [1]: df = DataFrame({'A' : np.arange(5)},index=pd.CategoricalIndex(list('aabbc')))

In [2]: df2 = DataFrame({'A' : np.arange(5)},index=pd.CategoricalIndex(list('bbcde')))

In [3]: pd.concat([df,df2])
Out[3]: 
   A
a  0
a  1
b  2
b  3
c  4
b  0
b  1
c  2
d  3
e  4

In [4]: pd.concat([df,df2]).index
Out[4]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'b', u'b', u'c', u'd', u'e'], categories=[u'a', u'b', u'c', u'd', u'e'], ordered=False, dtype='category')

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design Categorical Categorical Data Type labels Oct 19, 2015
@jreback jreback added this to the 0.17.1 milestone Oct 19, 2015
@jreback
Copy link
Contributor Author

jreback commented Oct 19, 2015

@jreback jreback force-pushed the jakevdp-pivot-table-categorical branch from 603e99b to 76d7c9a Compare October 19, 2015 12:58
@jankatins
Copy link
Contributor

My argument was always that Categorical should not gain any more public methods, as I think the "normal" case for working with lickert scales and such things is IMO satisfied.

Reading #10989 I think this is something which can't be solved without problems: converting to a object index will loose the order/... but keeping the categorical index but appending a value will also change the underlying data type: if I work with the underlying "data" (aka the Categorical()) I get wrong results because the the added All.

Given this two cases, I would opt for the first one, as it is similar to what happens the integer ones:

>>> data.pivot_table('x', 'y', 'z').index
Int64Index([0, 1], dtype='int64', name='y')
>>> data.pivot_table('x', 'y', 'z', margins=True).index
Index([0, 1, 'All'], dtype='object', name='y')

A third solution would be to add a check that margin does not work with categorical.

@sinhrks
Copy link
Member

sinhrks commented Oct 19, 2015

I think adding new category is useful, but have no good idea how ordered should work. Following shows the current branch does.

Appending ordered=True and orderd=True results in ordered=True, this looks OK.

c1 = pd.CategoricalIndex(list('abc'), ordered=True)
c2 = pd.CategoricalIndex(list('cde'), ordered=True)
c1.append(c2)
# CategoricalIndex([u'a', u'b', u'c', u'c', u'd', u'e'], categories=[u'a', u'b', u'c', u'd', u'e'], ordered=True, dtype='category')

However, there can be a case which category orders conflict. This case should be orderd=False?

c1 = pd.CategoricalIndex(list('abc'), ordered=True)
c2 = pd.CategoricalIndex(list('e'), categories=list('ecba'), ordered=True)
c1.append(c2)
# CategoricalIndex([u'a', u'b', u'c', u'e'], categories=[u'a', u'b', u'c', u'e'], ordered=True, dtype='category')

Appending ordered=True and orderd=False results in ordered=True. I think this should be ordered=False because we can't know second categories order.

c1 = pd.CategoricalIndex(list('abc'), ordered=True)
c2 = pd.CategoricalIndex(list('cde'), ordered=False)
c1.append(c2)
# CategoricalIndex([u'a', u'b', u'c', u'c', u'd', u'e'], categories=[u'a', u'b', u'c', u'd', u'e'], ordered=True, dtype='category')

@jankatins
Copy link
Contributor

I think about each instance of a Category as the same as int or float: each defines a space of valid values. So Categorical(..., categories=[1,2,3]) is as different to Categorical(..., categories=[1,2,3,4]) as int is to float: int defines a range of valid values that include only all values form -maxint...maxint and it can't take things like 0.5. And it's ordered... So not unlike Category.

>>> a = pd.Index([1,2,3])
>>> b = pd.Index([1.,2.,3.])
>>> a.append(b)
Float64Index([1.0, 2.0, 3.0, 1.0, 2.0, 3.0], dtype='float64')
>>> c = pd.Index(list("abc"))
>>> a.append(c)
Index([1, 2, 3, 'a', 'b', 'c'], dtype='object')

-> It's converted to an index type, which can take both data types.

IMO one could argue that appending two categoricals should "upcast" to object, as that's at least deterministic: categories are labels for groups, so it shouldn't matter what I label them (e.g. what should happend in the above case if I relabel a Categorical which was [1,2,3] to ["Group 1", "Group 2", "Group 3"]). And that it's not clear what should happen to order is another indicator that it's best to do a upcast to float like it happened in the int+string case.

@jreback
Copy link
Contributor Author

jreback commented Oct 19, 2015

@JanSchulz not gaining any methods here at all. It doesn't make sense to .append/.insert directly on a Categorical. We are talking about a CategoricalIndex which has these operations naturally and work for values IN the categories already. The CI are actually more flexible by definition. These methods should work.

Now upcasting to object is a quite reasonable argument, though we currently allow index preservation when the new values ARE in the categories (IOW its still a CI). Further pandas in general just works, rarely raising these types of conversion errors (which is why this should do something logical and deterministic).

So we need to allow index insert/append otherwise CI are crippled compared to other indexes. Since we are creating new indexes this is simply like .add_categories then insert.

To make this somewhat simple we need a convention.

  • values in the categories -> CategoricalIndex, order is preserved
  • not in categories
    • a) convert to object
    • b) add to categories, insert preserves order, append sets ordered=False
    • c) add to categories, insert & append preserve order
    • d) like b) but convert to object either index is ordered

so appears @JanSchulz is for a)
I proposed b)

anyone else?

@jorisvandenbossche
Copy link
Member

@jreback And there is also option e) raise an error (the current situation)

I am following @JanSchulz that we should not just change the categories of an object by appending or inserting. The categories are a fundamental part of the dtype, and IMHO should not be changed lightly by some operation (unless it is explicitely doing that like add_categories)

From that position, I would be for either convert to object or either raise an error (and 'fix the symptoms' in places where this leads to errors (@jakevdp 's approach for the pivot_table).
Given that the 'fix the symptoms' solution will also generally be to convert it to object in these special cases, the option to generally convert to object dtype seems more consistent (as well more consistent with eg ints and strings). However, the error message can also be useful in some cases to not unexpectedly end up with objects and loose the categories information.

@jreback jreback force-pushed the jakevdp-pivot-table-categorical branch from 76d7c9a to f5442ba Compare October 20, 2015 12:35
@jreback jreback changed the title BUG/API: CategoricalIndex.append/insert allow insertion of new categories BUG: pivot table bug with Categorical indexes, #10993 Oct 20, 2015
@jreback
Copy link
Contributor Author

jreback commented Oct 20, 2015

I tried to make these auto-coerce to object when presented with a value not in the categories, but it doesn't quite work (e.g. makes indexing 'work' sometimes which is bad).

ok, so reverted to use a modified version of @jakevdp soln; had an embedded bug which is fixed as well.

thanks for the commentary @JanSchulz @jorisvandenbossche

@jreback jreback force-pushed the jakevdp-pivot-table-categorical branch from f5442ba to 97043da Compare October 20, 2015 12:47
@jreback jreback force-pushed the jakevdp-pivot-table-categorical branch from 97043da to 7ca878e Compare October 20, 2015 13:01
jreback added a commit that referenced this pull request Oct 20, 2015
BUG: pivot table bug with Categorical indexes, #10993
@jreback jreback merged commit db884d9 into pandas-dev:master Oct 20, 2015
@jorisvandenbossche
Copy link
Member

@jreback What do you mean xith "doesn't quite work"?

@jreback
Copy link
Contributor Author

jreback commented Oct 20, 2015

oh, I tried the strategy of having .insert/.append on a CategoricalIndex convert to Index if the values are not in the categories. But I couldn't get it to work properly with nice semantics, IOW broke stuff in indexing)

@jorisvandenbossche
Copy link
Member

ok!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Bug Categorical Categorical Data Type Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants