BUG: pivot table bug with Categorical indexes, #10993 #11371

jreback · 2015-10-19T11:39:15Z

So issue #10993 involves the insertion of a key into a multi-index that has as one of its levels a CategoricalIndex. This causes the semantics to break down because we are inserting an new key.

Existing

In [16]: df = DataFrame({'A' : [1,2], 'B' : [3,4] })

In [17]: df.columns = MultiIndex([pd.CategoricalIndex(list('ab')),[1,2]],[[0,1],[0,1]])

In [3]: df.columns.levels[0]
Out[3]: CategoricalIndex([u'a', u'b'], categories=[u'a', u'b'], ordered=False, dtype='category')

In [18]: df
Out[18]: 
   a  b
   1  2
0  1  3
1  2  4

In [19]: df.columns
Out[19]: 
MultiIndex(levels=[[u'a', u'b'], [1, 2]],
           labels=[[0, 1], [0, 1]])

In [20]: df[('c',3)] = 5
TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category

New

In [4]: df[('c',3)] = 5

In [5]: df
Out[5]: 
   a  b  c
   1  2  3
0  1  3  5
1  2  4  5

In [8]: df.columns.levels[0]
Out[8]: CategoricalIndex([u'a', u'b', u'c'], categories=[u'a', u'b', u'c'], ordered=False, dtype='category')

The only issue that was slightly controversial is that .insert will retain the ordered attribute (and new categories go to the end). while .append will always have ordered=False. In theory we could do the same, but .append is used to generally append another CategoricalIndex, so you would have some possiblity of interleaving of the 'ordered' categories (IOW, if self has [1,2,3] and other has [3,2,1,4], then the result will be [1,2,3,4].

We could raise if we have mixed ordering (e.g. self is ordered=False, other is ordered=True).

Of course the user is free to reorder and such, but the default should be intuitive.

Futher note that we can now concat pandas objects with CategoricalIndexes (I don't think was specified before, certainly not tested), e.g.

In [1]: df = DataFrame({'A' : np.arange(5)},index=pd.CategoricalIndex(list('aabbc')))

In [2]: df2 = DataFrame({'A' : np.arange(5)},index=pd.CategoricalIndex(list('bbcde')))

In [3]: pd.concat([df,df2])
Out[3]: 
   A
a  0
a  1
b  2
b  3
c  4
b  0
b  1
c  2
d  3
e  4

In [4]: pd.concat([df,df2]).index
Out[4]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'b', u'b', u'c', u'd', u'e'], categories=[u'a', u'b', u'c', u'd', u'e'], ordered=False, dtype='category')

TST: add test case from Issue pandas-dev#10989

jreback · 2015-10-19T11:40:02Z

cc @jorisvandenbossche @sinhrks @JanSchulz @jakevdp @shoyer

jankatins · 2015-10-19T13:53:08Z

My argument was always that Categorical should not gain any more public methods, as I think the "normal" case for working with lickert scales and such things is IMO satisfied.

Reading #10989 I think this is something which can't be solved without problems: converting to a object index will loose the order/... but keeping the categorical index but appending a value will also change the underlying data type: if I work with the underlying "data" (aka the Categorical()) I get wrong results because the the added All.

Given this two cases, I would opt for the first one, as it is similar to what happens the integer ones:

>>> data.pivot_table('x', 'y', 'z').index
Int64Index([0, 1], dtype='int64', name='y')
>>> data.pivot_table('x', 'y', 'z', margins=True).index
Index([0, 1, 'All'], dtype='object', name='y')

A third solution would be to add a check that margin does not work with categorical.

sinhrks · 2015-10-19T14:03:20Z

I think adding new category is useful, but have no good idea how ordered should work. Following shows the current branch does.

Appending ordered=True and orderd=True results in ordered=True, this looks OK.

c1 = pd.CategoricalIndex(list('abc'), ordered=True)
c2 = pd.CategoricalIndex(list('cde'), ordered=True)
c1.append(c2)
# CategoricalIndex([u'a', u'b', u'c', u'c', u'd', u'e'], categories=[u'a', u'b', u'c', u'd', u'e'], ordered=True, dtype='category')

However, there can be a case which category orders conflict. This case should be orderd=False?

c1 = pd.CategoricalIndex(list('abc'), ordered=True)
c2 = pd.CategoricalIndex(list('e'), categories=list('ecba'), ordered=True)
c1.append(c2)
# CategoricalIndex([u'a', u'b', u'c', u'e'], categories=[u'a', u'b', u'c', u'e'], ordered=True, dtype='category')

Appending ordered=True and orderd=False results in ordered=True. I think this should be ordered=False because we can't know second categories order.

c1 = pd.CategoricalIndex(list('abc'), ordered=True)
c2 = pd.CategoricalIndex(list('cde'), ordered=False)
c1.append(c2)
# CategoricalIndex([u'a', u'b', u'c', u'c', u'd', u'e'], categories=[u'a', u'b', u'c', u'd', u'e'], ordered=True, dtype='category')

jankatins · 2015-10-19T14:23:57Z

I think about each instance of a Category as the same as int or float: each defines a space of valid values. So Categorical(..., categories=[1,2,3]) is as different to Categorical(..., categories=[1,2,3,4]) as int is to float: int defines a range of valid values that include only all values form -maxint...maxint and it can't take things like 0.5. And it's ordered... So not unlike Category.

>>> a = pd.Index([1,2,3])
>>> b = pd.Index([1.,2.,3.])
>>> a.append(b)
Float64Index([1.0, 2.0, 3.0, 1.0, 2.0, 3.0], dtype='float64')
>>> c = pd.Index(list("abc"))
>>> a.append(c)
Index([1, 2, 3, 'a', 'b', 'c'], dtype='object')

-> It's converted to an index type, which can take both data types.

IMO one could argue that appending two categoricals should "upcast" to object, as that's at least deterministic: categories are labels for groups, so it shouldn't matter what I label them (e.g. what should happend in the above case if I relabel a Categorical which was [1,2,3] to ["Group 1", "Group 2", "Group 3"]). And that it's not clear what should happen to order is another indicator that it's best to do a upcast to float like it happened in the int+string case.

jreback · 2015-10-19T21:20:15Z

@JanSchulz not gaining any methods here at all. It doesn't make sense to .append/.insert directly on a Categorical. We are talking about a CategoricalIndex which has these operations naturally and work for values IN the categories already. The CI are actually more flexible by definition. These methods should work.

Now upcasting to object is a quite reasonable argument, though we currently allow index preservation when the new values ARE in the categories (IOW its still a CI). Further pandas in general just works, rarely raising these types of conversion errors (which is why this should do something logical and deterministic).

So we need to allow index insert/append otherwise CI are crippled compared to other indexes. Since we are creating new indexes this is simply like .add_categories then insert.

To make this somewhat simple we need a convention.

values in the categories -> CategoricalIndex, order is preserved
not in categories
- a) convert to object
- b) add to categories, insert preserves order, append sets ordered=False
- c) add to categories, insert & append preserve order
- d) like b) but convert to object either index is ordered

so appears @JanSchulz is for a)
I proposed b)

anyone else?

jorisvandenbossche · 2015-10-20T09:01:56Z

@jreback And there is also option e) raise an error (the current situation)

I am following @JanSchulz that we should not just change the categories of an object by appending or inserting. The categories are a fundamental part of the dtype, and IMHO should not be changed lightly by some operation (unless it is explicitely doing that like add_categories)

From that position, I would be for either convert to object or either raise an error (and 'fix the symptoms' in places where this leads to errors (@jakevdp 's approach for the pivot_table).
Given that the 'fix the symptoms' solution will also generally be to convert it to object in these special cases, the option to generally convert to object dtype seems more consistent (as well more consistent with eg ints and strings). However, the error message can also be useful in some cases to not unexpectedly end up with objects and loose the categories information.

jreback · 2015-10-20T12:36:58Z

I tried to make these auto-coerce to object when presented with a value not in the categories, but it doesn't quite work (e.g. makes indexing 'work' sometimes which is bad).

ok, so reverted to use a modified version of @jakevdp soln; had an embedded bug which is fixed as well.

thanks for the commentary @JanSchulz @jorisvandenbossche

…ed CategoricalIndexes Signed-off-by: Jeff Reback <[email protected]>

BUG: pivot table bug with Categorical indexes, #10993

jorisvandenbossche · 2015-10-20T20:31:00Z

@jreback What do you mean xith "doesn't quite work"?

jreback · 2015-10-20T20:40:41Z

oh, I tried the strategy of having .insert/.append on a CategoricalIndex convert to Index if the values are not in the categories. But I couldn't get it to work properly with nice semantics, IOW broke stuff in indexing)

jorisvandenbossche · 2015-10-20T20:47:02Z

ok!

BUG: quick fix for pandas-dev#10989

d998337

TST: add test case from Issue pandas-dev#10989

jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design Categorical Categorical Data Type labels Oct 19, 2015

jreback added this to the 0.17.1 milestone Oct 19, 2015

jreback mentioned this pull request Oct 19, 2015

BUG: pivot_table with margins=True fails for categorical dtype #10989

Closed

jreback added the Enhancement label Oct 19, 2015

jreback force-pushed the jakevdp-pivot-table-categorical branch from 603e99b to 76d7c9a Compare October 19, 2015 12:58

jreback force-pushed the jakevdp-pivot-table-categorical branch from 76d7c9a to f5442ba Compare October 20, 2015 12:35

jreback changed the title ~~BUG/API: CategoricalIndex.append/insert allow insertion of new categories~~ BUG: pivot table bug with Categorical indexes, #10993 Oct 20, 2015

jreback force-pushed the jakevdp-pivot-table-categorical branch from f5442ba to 97043da Compare October 20, 2015 12:47

API: add _to_safe_for_reshape to allow safe insert/append with embedd…

7ca878e

…ed CategoricalIndexes Signed-off-by: Jeff Reback <[email protected]>

jreback force-pushed the jakevdp-pivot-table-categorical branch from 97043da to 7ca878e Compare October 20, 2015 13:01

jreback added a commit that referenced this pull request Oct 20, 2015

Merge pull request #11371 from jreback/jakevdp-pivot-table-categorical

db884d9

BUG: pivot table bug with Categorical indexes, #10993

jreback merged commit db884d9 into pandas-dev:master Oct 20, 2015

mroeschke mentioned this pull request Dec 6, 2016

TST: Test pivot with categorical data #14807

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: pivot table bug with Categorical indexes, #10993 #11371

BUG: pivot table bug with Categorical indexes, #10993 #11371

jreback commented Oct 19, 2015

jreback commented Oct 19, 2015

jankatins commented Oct 19, 2015

sinhrks commented Oct 19, 2015

jankatins commented Oct 19, 2015

jreback commented Oct 19, 2015

jorisvandenbossche commented Oct 20, 2015

jreback commented Oct 20, 2015

jorisvandenbossche commented Oct 20, 2015

jreback commented Oct 20, 2015

jorisvandenbossche commented Oct 20, 2015

BUG: pivot table bug with Categorical indexes, #10993 #11371

BUG: pivot table bug with Categorical indexes, #10993 #11371

Conversation

jreback commented Oct 19, 2015

jreback commented Oct 19, 2015

jankatins commented Oct 19, 2015

sinhrks commented Oct 19, 2015

jankatins commented Oct 19, 2015

jreback commented Oct 19, 2015

jorisvandenbossche commented Oct 20, 2015

jreback commented Oct 20, 2015

jorisvandenbossche commented Oct 20, 2015

jreback commented Oct 20, 2015

jorisvandenbossche commented Oct 20, 2015