groupby transform misbehaving with categoricals #9921

dsm054 · 2015-04-17T15:24:21Z

Starting from this SO question, we can see that something is odd when doing a groupby using Series of category dtype. (Today's trunk, 0.16.0-163-gf9f88b2.)

>>> df = pd.DataFrame({"a": [5,15,25]})
>>> c = pd.cut(df.a, bins=[0,10,20,30,40])
>>> c
0     (0, 10]
1    (10, 20]
2    (20, 30]
Name: a, dtype: category
Categories (4, object): [(0, 10] < (10, 20] < (20, 30] < (30, 40]]
>>> df.groupby(c).sum()
           a
a           
(0, 10]    5
(10, 20]  15
(20, 30]  25
(30, 40] NaN
>>> df.groupby(c).transform(sum)
Traceback (most recent call last):
  File "<ipython-input-81-1eb61986eff3>", line 1, in <module>
    df.groupby(c).transform(sum)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_163_gf9f88b2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 3017, in transform
    indexer = indices[name]
KeyError: '(30, 40]'
>>> df.a.groupby(c).transform(max)
Traceback (most recent call last):
  File "<ipython-input-87-9759e8c1e070>", line 1, in <module>
    df.a.groupby(c).transform(max)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_163_gf9f88b2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 2422, in transform
    return self._transform_fast(cyfunc)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_163_gf9f88b2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 2461, in _transform_fast
    values = np.repeat(values, com._ensure_platform_int(counts))
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 393, in repeat
    return repeat(repeats, axis)
ValueError: count < 0

It looks like we're somehow using the categories themselves, not the values. Not sure if this is a consequence of the special-casing of grouping on categories.

ISTM if we have a Series with categorical dtype passed to groupby, we should group on the values and not return the missing elements; if you want the missing elements, you should pass the Categorical object itself. Admittedly I haven't thought through all the consequences, but I was pretty surprised when I read the docs and saw this was the intended behaviour.
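For readers on later pandas releases: `groupby` gained an `observed` keyword (in 0.23, well after this issue was filed) that chooses between the two behaviours discussed here. The snippet below is a minimal sketch of that choice, not part of the original report; the key is renamed to `bin` only so that it does not share a name with column `a`.

```python
import pandas as pd

df = pd.DataFrame({"a": [5, 15, 25]})
# Rename the key so it does not collide with the column "a" (illustrative choice).
key = pd.cut(df["a"], bins=[0, 10, 20, 30, 40]).rename("bin")

# observed=False keeps every declared category, including the empty (30, 40].
all_groups = df.groupby(key, observed=False).sum()

# observed=True keeps only the categories that actually occur in the data.
seen_groups = df.groupby(key, observed=True).sum()

print(len(all_groups), len(seen_groups))  # 4 3
```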

jreback · 2015-04-17T15:33:55Z

@dsm054 no, this is the defined behavior: a Categorical grouper by definition groups on the specified groups.

pd.cut returns a Categorical.

dsm054 · 2015-04-17T15:35:54Z

pd.cut does not always return a Categorical:

>>> df = pd.DataFrame({"a": [5,15,25]})
>>> c = pd.cut(df.a, bins=[0,10,20,30,40])
>>> c
0     (0, 10]
1    (10, 20]
2    (20, 30]
Name: a, dtype: category
Categories (4, object): [(0, 10] < (10, 20] < (20, 30] < (30, 40]]
>>> type(c)
<class 'pandas.core.series.Series'>

jreback · 2015-04-17T15:39:33Z

It's a categorical Series, which is the same thing.

dsm054 · 2015-04-17T15:45:22Z

? I must be missing something.

A Series with categorical dtype contains the information about what category each index corresponds to. This is exactly the information we need to perform a groupby.

A Categorical itself does not contain this information, e.g.

>>> pd.Categorical(['(0, 10]', '(10, 20]', '(20, 30]', '(30, 40]'], ordered=True)
[(0, 10], (10, 20], (20, 30], (30, 40]]
Categories (4, object): [(0, 10] < (10, 20] < (20, 30] < (30, 40]]

It looks to me like this bug is due exactly to thinking we should be working with the Categorical side of things, when we really need the Series side. That's why the transform is getting confused, because '(30, 40]' doesn't correspond to anything in the data itself.
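The failure mode can be sketched in plain Python. This is only an illustration of the mismatch between declared categories and observed labels, not pandas' actual implementation; `indices`, `rows_for_strict`, and `rows_for_tolerant` are hypothetical names.

```python
# Declared categories versus the labels that actually occur, one per row.
categories = ["(0, 10]", "(10, 20]", "(20, 30]", "(30, 40]"]
labels = ["(0, 10]", "(10, 20]", "(20, 30]"]

# Map each label that occurs in the data to its row positions.
indices = {}
for position, label in enumerate(labels):
    indices.setdefault(label, []).append(position)

def rows_for_strict(name):
    # Plain dict lookup: raises KeyError for a category with no rows.
    return indices[name]

def rows_for_tolerant(name):
    # Treats a category with no rows as an empty group instead.
    return indices.get(name, [])

# Iterating over the categories rather than the observed labels hits this one.
missing = [name for name in categories if name not in indices]
print(missing)  # ['(30, 40]']
```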

jreback · 2015-04-17T15:48:32Z

A categorical Series just holds a Categorical. These are exactly the same, except you have an attached index. You actually just need to group on the Categorical, e.g. something like

if is_categorical_dtype(obj):
    if isinstance(obj, Series):
        # unwrap the Series to get the underlying Categorical
        obj = obj.values

    # ... we have a Categorical at this point
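A small sketch of the relationship described above, assuming only public pandas attributes: `.values` on a categorical Series returns the underlying `Categorical`, which carries both the per-row integer codes and the full list of declared categories.

```python
import pandas as pd

s = pd.Series([5, 15, 25])
c = pd.cut(s, bins=[0, 10, 20, 30, 40])

cat = c.values                         # the underlying pd.Categorical
codes = [int(x) for x in cat.codes]    # one integer code per row
n_categories = len(cat.categories)     # declared categories, observed or not

print(type(cat).__name__, codes, n_categories)  # Categorical [0, 1, 2] 4
```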

dsm054 · 2015-04-17T16:19:32Z

Wow, this is some crazy terminology, and now I understand why I couldn't make sense of anything you were saying!

You're right: we use the term Categorical both for an object specifically encoding the type (the set of possible values and the ordering among them, the dtype=category part described in "Categories:") and also for the vector result of applying what I think of as a categorical encoding to an ordinary type.

My preferred behaviour would have been that if you pass a Categorical, you get the missing values; if you pass a Series, you don't. But even if that ship has sailed, .values won't cut it, because the same issue happens when you group on a Categorical.

It looks to me like either .get_values() or .values.codes should work, because they access what I thought of as the values. Does that make sense?

>>> df = pd.DataFrame({"a": [5,15,25,15,5]})
>>> c = pd.cut(df.a, bins=[0,10,20,30, 40])
>>> df.groupby(c).transform(sum)
Traceback (most recent call last):
  File "<ipython-input-118-1eb61986eff3>", line 1, in <module>
    df.groupby(c).transform(sum)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_163_gf9f88b2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 3017, in transform
    indexer = indices[name]
KeyError: '(30, 40]'

>>> df.groupby(c.values).transform(sum)
Traceback (most recent call last):
  File "<ipython-input-119-35406dcb9540>", line 1, in <module>
    df.groupby(c.values).transform(sum)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_163_gf9f88b2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 3017, in transform
    indexer = indices[name]
KeyError: '(30, 40]'

>>> df.groupby(c.get_values()).transform(sum)
    a
0  10
1  30
2  25
3  30
4  10
>>> df.groupby(c.values.codes).transform(sum)
    a
0  10
1  30
2  25
3  30
4  10
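The codes-based workaround above, restated as a self-contained sketch. It passes the string `"sum"` rather than the builtin `sum`, which is an assumption about what reads most clearly on current pandas, not something taken from the session above. Note that values falling outside every bin get code -1, so they would form their own group rather than being dropped.

```python
import pandas as pd

df = pd.DataFrame({"a": [5, 15, 25, 15, 5]})
c = pd.cut(df["a"], bins=[0, 10, 20, 30, 40])

# Group on the integer codes, so only categories present in the data form groups.
result = df.groupby(c.values.codes).transform("sum")

print(result["a"].tolist())  # [10, 30, 25, 30, 10]
```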

jreback · 2015-04-17T16:28:33Z

Yeah, you group on the codes, so it may not be doing the correct thing for transform.

Regular groupby ops work nicely.

dsm054 · 2015-04-17T16:45:50Z

Urf, the same thing will happen with filter too.

shoyer · 2015-04-17T20:12:10Z

This still seems like a bug to me. It should be OK to transform even if not every category is found in the data, yes? Otherwise the interface is not very useable.

jreback · 2015-04-17T20:23:02Z

Sure, it's the fast path that's not handled properly.

dsm054 · 2015-04-17T20:26:17Z

Yeah, the conversation went off track because of (1) my misunderstanding of how Categoricals are implemented, and (2) the fact I don't like the official behaviour. But the current transform and filter behaviour are both broken.

evanpw · 2015-04-19T20:23:42Z

What's the broken behavior for filter? This seems to work as expected:

>>> df = pd.DataFrame({"a": [5,15,25,15,5]})
>>> c = pd.cut(df.a, bins=[0,10,20,30, 40])
>>> df.groupby(c).filter(lambda x: len(x) > 1)
    a
0   5
1  15
3  15
4   5

dsm054 · 2015-04-19T22:01:54Z

@evanpw: for example,

>>> df = pd.DataFrame({"a": [5,15,25]})
>>> c = pd.cut(df.a, bins=[0,10,20,30,40])
>>> df.groupby(c).filter(np.all)
Traceback (most recent call last):
  File "<ipython-input-12-78d672ed03d5>", line 1, in <module>
    df.groupby(c).filter(np.all)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_173_g67b08f6-py2.7-linux-i686.egg/pandas/core/groupby.py", line 3121, in filter
    indices.append(self._get_index(name))
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_173_g67b08f6-py2.7-linux-i686.egg/pandas/core/groupby.py", line 448, in _get_index
    return self.indices[name]
KeyError: '(30, 40]'

jreback · 2015-04-29T10:32:15Z

closed by #9994

jreback added Bug Groupby Categorical Categorical Data Type Difficulty Intermediate labels Apr 17, 2015

jreback added this to the 0.16.1 milestone Apr 17, 2015

evanpw mentioned this issue Apr 26, 2015

BUG: transform and filter misbehave when grouping on categorical data #9994

Merged

jreback closed this as completed Apr 29, 2015

jreback mentioned this issue May 12, 2015

BUG: filter with a multi-grouping including a datetimelike fails #10114

Closed
