groupby transform misbehaving with categoricals #9921

Closed
dsm054 opened this issue Apr 17, 2015 · 14 comments
Labels
Bug Categorical Categorical Data Type Groupby
Milestone

Comments

@dsm054
Contributor

dsm054 commented Apr 17, 2015

Starting from this SO question, we can see that something is odd when doing a groupby using a Series of category dtype. (Today's trunk, 0.16.0-163-gf9f88b2.)

>>> df = pd.DataFrame({"a": [5,15,25]})
>>> c = pd.cut(df.a, bins=[0,10,20,30,40])
>>> c
0     (0, 10]
1    (10, 20]
2    (20, 30]
Name: a, dtype: category
Categories (4, object): [(0, 10] < (10, 20] < (20, 30] < (30, 40]]
>>> df.groupby(c).sum()
           a
a           
(0, 10]    5
(10, 20]  15
(20, 30]  25
(30, 40] NaN
>>> df.groupby(c).transform(sum)
Traceback (most recent call last):
  File "<ipython-input-81-1eb61986eff3>", line 1, in <module>
    df.groupby(c).transform(sum)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_163_gf9f88b2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 3017, in transform
    indexer = indices[name]
KeyError: '(30, 40]'
>>> df.a.groupby(c).transform(max)
Traceback (most recent call last):
  File "<ipython-input-87-9759e8c1e070>", line 1, in <module>
    df.a.groupby(c).transform(max)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_163_gf9f88b2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 2422, in transform
    return self._transform_fast(cyfunc)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_163_gf9f88b2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 2461, in _transform_fast
    values = np.repeat(values, com._ensure_platform_int(counts))
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 393, in repeat
    return repeat(repeats, axis)
ValueError: count < 0

It looks like we're somehow using the categories themselves, not the values. Not sure if this is a consequence of the special-casing of grouping on categories.

ISTM if we have a Series with categorical dtype passed to groupby, we should group on the values and not return the missing elements; if you want the missing elements, you should pass the Categorical object itself. Admittedly I haven't thought through all the consequences, but I was pretty surprised when I read the docs and saw this was the intended behaviour.
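
For readers on later pandas releases, a minimal sketch of the two behaviours contrasted above. It assumes a pandas version that has the `observed` keyword on `groupby`, which the 0.16 development build used in this thread does not; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": [5, 15, 25]})
c = pd.cut(df["a"], bins=[0, 10, 20, 30, 40])

# observed=False: one result row per declared category, including the
# empty (30, 40] bin (the documented Categorical-grouper behaviour).
all_bins = df["a"].groupby(c, observed=False).sum()

# observed=True: only the bins that actually occur in the data.
seen_bins = df["a"].groupby(c, observed=True).sum()

print(len(all_bins), len(seen_bins))
```

The value reported for the empty bin has differed between pandas versions, so the sketch only compares the number of groups.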

@jreback
Contributor

jreback commented Apr 17, 2015

@dsm054 No, this is the defined behavior: a Categorical grouper by definition groups on the specified categories.

pd.cut returns a Categorical.

@jreback jreback added this to the 0.16.1 milestone Apr 17, 2015
@dsm054
Contributor Author

dsm054 commented Apr 17, 2015

pd.cut does not always return a Categorical:

>>> df = pd.DataFrame({"a": [5,15,25]})
>>> c = pd.cut(df.a, bins=[0,10,20,30,40])
>>> c
0     (0, 10]
1    (10, 20]
2    (20, 30]
Name: a, dtype: category
Categories (4, object): [(0, 10] < (10, 20] < (20, 30] < (30, 40]]
>>> type(c)
<class 'pandas.core.series.Series'>

@jreback
Contributor

jreback commented Apr 17, 2015

It's a categorical Series, which is the same.

@dsm054
Contributor Author

dsm054 commented Apr 17, 2015

? I must be missing something.

A Series with categorical dtype contains the information about which category each index corresponds to. This is exactly the information we need to perform a groupby.

A Categorical itself does not contain this information, e.g.

>>> pd.Categorical(['(0, 10]', '(10, 20]', '(20, 30]', '(30, 40]'], ordered=True)
[(0, 10], (10, 20], (20, 30], (30, 40]]
Categories (4, object): [(0, 10] < (10, 20] < (20, 30] < (30, 40]]

It looks to me like this bug is due exactly to thinking we should be working with the Categorical side of things, when we really need the Series side. That's why the transform is getting confused, because '(30, 40]' doesn't correspond to anything in the data itself.

@jreback
Contributor

jreback commented Apr 17, 2015

A categorical Series just holds a Categorical. These are exactly the same, except that the Series has an attached index. You actually just need to group on the Categorical, e.g. something like

if is_categorical_dtype(obj):
    if isinstance(obj, Series):
        # unwrap the Series to get the underlying Categorical
        obj = obj.values

    # ... we have a Categorical at this point
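
A small sketch of the relationship described above, written against a recent pandas release rather than the 0.16 build discussed here (the variable names are illustrative): the Series wraps a Categorical, and that Categorical carries both the full list of declared categories and one integer code per row.

```python
import pandas as pd

df = pd.DataFrame({"a": [5, 15, 25]})
c = pd.cut(df["a"], bins=[0, 10, 20, 30, 40])

cat = c.values                      # the Categorical held by the Series
n_categories = len(cat.categories)  # all declared bins, including empty ones
codes = cat.codes.tolist()          # per-row position within cat.categories

print(type(cat).__name__, n_categories, codes)
```

The codes only ever point at categories that occur in the data, while the category list also includes the empty (30, 40] bin.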

@dsm054
Contributor Author

dsm054 commented Apr 17, 2015

Wow, this is some crazy terminology, and now I understand why I couldn't make sense of anything you were saying!

You're right: we use the Categorical both for an object specifically encoding the type (the set of possible values and the ordering among them, the dtype=category part described in "Categories:" ) and also the vector result of applying what I think of as a categorical encoding to an ordinary type.

My preferred behaviour would have been that if you pass a Categorical, you get the missing values; if you pass a Series, you don't. But even if that ship has sailed, .values won't cut it, because the same issue happens when you group on a Categorical.

It looks to me like either .get_values() or .values.codes should work, because they access what I thought of as the values. Does that make sense?

>>> df = pd.DataFrame({"a": [5,15,25,15,5]})
>>> c = pd.cut(df.a, bins=[0,10,20,30, 40])
>>> df.groupby(c).transform(sum)
Traceback (most recent call last):
  File "<ipython-input-118-1eb61986eff3>", line 1, in <module>
    df.groupby(c).transform(sum)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_163_gf9f88b2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 3017, in transform
    indexer = indices[name]
KeyError: '(30, 40]'

>>> df.groupby(c.values).transform(sum)
Traceback (most recent call last):
  File "<ipython-input-119-35406dcb9540>", line 1, in <module>
    df.groupby(c.values).transform(sum)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_163_gf9f88b2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 3017, in transform
    indexer = indices[name]
KeyError: '(30, 40]'

>>> df.groupby(c.get_values()).transform(sum)
    a
0  10
1  30
2  25
3  30
4  10
>>> df.groupby(c.values.codes).transform(sum)
    a
0  10
1  30
2  25
3  30
4  10
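
On more recent pandas versions, the codes-based grouping above can be sketched with the `.cat.codes` accessor, which is assumed here as a stand-in for `.values.codes`. This is a workaround sketch, not the fix that was applied to pandas.

```python
import pandas as pd

df = pd.DataFrame({"a": [5, 15, 25, 15, 5]})
c = pd.cut(df["a"], bins=[0, 10, 20, 30, 40])

# Grouping on the integer codes means only bins present in the data
# become groups, so the empty (30, 40] bin never enters the transform.
transformed = df["a"].groupby(c.cat.codes).transform("sum")

print(transformed.tolist())
```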

@jreback
Contributor

jreback commented Apr 17, 2015

Yeah, you group on the codes, so it may not be doing the correct thing for transform.

Regular groupby ops work nicely.

@dsm054
Contributor Author

dsm054 commented Apr 17, 2015

Urf, the same thing will happen with filter too.

@shoyer
Member

shoyer commented Apr 17, 2015

This still seems like a bug to me. It should be OK to transform even if not every category is found in the data, yes? Otherwise the interface is not very usable.
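
To illustrate why an empty category need not get in the way of a transform, here is a rough NumPy sketch. It is not pandas' internal implementation, and it assumes no missing values (a code of -1 would break `np.bincount`). Each category gets an aggregate, empty ones included, and the result is broadcast back to the rows through the codes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [5, 15, 25, 15, 5]})
c = pd.cut(df["a"], bins=[0, 10, 20, 30, 40])

codes = c.cat.codes.to_numpy()        # one integer per row
n_categories = len(c.cat.categories)  # 4, including the empty (30, 40] bin

# Per-category sums; the empty category simply receives 0.
sums = np.bincount(codes, weights=df["a"].to_numpy(), minlength=n_categories)

# Broadcast each row's category sum back onto the original index.
transformed = pd.Series(sums[codes], index=df.index, name="a")

print(transformed.tolist())
```

Because rows are looked up by code, the entry for the empty category is never read, so its presence does not affect the output shape.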

@jreback
Contributor

jreback commented Apr 17, 2015

Sure, it's the fast path that's not handled properly.

@dsm054
Contributor Author

dsm054 commented Apr 17, 2015

Yeah, the conversation went off track because of (1) my misunderstanding of how Categoricals are implemented, and (2) the fact I don't like the official behaviour. But the current transform and filter behaviour are both broken.

@evanpw
Contributor

evanpw commented Apr 19, 2015

What's the broken behavior for filter? This seems to work as expected:

>>> df = pd.DataFrame({"a": [5,15,25,15,5]})
>>> c = pd.cut(df.a, bins=[0,10,20,30, 40])
>>> df.groupby(c).filter(lambda x: len(x) > 1)
    a
0   5
1  15
3  15
4   5

@dsm054
Contributor Author

dsm054 commented Apr 19, 2015

@evanpw: for example,

>>> df = pd.DataFrame({"a": [5,15,25]})
>>> c = pd.cut(df.a, bins=[0,10,20,30,40])
>>> df.groupby(c).filter(np.all)
Traceback (most recent call last):
  File "<ipython-input-12-78d672ed03d5>", line 1, in <module>
    df.groupby(c).filter(np.all)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_173_g67b08f6-py2.7-linux-i686.egg/pandas/core/groupby.py", line 3121, in filter
    indices.append(self._get_index(name))
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_173_g67b08f6-py2.7-linux-i686.egg/pandas/core/groupby.py", line 448, in _get_index
    return self.indices[name]
KeyError: '(30, 40]'
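
By analogy with the codes-based grouping shown earlier in the thread, a possible way to sidestep the empty category when filtering is sketched below. It assumes a recent pandas with the `.cat.codes` accessor and uses a simple length predicate rather than `np.all`.

```python
import pandas as pd

df = pd.DataFrame({"a": [5, 15, 25, 15, 5]})
c = pd.cut(df["a"], bins=[0, 10, 20, 30, 40])

# Group on the integer codes so only bins that occur in the data form groups,
# then keep the rows belonging to groups with more than one member.
kept = df.groupby(c.cat.codes).filter(lambda g: len(g) > 1)

print(kept.index.tolist())
```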

@jreback
Contributor

jreback commented Apr 29, 2015

closed by #9994
