groupby transform misbehaving with categoricals #9921
Comments
@dsm054 No, this is the defined behavior: a Categorical grouper by definition groups on the specified groups.
It's a Categorical Series, which is the same.
? I must be missing something. A Series with categorical dtype contains the information about what category each index corresponds to. This is exactly the information we need to perform a groupby. A Categorical itself does not contain this information, e.g.
It looks to me like this bug is due exactly to thinking we should be working with the Categorical side of things, when we really need the Series side. That's why the transform is getting confused, because '(30, 40]' doesn't correspond to anything in the data itself.
A categorical Series just holds a Categorical. These are exactly the same, except that the Series has an attached index. You actually just need to group on the Categorical, e.g. something like
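The snippet that originally followed this comment is not preserved here. As a rough illustration of the point being made (not the commenter's code), the sketch below shows that a Series of category dtype wraps a Categorical, which stores the full set of categories plus one integer code per row. The data and the names `ages`, `binned`, `category_labels`, and `codes` are made up for the example, and it assumes a reasonably recent pandas rather than the 0.16 trunk discussed in this thread.

```python
import pandas as pd

# Hypothetical data: no age falls into the middle bin (30, 40].
ages = pd.Series([23, 27, 45])

# pd.cut on a Series returns a Series of category dtype.
binned = pd.cut(ages, bins=[20, 30, 40, 50])

# The Series holds a Categorical; the index is the only extra piece.
underlying = binned.values

# All three bins are categories, even the one that never occurs.
category_labels = [str(c) for c in binned.cat.categories]

# One integer code per row, pointing into the categories.
codes = binned.cat.codes.tolist()

print(category_labels)
print(codes)
```

The code for the middle bin never appears in `codes`, which is the sense in which '(30, 40]' "doesn't correspond to anything in the data itself".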
Wow, this is some crazy terminology, and now I understand why I couldn't make sense of anything you were saying! You're right: we use the Categorical both for an object specifically encoding the type (the set of possible values and the ordering among them, the dtype=category part described in "Categories:") and also the vector result of applying what I think of as a categorical encoding to an ordinary type. My preferred behaviour would have been that if you pass a Categorical, you get the missing values; if you pass a Series, you don't. But even if that ship has sailed, it looks to me like either
Yeah, you group on the codes, so it may not be doing the correct thing for transform. Regular groupby ops work nicely.
Urf, the same thing will happen with
This still seems like a bug to me. It should be OK to transform even if not every category is found in the data, yes? Otherwise the interface is not very usable.
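To make the expectation concrete, here is a minimal sketch of a transform grouped on a categorical column where one category has no rows. The DataFrame, its column names, and the variable `result` are invented for illustration. The `observed` argument was added in later pandas releases and did not exist at the time of this thread; it is passed explicitly here because its default has changed across versions.

```python
import pandas as pd

# Hypothetical frame: the (30, 40] bin is a category but has no rows.
df = pd.DataFrame({"age": [23, 27, 45], "score": [1.0, 3.0, 5.0]})
df["bin"] = pd.cut(df["age"], bins=[20, 30, 40, 50])

# Broadcast each group's mean back onto the original rows.
# The result should line up with df's index, one value per row,
# regardless of the empty category.
result = df.groupby("bin", observed=False)["score"].transform("mean")

print(result.tolist())
```

On a recent pandas this yields one value per input row (the two rows in the first bin share their mean), which is the behaviour the comment above is asking for.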
Sure, it's the fast path that's not handled properly.
Yeah, the conversation went off track because of (1) my misunderstanding of how Categoricals are implemented, and (2) the fact I don't like the official behaviour. But the current transform and filter behaviour are both broken.
What's the broken behavior for
@evanpw: for example,
Closed by #9994.
Starting from this SO question, we can see that something is odd when doing a groupby using Series of category dtype. (Today's trunk, 0.16.0-163-gf9f88b2.)
It looks like we're somehow using the categories themselves, not the values. Not sure if this is a consequence of the special-casing of grouping on categories.
ISTM if we have a Series with categorical dtype passed to groupby, we should group on the values, and not return the missing elements; if you want the missing elements, you should pass the Categorical object itself. Admittedly I haven't thought through all the consequences, but I was pretty surprised when I read the docs and saw this was the intended behaviour.
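The two behaviours contrasted above (returning every category versus only the values that occur) can be sketched as follows. This is not how the choice was exposed in 0.16; it assumes a later pandas release, where `groupby` accepts an `observed` flag. The frame and the names `counts_all` and `counts_seen` are hypothetical.

```python
import pandas as pd

# Hypothetical frame: the (30, 40] bin is a category but has no rows.
df = pd.DataFrame({"age": [23, 27, 45]})
df["bin"] = pd.cut(df["age"], bins=[20, 30, 40, 50])

# observed=False: one group per category, including the empty one.
counts_all = df.groupby("bin", observed=False).size()

# observed=True: only categories that actually appear in the data.
counts_seen = df.groupby("bin", observed=True).size()

print(counts_all.tolist())   # [2, 0, 1]
print(counts_seen.tolist())  # [2, 1]
```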