Grouping by pd.cut() results creates a categorical index #10140
Comments
I agree, I think a groupby/aggregate operation should probably just use the categories from a categorical group (which would be an |
This is correct. I think you are confused because you relabel the bin edges to integers and expect them to be a A result of the split-apply will give you an index of the same type as the input index. This is expected. All that said, there are no explicit tests one way or the other on this. |
I'm mainly confused about why this worked in previous versions but not now. It doesn't seem a change that serves a purpose, and there's a clear downside, so I think an explicit test would make sense here. |
@therriault it 'worked' because the index that was the resultant of So I think that the propagation in the groupby is correct, but what you are expecting is that 'relabeling' the bins gives a different index type. (which is not unreasonable). We just need to clarify that (with some explicit tests). My point above is that I think the 'issue' is on what |
That makes sense. It would be ideal, though, if (Use case: I calculate percentile ranks from 1 to 100 from some output, then use |
@therriault as I said, you are doing an 'implicit' conversion in the index of fyi, you should simply use |
Is the issue here also not that the plotting of a CategoricalIndex is not well implemented? Further, @therriault, about your update:
This works for me. What is not working for you here? (you say you get an empty column?) |
It's not an empty column---it's there, it's just a categorical, so though it looks like a number I can't operate on it like one (and it looks empty when I try to operate on it, which is why I thought it was empty at first). It's basically the same issue as I have with plotting: things that used to work fine don't work anymore, in a way that makes older code break after updating. Going back to your first question, that change to the plotting framework would definitely solve the immediate issue of graphing, but the other issue still remains. The latest update took a process ( |
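The "looks like a number but can't be operated on like one" symptom described above can be illustrated with a minimal sketch. The data and the names `scores`, `pct_bin`, `as_int`, and `z` are made up for illustration and are not from the original report. It assumes a recent pandas, where `pd.cut` with integer labels returns a categorical and an explicit `astype` converts it back:

```python
import pandas as pd

# Hypothetical data: five scores binned into quartile labels 1..4.
scores = pd.Series([3, 47, 52, 98, 75])
pct_bin = pd.cut(scores, bins=[0, 25, 50, 75, 100], labels=[1, 2, 3, 4])

# The result prints like integers but has dtype "category", so
# arithmetic on it directly is not supported.
print(pct_bin.dtype)

# Converting after the fact yields an ordinary integer column.
as_int = pct_bin.astype("int64")
shifted = as_int + 1

# The same conversion works on the CategoricalIndex produced by groupby.
z = scores.groupby(pct_bin, observed=False).mean()
z.index = z.index.astype("int64")
print(z)
```

This is the "convert it after the fact" route mentioned in the original report, not a change in what `pd.cut` or `groupby` return.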
@therriault want to take a crack at changing the index handling code when There are exactly 2 tests which hit this branch, neither of which has integer bin labels. So open to thoughts of what is reasonable here. e.g. one reasonable soln is to make the user pass in an index if they actually want one, e.g. This is trivial to implement and will preserve back-compat. |
Looking at the code and trying to figure out how to do it, though I'm not quite sure enough in my solution to submit a PR yet (apologies---I'm much more of a user than a developer). But basically, what I was thinking is to take this (lines 218-220):

    levels = np.asarray(levels, dtype=object)
    np.putmask(ids, na_mask, 0)
    fac = Categorical(ids - 1, levels, ordered=True, name=name, fastpath=True)

and change it to this:

    if levels.dtype == 'int':
        levels = Series(levels,index=ids)
        fac = ids.map(levels)
    else:
        levels = np.asarray(levels, dtype=object)
        np.putmask(ids, na_mask, 0)
        fac = Categorical(ids - 1, levels, ordered=True, name=name, fastpath=True)

That seems like the simplest solution---it just adds the case where if integers are passed as labels, the outputs are integers, but otherwise they're still categoricals. That seems a pretty self-contained solution. As far as what you suggested (explicitly passing an index to Thoughts? |
@therriault adding a very specific case is a non-starter. The general soln is to allow a passed |
closing, as the soln is to plot Categoricals (see #10254); the generated index type is correct. |
After upgrading from 0.16.0 to 0.16.1, some of my plotting stopped functioning right, and after debugging I figured out that it was an issue with the groupby results---namely, that they're getting a `CategoricalIndex` even though they're being grouped by a number.

This is basically what I do in my code (simplified for this example):
And here's what that produces:
When I tried plotting this, the scale was all messed up---it was treating each of the 10 bins as an element of `[0,1,2,...,9]`, which wasn't apparent until I tried changing the `xticks` parameter and everything went screwy. Took a while to debug, but finally found the culprit, by looking at `z.index`:

This isn't the desired behavior, right? I know the `CategoricalIndex` feature is new, but I don't think this was an intended feature. This same code previously produced an `Int64Index` that would plot just fine, and there's no use case I can see for changing that. Or if that was what was intended, is there any way to select an `Int64Index` instead (other than converting it after the fact)?

UPDATE: Also note that this screws up using `pd.cut()` to create bins in a dataframe, because the indices don't line up afterward. So doing something like
will just result in an empty column. So I can't imagine that this is the correct behavior.
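The code snippets from the original report did not survive on this page, so the following is only a hedged reconstruction of the kind of pattern described, not the reporter's actual code. The frame `df`, its columns `x` and `y`, and the bin edges are all assumptions. It groups by a `pd.cut()` result with integer labels, and for comparison uses `labels=False`, which asks `pd.cut()` for plain integer bin codes instead of a categorical:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the report's (elided) example.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.uniform(0, 100, size=1000),
    "y": rng.normal(size=1000),
})
edges = np.linspace(0, 100, 11)  # 10 equal-width bins

# Integer labels still produce a categorical, so the grouped result
# carries a CategoricalIndex rather than an integer index.
labelled = pd.cut(df["x"], bins=edges, labels=list(range(10)),
                  include_lowest=True)
z = df.groupby(labelled, observed=False)["y"].mean()
print(type(z.index).__name__)

# labels=False returns 0-based integer bin codes, so grouping by them
# gives an ordinary integer index.
codes = pd.cut(df["x"], bins=edges, labels=False, include_lowest=True)
z_codes = df.groupby(codes)["y"].mean()
print(z_codes.index.dtype)
```

Note that `labels=False` always yields 0-based bin positions, so it only substitutes for custom integer labels when those labels are the positions themselves, as in the `[0,1,2,...,9]` case above.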