
Grouping by pd.cut() results creates a categorical index #10140

Closed
therriault opened this issue May 14, 2015 · 12 comments
Labels: API Design, Categorical, Groupby, Needs Discussion

@therriault

After upgrading from 0.16.0 to 0.16.1, some of my plotting stopped functioning right, and after debugging I figured out that it was an issue with the groupby results---namely, that they're getting a CategoricalIndex even though they're being grouped by a number.

This is basically what I do in my code (simplified for this example):

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['x'] = 100 * np.random.random(100)
df['y'] = df.x**2

binedges = np.arange(0, 110, 10)
binlabels = np.arange(5, 105, 10)

z = df.groupby(pd.cut(df.x, bins=binedges, labels=binlabels)).y.mean()

And here's what that produces:

x
5       21.809599
15     215.432515
25     556.803743
35    1282.261483
45    1941.166247
55    3104.923704
65    4256.942052
75    5862.686054
85    7304.835127
95    8810.194492
Name: y, dtype: float64

When I tried plotting this, the scale was all messed up---it was treating each of the 10 bins as an element of [0,1,2,...,9], which wasn't apparent until I tried changing the xticks parameter and everything went screwy. Took a while to debug, but finally found the culprit, by looking at z.index:

CategoricalIndex([5, 15, 25, 35, 45, 55, 65, 75, 85, 95], categories=[5, 15, 25, 35, 45, 55, 65, 75, ...], ordered=True, name=u'x', dtype='category')

This isn't the desired behavior, right? I know the CategoricalIndex feature is new, but I don't think this was an intended feature. This same code previously produced an Int64Index that would plot just fine, and there's no use case I can see for changing that. Or if that was what was intended, is there any way to select an Int64Index instead (other than converting it after the fact)?
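For readers hitting the same problem, converting after the fact can be sketched as follows. This is only an illustration, not a fix in pandas itself: it assumes a recent pandas release, swaps the random data for a deterministic `np.linspace` sample so every bin is populated, and the name `z_numeric` is made up for the example.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random data in the report (assumption).
df = pd.DataFrame({"x": np.linspace(0.5, 99.5, 100)})
df["y"] = df["x"] ** 2

binedges = np.arange(0, 110, 10)
binlabels = np.arange(5, 105, 10)

# Group by the binned column; observed=False keeps every bin in the result.
binned = pd.cut(df["x"], bins=binedges, labels=binlabels)
z = df.groupby(binned, observed=False)["y"].mean()
print(type(z.index).__name__)

# Convert the categorical index back to plain integers before plotting.
z_numeric = z.copy()
z_numeric.index = z.index.astype("int64")
print(z_numeric.index.dtype)
```

With an integer index, plotting code that relies on numeric x positions should see the bin centres rather than category positions.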

UPDATE: Also note that this screws up using pd.cut() to create bins in a dataframe, because the indices don't line up afterward. So doing something like

df['bin'] = pd.cut(df.x,bins=mybins,labels=mylabels)

will just result in an empty column. So I can't imagine that this is the correct behavior.

@shoyer
Member

shoyer commented May 14, 2015

I agree, I think a groupby/aggregate operation should probably just use the categories from a categorical group (which would be an Int64Index, in this case), instead of creating a new CategoricalIndex.

@jreback
Contributor

jreback commented May 15, 2015

This is correct. I think you are confused because you relabel the bin edges to integers and expect them to be an Int64Index.

A result of the split-apply will give you an index of the same type as the input index. This is expected.

All that said, there are no explicit tests one way or the other on this.

@therriault
Author

I'm mainly confused about why this worked in previous versions but not now. It doesn't seem like a change that serves a purpose, and there's a clear downside, so I think an explicit test would make sense here.

@jreback
Contributor

jreback commented May 15, 2015

@therriault it 'worked' because the index that resulted from pd.cut was relabeled to an integer index (and thus it propagated).

So I think that the propagation in the groupby is correct, but what you are expecting is that 'relabeling' the bins gives a different index type (which is not unreasonable). We just need to clarify that (with some explicit tests). My point above is that I think the 'issue' is in what pd.cut does with the binlabels and not in .groupby (which is the accidental side effect here).

@therriault
Author

That makes sense. It would be ideal, though, if pd.cut either chose the index type based upon the type of the labels, or provided an option to explicitly specify the index type it outputs. For the time being, adding the line z.index = binlabels after the groupby in the code above works, but it doesn't solve the second issue of creating numbered bins in the pd.cut command by itself.

(Use case: I calculate percentile ranks from 1 to 100 from some output, then use pd.cut(percentile_rank, bins=np.arange(0,110,10), labels=np.arange(1,11,1), right=True) to create decile ranks. That no longer works. I can work around it by going back to the original rankings or by doing something like np.ceil(0.1*percentile_rank), but it's just more elegant to have explicitly defined bins.)
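The decile use case can still be kept with explicit bins by converting the categorical result back to integers afterwards. A minimal sketch, assuming a recent pandas and a made-up input of the ranks 1 to 100 (the names `decile_cat` and `decile` are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical input: one percentile rank for each value 1..100.
percentile_rank = pd.Series(np.arange(1, 101))

# Same call as in the comment above; the result is a categorical Series.
decile_cat = pd.cut(
    percentile_rank,
    bins=np.arange(0, 110, 10),
    labels=np.arange(1, 11, 1),
    right=True,
)

# Cast the categorical back to integers so it can be used in arithmetic.
decile = decile_cat.astype("int64")
print(decile.dtype)
```

The cast raises if any rank falls outside the bins, because those rows become missing values that an integer dtype cannot hold.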

@jreback
Contributor

jreback commented May 15, 2015

@therriault as I said, you are doing an 'implicit' conversion in the index of pd.cut (which is a categorical), and turning it into something else.

fyi, you should simply use qcut if you want quantiles
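A rough sketch of the qcut suggestion, assuming a recent pandas and made-up scores from 1 to 100. Passing `labels=False` makes `pd.qcut` return integer bin codes instead of a categorical, so adding 1 gives decile ranks from 1 to 10:

```python
import numpy as np
import pandas as pd

# Hypothetical raw scores; any numeric Series would do.
scores = pd.Series(np.arange(1, 101))

# labels=False returns zero-based integer bin codes rather than categories.
decile = pd.qcut(scores, q=10, labels=False) + 1
print(decile.dtype)
print(decile.value_counts().sort_index())
```

Note that qcut picks the bin edges from the data's quantiles, so it is not a drop-in replacement when the edges must be fixed in advance.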

@jorisvandenbossche
Member

Isn't the issue here also that plotting of a CategoricalIndex is not well implemented?
If our plotting machinery just treated it as integer values, wouldn't the problem also be solved? (At least the original reason you reported this?)

Further, @therriault, about your update:

df['bin'] = pd.cut(df.x, bins=binedges, labels=binlabels)

This works for me. What is not working for you here? (you say you get an empty column?)

@therriault
Author

It's not an empty column---it's there, it's just a categorical, so though it looks like a number I can't operate on it like one (and it looks empty when I try to operate on it, which is why I thought it was empty at first). It's basically the same issue as I have with plotting: things that used to work fine don't work anymore, in a way that makes older code break after updating.
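To make the point about the column concrete, here is a small sketch, assuming a recent pandas and deterministic sample data. The column names `bin_num` and `offset` are made up for the example:

```python
import numpy as np
import pandas as pd

# Deterministic sample so that every value falls inside a bin (assumption).
df = pd.DataFrame({"x": np.linspace(0.5, 99.5, 100)})
df["bin"] = pd.cut(df["x"], bins=np.arange(0, 110, 10), labels=np.arange(5, 105, 10))
print(df["bin"].dtype)  # a categorical dtype, not an integer one

# Cast to a numeric dtype before doing arithmetic with the bin labels.
df["bin_num"] = df["bin"].astype("int64")
df["offset"] = df["x"] - df["bin_num"]
print(df[["x", "bin_num", "offset"]].head())
```

The labels only look like numbers while the column is categorical; after the cast they behave like any other integer column.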

Going back to your first question, that change to the plotting framework would definitely solve the immediate issue of graphing, but the other issue still remains. The latest update took a process (pd.cut) that used to determine index type based on the values passed to it, and changed it to a process where everything gets a categorical index. Was there a reason for that change that I'm missing? Don't get me wrong, I like having the categorical index option, I just don't want to have it all the time to the exclusion of other options!

@jreback
Contributor

jreback commented May 20, 2015

@therriault want to take a crack at changing the index handling code when binlabels are passed? Relevant code is here

There are exactly 2 tests which hit this branch, neither of which has integer binlabels. So open to thoughts on what is reasonable here.

e.g. one reasonable soln is to make the user pass in an index if they actually want one, e.g.
z = df.groupby(pd.cut(df.x, bins=binedges, labels=Index(binlabels))).y.mean()

This is trivial to implement and will preserve back-compat.

@jreback jreback added this to the 0.17.0 milestone May 20, 2015
@therriault
Author

I'm looking at the code and trying to figure out how to do it, though I'm not quite sure enough of my solution to submit a PR yet (apologies---I'm much more of a user than a developer). But basically, what I was thinking is to take this (lines 218-220):

        levels = np.asarray(levels, dtype=object)
        np.putmask(ids, na_mask, 0)
        fac = Categorical(ids - 1, levels, ordered=True, name=name, fastpath=True)

and change it to this:

        if levels.dtype == 'int':
            levels = Series(levels,index=ids)
            fac = ids.map(levels)
        else:
            levels = np.asarray(levels, dtype=object)
            np.putmask(ids, na_mask, 0)
            fac = Categorical(ids - 1, levels, ordered=True, name=name, fastpath=True)

That seems like the simplest solution---it just adds the case where if integers are passed as labels, the outputs are integers, but otherwise they're still categoricals. That seems a pretty self-contained solution. As far as what you suggested (explicitly passing an index to cut), I think the issue is that the index is created in groupby, not cut, so it's the creation of a categorical that's the issue. (The introduction of CategoricalIndex is what started this whole thing in the first place, because groupby started taking the categoricals that cut was passing and creating categorical indices rather than inferring the type from the actual content of the index.)

Thoughts?
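The idea of returning numeric labels directly can also be sketched outside pandas internals. The helper below, `cut_numeric`, is hypothetical and not part of pandas. It assumes a recent pandas, asks `pd.cut` for integer bin codes with `labels=False`, and then looks the numeric labels up by code:

```python
import numpy as np
import pandas as pd


def cut_numeric(values, bins, labels):
    """Bin values and return the numeric label of each bin as floats.

    Hypothetical helper for illustration; values outside the bins become NaN.
    """
    labels = np.asarray(labels)
    # labels=False makes pd.cut return bin codes instead of a Categorical.
    codes = pd.cut(np.asarray(values), bins=bins, labels=False)
    out = np.full(len(codes), np.nan)
    in_range = ~pd.isna(codes)
    out[in_range] = labels[codes[in_range].astype(int)]
    return out


binedges = np.arange(0, 110, 10)
binlabels = np.arange(5, 105, 10)
result = cut_numeric([3, 17, 99.5, 150], binedges, binlabels)
print(result)
```

The result is a float array so that out-of-range values can be represented as NaN; a caller that knows every value is in range could cast it to integers.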

@jreback
Contributor

jreback commented May 27, 2015

@therriault adding a very specific case is a non-starter. The general soln is to allow a passed Index as the binlabels as I described above.

@jreback
Contributor

jreback commented Jun 20, 2015

closing, as the soln is to plot Categoricals (see #10254); the generated index type is correct.
