Grouping by pd.cut() results creates a categorical index #10140

therriault · 2015-05-14T23:03:39Z

After upgrading from 0.16.0 to 0.16.1, some of my plotting stopped functioning right, and after debugging I figured out that it was an issue with the groupby results---namely, that they're getting a CategoricalIndex even though they're being grouped by a number.

This is basically what I do in my code (simplified for this example):

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['x'] = 100 * np.random.random(100)
df['y'] = df.x**2

binedges = np.arange(0, 110, 10)
binlabels = np.arange(5, 105, 10)

z = df.groupby(pd.cut(df.x, bins=binedges, labels=binlabels)).y.mean()

And here's what that produces:

x
5       21.809599
15     215.432515
25     556.803743
35    1282.261483
45    1941.166247
55    3104.923704
65    4256.942052
75    5862.686054
85    7304.835127
95    8810.194492
Name: y, dtype: float64

When I tried plotting this, the scale was all messed up---it was treating each of the 10 bins as an element of [0,1,2,...,9], which wasn't apparent until I tried changing the xticks parameter and everything went screwy. Took a while to debug, but finally found the culprit, by looking at z.index:

CategoricalIndex([5, 15, 25, 35, 45, 55, 65, 75, 85, 95], categories=[5, 15, 25, 35, 45, 55, 65, 75, ...], ordered=True, name=u'x', dtype='category')

This isn't the desired behavior, right? I know the CategoricalIndex feature is new, but I don't think this was an intended feature. This same code previously produced an Int64Index that would plot just fine, and there's no use case I can see for changing that. Or if that was what was intended, is there any way to select an Int64Index instead (other than converting it after the fact)?
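For anyone reproducing this, here is a minimal sketch of how the index can be inspected. It assumes a recent pandas release (the `observed` keyword and `np.random.default_rng` did not exist in 0.16.1), and the fixed seed is only there for reproducibility. The positional codes it prints are consistent with the 0 to 9 axis described above, though whether the plotting code actually read them is not verified here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, not in the original report
df = pd.DataFrame({"x": 100 * rng.random(100)})
df["y"] = df["x"] ** 2

binedges = np.arange(0, 110, 10)
binlabels = np.arange(5, 105, 10)

# observed=False keeps every bin in the result, even one with no rows
keys = pd.cut(df["x"], bins=binedges, labels=binlabels)
z = df.groupby(keys, observed=False)["y"].mean()

print(type(z.index).__name__)  # class of the resulting index
print(z.index.categories)      # the integer labels that were passed to pd.cut
print(z.index.codes)           # position of each label among the categories
```

The labels themselves stay integers (they are the categories); only the index wrapping them is categorical.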

UPDATE: Also note that this screws up using pd.cut() to create bins in a dataframe, because the indices don't line up afterward. So doing something like

df['bin'] = pd.cut(df.x,bins=mybins,labels=mylabels)

will just result in an empty column. So I can't imagine that this is the correct behavior.


shoyer · 2015-05-14T23:18:13Z

I agree, I think a groupby/aggregate operation should probably just use the categories from a categorical group (which would be an Int64Index, in this case), instead of creating a new CategoricalIndex.

jreback · 2015-05-15T12:15:44Z

This is correct. I think you are confused because you relabel the bin edges to integers and expect them to be an Int64Index.

A result of the split-apply will give you an index of the same type as the input index. This is expected.
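A small illustration of that propagation, written against a recent pandas release (an assumption; the `observed` keyword is newer than this thread). The same integer values are used as group keys twice, once as plain integers and once as a categorical:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])
int_keys = pd.Series([5, 5, 15, 15])      # plain integer keys
cat_keys = int_keys.astype("category")    # same values, categorical dtype

by_int = values.groupby(int_keys).mean()
by_cat = values.groupby(cat_keys, observed=False).mean()

# The result index follows the dtype of the grouping keys
print(type(by_int.index).__name__, by_int.index.dtype)
print(type(by_cat.index).__name__, by_cat.index.dtype)
```

The aggregated values are identical in both cases; only the index type differs.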

All that said, there are no explicit tests one way or the other on this.

therriault · 2015-05-15T15:26:08Z

I'm mainly confused about why this worked in previous versions but not now. It doesn't seem a change that serves a purpose, and there's a clear downside, so I think an explicit test would make sense here.

jreback · 2015-05-15T15:32:14Z

@therriault it 'worked' because the index that resulted from pd.cut was relabeled to an integer index (and thus it propagated).

So I think that the propagation in the groupby is correct, but what you are expecting is that 'relabeling' the bins gives a different index type (which is not unreasonable). We just need to clarify that (with some explicit tests). My point above is that I think the 'issue' is in what pd.cut does with the binlabels and not in .groupby (which is the accidental side effect here).

therriault · 2015-05-15T16:32:51Z

That makes sense. It would be ideal, though, if pd.cut either chose the index type based upon the type of the labels, or provided an option to explicitly specify the index type it outputs. For the time being, adding the line z.index = binlabels after the groupby in the code above works, but it doesn't solve the second issue of creating numbered bins with pd.cut by itself.
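Spelled out, the workaround looks roughly like this. It is a sketch against a recent pandas release (the `observed` keyword is an assumption relative to 0.16.1), and it shows a second variant that derives the integers from the index itself rather than relying on position:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
df = pd.DataFrame({"x": 100 * rng.random(100)})
df["y"] = df["x"] ** 2

binedges = np.arange(0, 110, 10)
binlabels = np.arange(5, 105, 10)

keys = pd.cut(df["x"], bins=binedges, labels=binlabels)
z = df.groupby(keys, observed=False)["y"].mean()

# Variant A (from the thread): overwrite the index positionally.
# This assumes one row per bin, in bin order, and drops the index name.
z_a = z.copy()
z_a.index = binlabels

# Variant B: convert the categorical index to its underlying integers.
z_b = z.copy()
z_b.index = z.index.astype("int64")

print(z_a.index)
print(z_b.index)
```

Variant B does not depend on the order or number of rows, so it may be the safer choice if some bins can be dropped from the result.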

(Use case: I calculate percentile ranks from 1 to 100 from some output, then use pd.cut(percentile_rank, bins=np.arange(0,110,10), labels=np.arange(1,11,1), right=True) to create decile ranks. That no longer works. I can work around it by going back to the original rankings or by doing something like np.ceil(0.1*percentile_rank), but it's just more elegant to have explicitly defined bins.)
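To make the use case concrete, here is a sketch comparing the two routes on the integer ranks 1 to 100. The `percentile_rank` series is made up for illustration, and the final cast back to integers is the extra step that the categorical result requires:

```python
import numpy as np
import pandas as pd

# Hypothetical input: every integer percentile rank from 1 to 100
percentile_rank = pd.Series(np.arange(1, 101))

# Route 1: explicit bins (0, 10], (10, 20], ... labelled 1..10
deciles_cut = pd.cut(
    percentile_rank,
    bins=np.arange(0, 110, 10),
    labels=np.arange(1, 11, 1),
    right=True,
)

# Route 2: the arithmetic workaround mentioned above
deciles_ceil = np.ceil(0.1 * percentile_rank).astype("int64")

# The cut result is categorical; cast it to compare as numbers
deciles_cut_int = deciles_cut.astype("int64")
print((deciles_cut_int == deciles_ceil).all())
```

For these integer ranks the two routes agree; for arbitrary floating-point ranks the ceil version could be sensitive to rounding at bin edges, which is one argument for the explicit bins.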

jreback · 2015-05-15T16:37:51Z

@therriault as I said, you are doing an 'implicit' conversion in the index of pd.cut (which is a categorical), and turning it into something else.

fyi, you should simply use qcut if you want quantiles
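A minimal sketch of the qcut route, assuming the goal is decile numbers 1 to 10 as plain integers. The `scores` data is made up, and `labels=False` asks qcut for integer bin positions instead of a categorical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
scores = pd.Series(rng.normal(size=1000))  # hypothetical raw output

# labels=False returns the integer bin position (0..9) for each value;
# adding 1 turns that into decile ranks 1..10.
deciles = pd.qcut(scores, q=10, labels=False) + 1

print(deciles.dtype)
print(deciles.value_counts().sort_index())
```

Note that qcut bins by sample quantiles, so it is not a drop-in replacement when the bin edges must be fixed in advance.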

jorisvandenbossche · 2015-05-16T16:17:17Z

Is the issue here also not that the plotting of a CategoricalIndex is not well implemented?
If our plotting machinery just sees it as integer values, then the problem would also be solved? (at least the original reason you reported this?)

Further, @therriault, about your update:

df['bin'] = pd.cut(df.x, bins=binedges, labels=binlabels)

This works for me. What is not working for you here? (you say you get an empty column?)

therriault · 2015-05-19T21:20:52Z

It's not an empty column---it's there, it's just a categorical, so though it looks like a number I can't operate on it like one (and it looks empty when I try to operate on it, which is why I thought it was empty at first). It's basically the same issue as I have with plotting: things that used to work fine don't work anymore, in a way that makes older code break after updating.
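A short sketch of what that looks like and one way around it, assuming a recent pandas release. The evenly spaced `x` values are an assumption chosen so that every row falls inside a bin (a value outside the bins would become NaN and the integer cast would fail):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 50 values strictly inside the (0, 100] range
df = pd.DataFrame({"x": np.linspace(1, 99, 50)})

binedges = np.arange(0, 110, 10)
binlabels = np.arange(5, 105, 10)

df["bin"] = pd.cut(df["x"], bins=binedges, labels=binlabels)
print(df["bin"].dtype)  # category

# Cast the labels back to integers before doing arithmetic on them
df["bin_num"] = df["bin"].astype("int64")
print((df["bin_num"] * 2).head())
```

If some values can fall outside the bins, casting to float instead of int64 would keep the missing entries as NaN.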

Going back to your first question, that change to the plotting framework would definitely solve the immediate issue of graphing, but the other issue still remains. The latest update took a process (pd.cut) that used to determine index type based on the values passed to it, and changed it to a process where everything gets a categorical index. Was there a reason for that change that I'm missing? Don't get me wrong, I like having the categorical index option, I just don't want to have it all the time to the exclusion of other options!

jreback · 2015-05-20T11:52:50Z

@therriault want to take a crack at changing the index-handling code when binlabels are passed? Relevant code is here

There are exactly 2 tests which hit this branch. Neither of them has integer binlabels. So open to thoughts on what is reasonable here.

e.g. one reasonable soln is to make the user pass in an index if they actually want one, e.g.
z = df.groupby(pd.cut(df.x, bins=binedges, labels=pd.Index(binlabels))).y.mean()

This is trivial to implement and will preserve back-compat.

therriault · 2015-05-26T22:30:57Z

Looking at the code and trying to figure out how to do it, though I'm not quite sure enough in my solution to submit a PR yet (apologies---I'm much more of a user than a developer). But basically, what I was thinking is to take this (lines 218-220):

        levels = np.asarray(levels, dtype=object)
        np.putmask(ids, na_mask, 0)
        fac = Categorical(ids - 1, levels, ordered=True, name=name, fastpath=True)

and change it to this:

        if levels.dtype == 'int':
            levels = Series(levels,index=ids)
            fac = ids.map(levels)
        else:
            levels = np.asarray(levels, dtype=object)
            np.putmask(ids, na_mask, 0)
            fac = Categorical(ids - 1, levels, ordered=True, name=name, fastpath=True)

That seems like the simplest solution---it just adds the case where if integers are passed as labels, the outputs are integers, but otherwise they're still categoricals. That seems a pretty self-contained solution. As far as what you suggested (explicitly passing an index to cut), I think the issue is that the index is created in groupby, not cut, so it's the creation of a categorical that's the issue. (The introduction of CategoricalIndex is what started this whole thing in the first place, because groupby started taking the categoricals that cut was passing and creating categorical indices rather than inferring the type from the actual content of the index.)
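The same idea can be sketched on the user side, without touching pandas internals. The helper below is hypothetical (`cut_to_numeric_labels` is not a pandas function); it assumes a Series input and numeric labels, and it returns floats so that out-of-range values can stay NaN:

```python
import numpy as np
import pandas as pd

def cut_to_numeric_labels(values, bins, labels):
    """Bin a Series and return the numeric label of each bin (hypothetical helper)."""
    # labels=False makes pd.cut return bin positions instead of a categorical
    codes = pd.cut(values, bins=bins, labels=False)
    labels = np.asarray(labels, dtype="float64")
    out = pd.Series(np.nan, index=values.index, name=values.name)
    inside = codes.notna()  # rows that fell into some bin
    out.loc[inside] = labels[codes[inside].to_numpy().astype("int64")]
    return out

binedges = np.arange(0, 110, 10)
binlabels = np.arange(5, 105, 10)

x = pd.Series([1.0, 12.0, 99.0, 150.0], name="x")
print(cut_to_numeric_labels(x, binedges, binlabels))
```

Grouping by the returned Series would give a numeric index, at the cost of losing the ordered-category information that pd.cut normally carries.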

Thoughts?

jreback · 2015-05-27T11:52:46Z

@therriault adding a very specific case is a non-starter. The general soln is to allow a passed Index as the binlabels as I described above.


jreback · 2015-06-20T12:48:08Z

closing, as the soln is to fix the plotting of Categoricals (see #10254); the generated index type is correct.

jreback added Groupby Categorical Categorical Data Type API Design labels May 15, 2015

jreback added this to the 0.17.0 milestone May 20, 2015

jorisvandenbossche modified the milestones: 0.16.2, 0.17.0 Jun 2, 2015

jreback added a commit to jreback/pandas that referenced this issue Jun 2, 2015

REGR: ensure passed binlabels to pd.cut have a compat dtype on output (…

c4c66d2

…pandas-dev#10140)

jreback mentioned this issue Jun 2, 2015

REGR: ensure passed binlabels to pd.cut have a compat dtype on output (#10140) #10252

Closed

jreback added a commit to jreback/pandas that referenced this issue Jun 2, 2015

REGR: ensure passed binlabels to pd.cut have a compat dtype on output (…

d12db3a

…pandas-dev#10140)

jreback added a commit to jreback/pandas that referenced this issue Jun 2, 2015

REGR: ensure passed binlabels to pd.cut have a compat dtype on output (…

137ca29

…pandas-dev#10140)

jreback added a commit to jreback/pandas that referenced this issue Jun 3, 2015

REGR: ensure passed binlabels to pd.cut have a compat dtype on output (…

0b9bb41

…pandas-dev#10140)

jreback added the Needs Discussion Requires discussion from core team before further action label Jun 3, 2015

jreback modified the milestones: 0.17.0, 0.16.2 Jun 3, 2015

jorisvandenbossche mentioned this issue Jun 8, 2015

API/BUG: inconsistent plotting of new CategoricalIndex #10254

Open

jreback closed this as completed Jun 20, 2015

jorisvandenbossche mentioned this issue Jul 30, 2015

API: value_counts on a Categorical Series should have a CategoricalIndex #10704

Closed
