REGR: ensure passed binlabels to pd.cut have a compat dtype on output (#10140) #10252

Closed
wants to merge 1 commit into from

Conversation

jreback
Contributor

@jreback jreback commented Jun 2, 2015

From the original issue, closes #10140

In [1]: df = pd.DataFrame()

In [2]: df['x'] = 100 * np.random.random(100)

In [3]: df['y'] = df.x**2

In [4]: binedges = np.arange(0,110,10)

In [5]: binlabels = np.arange(5,105,10)

This now yields an Int64Index

In [6]: df.groupby(pd.cut(df.x,bins=binedges ,labels=binlabels )).y.mean()
Out[6]: 
5       43.485680
15     227.648305
25     680.165208
35    1244.704812
45    1977.675209
55    3159.067969
65    4435.735803
75    5649.164598
85    7027.956570
95    8997.975423
Name: y, dtype: float64

In [7]: df.groupby(pd.cut(df.x,bins=binedges ,labels=binlabels )).y.mean().index
Out[7]: Int64Index([5, 15, 25, 35, 45, 55, 65, 75, 85, 95], dtype='int64')

If you pass an index in, then you will get that out, specifically a Categorical will preserve the categories (note the 2nd example has reversed labels - why anyone would do this, don't know.......)

In [8]: df.groupby(pd.cut(df.x,bins=binedges ,labels=pd.Categorical(binlabels) )).y.mean().index
Out[8]: CategoricalIndex([5, 15, 25, 35, 45, 55, 65, 75, 85, 95], categories=[5, 15, 25, 35, 45, 55, 65, 75, ...], ordered=True, name=u'x', dtype='category')

In [9]: df.groupby(pd.cut(df.x,bins=binedges ,labels=pd.Categorical(binlabels,categories=binlabels[::-1]) )).y.mean().index
Out[9]: CategoricalIndex([95, 85, 75, 65, 55, 45, 35, 25, 15, 5], categories=[95, 85, 75, 65, 55, 45, 35, 25, ...], ordered=True, name=u'x', dtype='category')

In [10]: df.groupby(pd.cut(df.x,bins=binedges ,labels=pd.Categorical(binlabels,categories=binlabels[::-1]) )).y.mean()      
Out[10]: 
x
95      43.485680
85     227.648305
75     680.165208
65    1244.704812
55    1977.675209
45    3159.067969
35    4435.735803
25    5649.164598
15    7027.956570
5     8997.975423
Name: y, dtype: float64

@jreback jreback changed the title from "REGR: ensure passed binlabels to pd.cut have a compat dtype on output (#10410)" to "REGR: ensure passed binlabels to pd.cut have a compat dtype on output (#10140)" Jun 2, 2015
@jreback jreback added Groupby API Design Categorical Categorical Data Type labels Jun 2, 2015
@jreback jreback added this to the 0.16.2 milestone Jun 2, 2015
@jreback
Contributor Author

jreback commented Jun 2, 2015

cc @therriault
should solve your issue, pls have a look

@jreback jreback force-pushed the cut branch 2 times, most recently from d12db3a to 137ca29 Compare June 2, 2015 23:46
@jreback
Contributor Author

jreback commented Jun 2, 2015

I changed this slightly, so that non object/categorical indexes will be coerced (meaning that if you are passing in an object/cat index, or a list of stuff that is object like - e.g. strings), then you would get a categorical out (which is the same as current behavior).

So the following is essentially equivalent to the above (e.g. we are passing an index-like in, so preserve it) - fyi, this works for lists too, it just needs to be convertible to an Index.

In [19]: df.groupby(pd.cut(df.x,bins=binedges, labels=pd.DatetimeIndex(['20130101']*len(binlabels))+pd.TimedeltaIndex(binlabels,unit='D'))).y.mean()
Out[19]: 
2013-01-06      36.464340
2013-01-16     324.090951
2013-01-26     591.358244
2013-02-05    1327.934110
2013-02-15    1998.929692
2013-02-25    2813.419476
2013-03-07    4260.447121
2013-03-17    5729.868643
2013-03-27    7208.585942
2013-04-06    9034.624990
Name: y, dtype: float64

@jorisvandenbossche
Member

@jreback Thanks!

BTW, I now understand our confusion yesterday a bit better: you were still talking about the Index that comes out of a groupby with cut, but I think it is simpler to just talk about the output of cut itself (as, I think, we agree that a groupby should preserve the dtype, so grouping by a categorical column will give a CategoricalIndex, grouping by an integer column will give an Int64Index, etc ...)

To summarise with a smaller example:

df = pd.DataFrame()
df['x'] = 100 * np.random.random(5)
df['y'] = df.x**2
binedges = np.arange(0,110,10)
binlabels = np.arange(5,105,10)

So the 'rule' is now: dtype of the resulting Series is the dtype of the provided binlabels (in 0.16.1 the rule was: always give Categorical Series back):

In [49]: pd.cut(df.x, bins=binedges, labels=binlabels)
Out[49]:
0    65
1    25
2    65
3    65
4    25
dtype: int64

In [50]: pd.cut(df.x, bins=binedges, labels=pd.Categorical(binlabels))
Out[50]:
0    65
1    25
2    65
3    65
4    25
Name: x, dtype: category
Categories (10, int64): [5 < 15 < 25 < 35 ... 65 < 75 < 85 < 95]

(while now (0.16.1) both examples above would return a Categorical Series with integer categories)

@shoyer @TomAugspurger agree with the above?
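
For anyone on a pandas release where pd.cut returns a categorical whenever labels are passed (the 0.16.1 behaviour described above), a minimal sketch of getting the integer dtype by hand is an explicit astype. The sample values below are made up for illustration and are chosen so that none falls outside the bins.

```python
import numpy as np
import pandas as pd

# Made-up sample values (an assumption, not data from this thread).
x = pd.Series([3.0, 27.5, 61.2, 64.9, 22.0], name="x")
binedges = np.arange(0, 110, 10)   # 0, 10, ..., 100
binlabels = np.arange(5, 105, 10)  # bin midpoints 5, 15, ..., 95

binned = pd.cut(x, bins=binedges, labels=binlabels)
print(binned.dtype)  # categorical, with the integer labels as categories

# Explicit conversion to the labels' dtype. This raises if any value
# fell outside the bins, because NaN cannot be cast to an integer.
as_int = binned.astype(binlabels.dtype)
print(as_int.tolist())
```

The conversion is explicit, so the output type of pd.cut itself stays untouched.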

@jorisvandenbossche
Member

@jreback still two comments:

  1. What to do with object dtype (as in your last change, you included that when passing eg string array to labels (object dtype in pandas), these would result in a Categorical instead of object dtyped Series):
In [51]: string_labels = np.array(['l_' + str(i) for i in binlabels])

In [52]: pd.cut(df.x, bins=binedges, labels=string_labels)
Out[52]:
0    l_65
1    l_25
2    l_65
3    l_65
4    l_25
Name: x, dtype: category
Categories (10, object): [l_5 < l_15 < l_25 < l_35 ... l_65 < l_75 < l_85 < l_95]

This seems to deviate from the 'rule'. For consistency, I would personally not do this (i.e. also return an object Series in the example above, and require labels=pd.Categorical(string_labels) to obtain the output above), but I understand that for convenience this can be handy.

  2. How to handle labels when the first argument is an array and not a Series. Following the docs, this should result in a Categorical (which is conceptually equivalent to an array), but now with this PR it returns an Index or a Categorical depending on the labels type:
In [59]: pd.cut(df.x.values, bins=binedges, labels=binlabels)
Out[59]: Int64Index([65, 25, 65, 65, 25], dtype='int64')

In [60]: pd.cut(df.x.values, bins=binedges, labels=pd.Categorical(binlabels))
Out[60]:
[65, 25, 65, 65, 25]
Categories (10, int64): [5 < 15 < 25 < 35 ... 65 < 75 < 85 < 95]

Shouldn't the first case be an array?
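
As a point of comparison, here is a small sketch of getting a plain array from an array input without changing pd.cut. It assumes pd.cut returns a Categorical for array input, as the docs mentioned above describe, and uses made-up values.

```python
import numpy as np
import pandas as pd

values = np.array([61.2, 27.5, 64.9, 63.0, 22.0])  # made-up sample values
binedges = np.arange(0, 110, 10)
binlabels = np.arange(5, 105, 10)

cat = pd.cut(values, bins=binedges, labels=binlabels)
print(type(cat).__name__)

# np.asarray materialises the label of each element as a plain ndarray.
arr = np.asarray(cat)
print(arr)
```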

@TomAugspurger
Contributor

@jorisvandenbossche slight addition to your rule: when labels is not provided the result is a Categorical Series (or maybe an array once we figure out 2 in your last post).

@jreback
Contributor Author

jreback commented Jun 3, 2015

no, arrays and Index are equivalent. The output is a Series. So by definition you have the values as array-like (we are just discussing the dtype here).

object and categorical inputs must be treated the same. Otherwise you have a very odd case of when to output object vs. categorical. First, the current output is categorical, second, these have an almost identical API, and third, you simply lose information if you turn this into an object array (as it loses its categories).

This is ALSO true for integers/datetimes, however I think these are more natural in their non-categorical form already.

@jreback
Contributor Author

jreback commented Jun 3, 2015

@jorisvandenbossche your last point is the crux of the issue.

We are making a change here to have the output depend on the input (e.g. what type labels actually are).

I think this is for practicality. The output of pd.cut is BY DEFINITION a categorical. But the user says that instead of the auto-generated labels, they want 'these' labels, BUT they don't want a categorical. So it IS a bit odd.

@jorisvandenbossche
Member

@jreback Hmm, it is indeed a bit odd, I have to agree.

And looking back at older versions (what I forgot yesterday when I looked at this issue), pd.cut has always returned a Categorical (also before the implementation of CategoricalBlock and CategoricalIndex, tested with 0.14), also when providing labels.
So, changing this (eg now returning an int Series when providing int labels) will be an API change .. (and not something we should do in a point release, or maybe never)

@jorisvandenbossche
Member

If we look at the original issue that caused this discussion (#10140), that will also be solved by fixing #10254 (at least the plotting issue after the groupby)

@jreback
Contributor Author

jreback commented Jun 3, 2015

@jorisvandenbossche I agree. It IS correct to propagate the labels dtype through the groupby. Ok let's defer this and see if your plotting 'fix' (at least the first version), solves this.

@jreback jreback modified the milestones: 0.17.0, 0.16.2 Jun 3, 2015
@jreback jreback added the Needs Discussion Requires discussion from core team before further action label Jun 3, 2015
@jorisvandenbossche
Member

Sorry for missing the point that it has always been a Categorical when I started the discussion yesterday!

@therriault

Thanks for working through this everyone (and @jreback, sorry for not chiming in earlier). Totally understand the concern about changing the output type, which is not where I meant for this to go at all!

Ultimately, it sounds like my original usage just took advantage of an accidental bit of functionality that existed prior to the introduction of categorical indexes but wasn't part of the intended design. As such, if trying to restore it just creates more problems elsewhere, it's probably not worth fixing---it's straightforward enough to work around by assigning z.index = binlabels on the next line. (The use of custom bins is what differentiates this from what I'd do with pd.qcut---in other circumstances, I've solved the same issue by multiplying by x = (1/binwidth), rounding, then dividing by x again, but pd.cut seemed a more elegant solution.) If there were an optional index_type parameter within pd.cut that would attempt to coerce it, that'd make it more convenient, but it wouldn't add much in the way of real functionality.

In any case, changing the way these are plotted (as you discuss on the other thread) would address the primary issue. My main use case is in creating calibration plots (similar to http://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html, but with custom bins) to validate predictive models (a pretty standard task in the machine learning community). That's why I set the value of the bin to its midpoint and use a numeric scale rather than categorical---there may be empty bins, so a categorical scale that just skips those altogether would make the x- and y-axes out of sync with one another.
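
The calibration use case described above can be sketched roughly as follows. The predictions, outcomes and bin edges are hypothetical, and the sketch assumes a pandas version whose groupby accepts the observed keyword; it stops short of plotting.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted probabilities and observed 0/1 outcomes.
pred = pd.Series([0.05, 0.12, 0.18, 0.55, 0.62, 0.91, 0.97])
obs = pd.Series([0, 0, 1, 1, 0, 1, 1])

# Custom, unevenly spaced bin edges, labelled by their midpoints.
edges = np.array([0.0, 0.1, 0.2, 0.4, 0.7, 1.0])
mids = (edges[:-1] + edges[1:]) / 2

bins = pd.cut(pred, bins=edges, labels=mids)

# observed=False keeps the empty (0.2, 0.4] bin as a NaN row.
calib = obs.groupby(bins, observed=False).mean()

# Numeric index, so a plot would place each bin at its midpoint.
calib.index = calib.index.astype(float)
print(calib)
```

With a float index the empty bin still occupies its position on the x-axis, which is the property the numeric scale is meant to preserve.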

@jreback
Contributor Author

jreback commented Jun 9, 2015

@therriault thanks for the feedback. I think the plotting will be addressed in #10254. Feel free to chime in on that. I am deferring the PR that I had done on this issue till 0.17.0 and will think about it some more.

@jorisvandenbossche
Member

@therriault see the discussion in #10254: we are thinking to change the way categoricals are plotted all the same. So in that case, that will not fix your original reported issue and you will still need the workaround.

BTW, for your index, instead of reassigning the index labels, it is probably better to just convert your index to an integer one (eg with z.index = z.index.astype(int) or z.index = np.asarray(z.index)); this will probably be more robust in case there are bins missing from the result.
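
A small sketch of why the conversion is the more robust of the two workarounds, using made-up values that leave most bins empty. It assumes a pandas version whose groupby accepts the observed keyword; observed=True is used here to force the "bins missing" situation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [3.0, 27.5, 61.2, 64.9, 22.0]})  # made-up values
df["y"] = df.x ** 2
binedges = np.arange(0, 110, 10)
binlabels = np.arange(5, 105, 10)

groups = pd.cut(df.x, bins=binedges, labels=binlabels)
# observed=True drops the empty bins, so only 3 of the 10 labels remain.
z = df.groupby(groups, observed=True).y.mean()

try:
    z.index = binlabels  # 10 labels for 3 rows
except ValueError as err:
    print("reassigning labels failed:", err)

# Converting keeps each row paired with the label it actually has.
z.index = z.index.astype(int)
print(z)
```

np.asarray(z.index) can be used in place of astype(int) in the same way.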

@jreback jreback closed this Jun 20, 2015
Labels
API Design Categorical Categorical Data Type Groupby Needs Discussion Requires discussion from core team before further action
Development

Successfully merging this pull request may close these issues.

Grouping by pd.cut() results creates a categorical index
4 participants