-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REGR: ensure passed binlabels to pd.cut have a compat dtype on output (#10140) #10252
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc @therriault |
d12db3a
to
137ca29
Compare
I changed this slightly, so that non object/categorical indexes will be coerced (meaning that if you are passing in an object/cat index, or a list of stuff that is object like - e.g. strings), then you would get a categorical out (which is the same as current behavior). So the following is essentially equivalent to the above (e.g. we are passing an index-like in, so preserve it) - fyi, this works for lists too, it just needs to be convertible to an Index.
|
@jreback Thanks! BTW, I understand now our confusion yesterday a bit, you were still talking about the Index that comes out of a groupby with cut, but I think it is simpler to just talk about the output of To summarise with a smaller example:
So the 'rule' is now: dtype of the resulting Series is the dtype of the provided
(while now (0.16.1) both examples above would return a Categorical Series with integer categories) @shoyer @TomAugspurger agree with the above? |
@jreback still two comments:
This seems to deviate from the 'rule'. For consistency, I would personally not do this (also return a object Series in the example above, and requiring
Shouldn't the first case be an array? |
@jorisvandenbossche slight addition to your rule: when |
no, arrays and Index are equivalent. The output is a
This is ALSO true for integers/datetimes, however I think these are more natural in their non-categorical form already. |
@jorisvandenbossche your last point is the crux of the issue. We are making a change here to have the output depend on the input (e.g. what type I think this is for practicality. The output of |
@jreback Hmm, it is indeed a bit odd, I have to agree. And looking back at older versions (what I forgot yesterday when I looked at this issue), |
@jorisvandenbossche I agree. It IS correct to propgate the labels dtype thru the groupby. Ok let's defer this and see if your plotting 'fix' (at least the first version), solves this. |
Sorry for missing the point that is has always been a Categorical when I started the discussion yesterday! |
Thanks for working through this everyone (and @jreback, sorry for not chiming in earlier). Totally understand the concern about changing the output type, which is not where I meant for this to go at all! Ultimately, it sounds like my original usage just took advantage of an accidental bit of functionality that existed prior to the introduction of categorical indicates but wasn't part of the intended design. As such, if trying to restore it just creates more problems elsewhere, it's probably not worth fixing---it's straightforward enough to work around by assigning In any case, changing the way these are plotted (as you discuss on the other thread) would address the primary issue. My main use case is in creating calibration plots (similar http://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html, but with custom bins) to validate predictive models (a pretty standard task in the machine learning community). That's why I set the value of the bin to its midpoint and use a numeric scale rather than categorical---there may be empty bins, so a categorical scale that just skips those altogether would make the x- and y-axes out of sync with one another. |
@therriault thanks for the fb. I think the plotting will be addressed in #10254. Feel free to chime in on that. I am defering the PR that had done on this issue till 0.17.0 and will think about it some more. |
@therriault see the discussion in #10254: we are thinking to change the way categoricals are plotted all the same. So in that case, that will not fix your original reported issue and you will still need the workaround. BTW, for your index, instead of reassigning the indexlabels, you can maybe better just convert your index to an integer one (eg with |
From the original issue, closes #10140
This now yields an
Int64Index
If you pass an index in, then you will get that out, specifically a
Categorical
will preserve the categories (note the 2nd example has reversed labels - why anyone would do this, don't know.......)