ERR: cut/qcut need better error message when passing invalid input #13318

Closed
simonm3 opened this issue May 29, 2016 · 11 comments · Fixed by #30691
Labels
Dtype Conversions (Unexpected or buggy dtype conversions); Error Reporting (Incorrect or improved errors from pandas); good first issue; quantile (quantile method); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

@simonm3

simonm3 commented May 29, 2016

labels=False means use integers as category names.

To use the category labels I would expect to say labels=True, but instead you have to say labels=None.

It seems illogical to say labels=None when you want labels.

@jreback
Contributor

jreback commented May 30, 2016

Not sure what you are referring to; labels=True is not valid. It can only accept None, False, or a list-like.

True actually causes an error, so we should check for that. Please send a pull request!

In [18]: pd.qcut(range(5), 4, labels=['good','bad','ugly','terrible'])
Out[18]: 
[good, good, bad, ugly, terrible]
Categories (4, object): [good < bad < ugly < terrible]

In [19]: pd.qcut(range(5), 4, labels=None)
Out[19]: 
[[0, 1], [0, 1], (1, 2], (2, 3], (3, 4]]
Categories (4, object): [[0, 1] < (1, 2] < (2, 3] < (3, 4]]

In [20]: pd.qcut(range(5), 4, labels=False)
Out[20]: array([0, 0, 1, 2, 3])

In [21]: pd.qcut(range(5), 4, labels=['good','bad','ugly','terrible'])
Out[21]: 
[good, good, bad, ugly, terrible]
Categories (4, object): [good < bad < ugly < terrible]

In [22]: pd.qcut(range(5), 4, labels=True)
TypeError: object of type 'bool' has no len()

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas Difficulty Novice labels May 30, 2016
@jreback jreback added this to the Next Major Release milestone May 30, 2016
@jreback jreback changed the title qcut and cut labels=True gives error ERR: cut/qcut need better error message when passing invalid input May 30, 2016
@simonm3
Author

simonm3 commented May 30, 2016

Exactly. I was thinking of labels as things like age20-30. So labels=False means no labels, just use 1 2 3 4. labels=(a, b, c) means use user-defined labels, and labels=True would mean use the system-defined labels.

labels=None suggests to me no labels. labels=True suggests add labels.

I reckon most people would expect True to mean add labels rather than fail with an error.

@jreback
Contributor

jreback commented May 30, 2016

doc-string

In [5]: pd.qcut?
Signature: pd.qcut(x, q, labels=None, retbins=False, precision=3)
Docstring:
Quantile-based discretization function. Discretize variable into
equal-sized buckets based on rank or based on sample quantiles. For example
1000 values for 10 quantiles would produce a Categorical object indicating
quantile membership for each data point.

Parameters
----------
x : ndarray or Series
q : integer or array of quantiles
    Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately
    array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles
labels : array or boolean, default None
    Used as labels for the resulting bins. Must be of the same length as
    the resulting bins. If False, return only integer indicators of the
    bins.
retbins : bool, optional
    Whether to return the bins or not. Can be useful if bins is given
    as a scalar.
precision : int
    The precision at which to store and display the bins labels

Returns
-------
out : Categorical or Series or array of integers if labels is False
    The return type (Categorical or Series) depends on the input: a Series
    of type category if input is a Series else Categorical. Bins are
    represented as categories when categorical data is returned.
bins : ndarray of floats
    Returned only if `retbins` is True.

Notes
-----
Out of bounds values will be NA in the resulting Categorical object

@simonm3
Author

simonm3 commented May 30, 2016

Yes, it does what it says in the docs.

What I am saying is that it would be much clearer if labels=True were defined as "yes please, add some labels".

@jreback
Contributor

jreback commented May 30, 2016

labels=True doesn't make any sense; you have to pass IN the labels. What does "please have labels" mean? I think you can accidentally mistake the bin integers for actual labels. I would rather have a nice error message for labels=True.
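As an illustration of the "nice error message" idea, here is a minimal sketch of an up-front check on the labels argument. The helper name `check_labels` and the message wording are hypothetical and are not taken from pandas; the only behaviour assumed from the thread is that None, False, and a list-like are the valid inputs.

```python
from collections.abc import Iterable


def check_labels(labels, nbins):
    """Hypothetical validator for a `labels` argument (not the pandas code)."""
    # None (auto-generated labels) and False (integer codes) pass through.
    if labels is None or labels is False:
        return labels
    # Reject True, strings, and other non-list-like values with a clear message
    # instead of letting a later len() call fail on a bool.
    if isinstance(labels, (bool, str)) or not isinstance(labels, Iterable):
        raise ValueError(
            f"labels must be None, False, or a list-like of {nbins} labels; "
            f"got {labels!r}"
        )
    labels = list(labels)
    if len(labels) != nbins:
        raise ValueError(f"expected {nbins} labels, got {len(labels)}")
    return labels


try:
    check_labels(True, 4)
except ValueError as err:
    print(err)
# prints: labels must be None, False, or a list-like of 4 labels; got True
```

With a check like this, labels=True fails with a message naming the accepted values rather than with "object of type 'bool' has no len()".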

@simonm3
Author

simonm3 commented May 30, 2016

Just my feedback, as a new user of cut/qcut, on how it could be made more intuitive.

I suggest you ask other new users what they think. I would imagine most would say that setting labels=None when you want labels does not seem intuitive.

@onesandzeroes
Contributor

@jreback I'm not seeing why labels=True is obviously wrong. The docstring does say it accepts 'array or boolean', so I can see why people would try passing True.

But then OP is suggesting that labels=True should produce new behaviour with automatic labels like '{varname}{group_min}-{group_max}'. That seems reasonable enough as a default set of labels.

If we don't want the new behaviour maybe the docstring should just explicitly say 'array or False' so people don't try to pass True.
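For readers wondering what automatic labels of the form '{varname}{group_min}-{group_max}' might look like, here is a small sketch that builds them from a list of bin edges. The function `range_labels` and its `prefix` parameter are hypothetical names used for illustration; nothing like this is confirmed to exist in pandas.

```python
def range_labels(edges, prefix=""):
    """Build one label per bin from consecutive edges (hypothetical helper)."""
    # Pair each edge with the next one: (e0, e1), (e1, e2), ...
    return [f"{prefix}{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]


print(range_labels([20, 30, 40, 50], prefix="age"))
# prints: ['age20-30', 'age30-40', 'age40-50']
```

A list built this way has one entry per bin, so it could be passed as the labels argument, which the docstring says must be the same length as the resulting bins.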

@jreback
Contributor

jreback commented Jun 1, 2016

The purpose of this issue is to fix the doc-string and raise an appropriate message on labels=True. The default IS to provide labels if labels are not overridden.

In [1]: pd.qcut(range(5), 4, labels=None)
Out[1]: 
[[0, 1], [0, 1], (1, 2], (2, 3], (3, 4]]
Categories (4, object): [[0, 1] < (1, 2] < (2, 3] < (3, 4]]

In [2]: pd.qcut(range(5), 4, labels=None).categories
Out[2]: Index([u'[0, 1]', u'(1, 2]', u'(2, 3]', u'(3, 4]'], dtype='object')

I suppose you could change this so that labels=True means what labels=None means now. I think this would be backward compatible (as the code is specifically checking for False and not None).

Further, I'm not really sure labels=False is that useful anymore (before Categoricals were first class it might have been, to provide numpy compat).

So if one of you wants to take this up and see what's possible w/o breaking anything (or just raise appropriately on labels=True) - go for it

@simonm3
Author

simonm3 commented Jun 1, 2016

> not really sure labels=False is that useful anymore

I agree. It is the False that is confusing, because it implies there is a True; and the False is unnecessary as you can just use range instead.

@jreback
Contributor

jreback commented Jun 1, 2016

well, the point is you don't need to normally specify labels as they are auto-generated by default.

labels=False just turns this off (which is what I say is a bit counter-intuitive). If you didn't allow a boolean there (the False) I don't think we would be having this discussion. labels would be just to specify your own specific ones.

> as you can just use Range instead.

This is not very convenient; unless you are also passing in bins, you don't want to have the nbins parameter floating around in 2 different places.
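To make the point about the bin count appearing twice concrete, here is a toy sketch. `toy_cut` is a hypothetical stand-in written for this illustration (simple equal-width binning over integers); it is not pandas.cut or pandas.qcut and makes no claim about how they are implemented.

```python
from bisect import bisect_left


def toy_cut(values, nbins, labels):
    """Toy equal-width binning; a stand-in for illustration, NOT pandas.cut."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    edges = [lo + i * width for i in range(nbins + 1)]
    # Right-closed bins; the minimum value is placed in the first bin.
    codes = [min(max(bisect_left(edges, v) - 1, 0), nbins - 1) for v in values]
    if labels is False:
        return codes  # integer bin indicators only
    labels = list(labels)
    return [labels[c] for c in codes]


data = [0, 1, 2, 3, 4]
nbins = 4

# labels=False: the bin count is written once.
codes = toy_cut(data, nbins, labels=False)

# The range-based alternative gives the same values, but the bin count now
# has to be repeated inside the labels argument.
same = toy_cut(data, nbins, labels=range(nbins))

assert codes == same
```

The two calls agree, which is why False can look redundant; the cost of dropping it is that the bin count must be kept in sync in two places.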

@ryankarlos
Contributor

take

@jreback jreback modified the milestones: Contributions Welcome, 1.0 Jan 7, 2020
6 participants