Numpy 'inf' values cause pandas.cut to fail #24314

Closed
vaughnkoch opened this issue Dec 16, 2018 · 6 comments · Fixed by #24327
Labels
Error Reporting: Incorrect or improved errors from pandas
Reshaping: Concat, Merge/Join, Stack/Unstack, Explode

@vaughnkoch

Code Sample, a copy-pastable example if possible

import pandas as pd

foo = pd.Series([1, 2, 3])
bar = pd.Series([1, 2, 0])
baz = foo / bar  # 3 / 0 yields inf, so baz is [1.0, 1.0, inf]
cut = pd.cut(baz, 8, duplicates='drop')  # raises the ValueError below

*** ValueError: missing values must be missing in the same location both left and right sides

Problem description

Having an 'inf' value in a Series seems to cause pandas.cut to fail with this error:
*** ValueError: missing values must be missing in the same location both left and right sides

I saw bug #19768 already, but that was fixed by PR #19833 in February, and I'm using 0.23.4, which was released on August 3, 2018. There's also #5483, which was fixed a long time ago.

Expected Output

'inf' should probably be similar to the current NA-handling behavior: it should at least not raise an exception, and just drop that as a usable value.

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.10.0
pip: 18.1
setuptools: 40.2.0
Cython: 0.29.2
numpy: 1.15.1
scipy: 1.1.0
pyarrow: 0.11.0
xarray: None
IPython: 6.5.0
sphinx: 1.8.2
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.10
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: 4.2.4
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jschendel
Member

This should raise, as bins should not be specified as an integer when the input data contains infinity. The error message could certainly be improved though.

A more concise example of the error in question:

In [2]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=3)
---------------------------------------------------------------------------
ValueError: Bin edges must be unique: array([nan, inf, inf, inf]).
You can drop duplicate edges by setting the 'duplicates' kwarg

In [3]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=3, duplicates='drop')
---------------------------------------------------------------------------
ValueError: missing values must be missing in the same location both left and right sides

Note that duplicates='drop' is just delaying the error from [2].

As to why this is invalid, from the documentation for cut we have the following description for bins:

bins : int, sequence of scalars, or pandas.IntervalIndex
The criteria to bin by.

  • int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

Specifying bins as an integer means getting that number of equal-width bins that span the range of the input data. When your input data contains infinity, the range is infinite, so each bucket would also need to be infinite, which doesn't make sense.
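To make the failure concrete, here is a minimal sketch of what equal-width edge computation looks like when the maximum is infinite. It assumes the edges come from something like np.linspace over [min, max], which matches the edges shown in the error from [2]; it is an illustration, not pandas' actual code path.

```python
import numpy as np
import pandas as pd

data = np.array([0, 1, 2, 3, 4, np.inf])
n_bins = 3

# Equal-width edges across [min, max]. With max == inf the step is infinite,
# and the first edge becomes 0 * inf, which is nan.
with np.errstate(all="ignore"):  # silence the floating-point warnings
    edges = np.linspace(data.min(), data.max(), n_bins + 1)

print(edges)  # first edge is nan, the remaining edges are inf

# Dropping duplicate edges leaves a single "bin" whose left edge is nan and
# whose right edge is inf, so no valid interval can be built from it.
deduped = pd.unique(edges)
print(deduped)
```

This is also consistent with duplicates='drop' only delaying the error: removing the repeated inf edges still leaves one missing edge next to one non-missing edge.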

The way to handle this would be to specify bins using one of the alternative options, or to use qcut:

In [4]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=[-1, 2, np.inf])
Out[4]:
[(-1.0, 2.0], (-1.0, 2.0], (-1.0, 2.0], (2.0, inf], (2.0, inf], (2.0, inf]]
Categories (2, interval[float64]): [(-1.0, 2.0] < (2.0, inf]]

In [5]: pd.qcut([0, 1, 2, 3, 4, np.inf], q=3)
Out[5]:
[(-0.001, 1.667], (-0.001, 1.667], (1.667, 3.333], (1.667, 3.333], (3.333, inf], (3.333, inf]]
Categories (3, interval[float64]): [(-0.001, 1.667] < (1.667, 3.333] < (3.333, inf]]

@jschendel added the Reshaping and Error Reporting labels on Dec 18, 2018
@jschendel added this to the 0.24.0 milestone on Dec 18, 2018
@vaughnkoch
Author

That makes sense, thanks for sharing.

What do you think of having it just ignore inf values (and raise if there are no binnable values)?
If you have inf values in the source data, the caller has to do extra work to remove them anyways, which pandas could do as well. Then you could just put that in the docs that inf is converted to NA.

@jschendel
Member

What do you think of having it just ignore inf values (and raise if there are no binnable values)?

I'd be a bit hesitant to silently ignore infinite values. This would make the handling of infinity a bit inconsistent within cut, since infinity is valid for other ways of using cut. It would also make it harder for users to diagnose whether they're using the wrong combination of parameters with cut, or whether something is wrong with their assumptions about their data (infinity being present when it shouldn't be). Additionally, I'm not aware of any other methods that handle infinity in a similar way (I could be forgetting something), so it would set a new precedent within the codebase with regard to infinity.

All that being said, I wouldn't necessarily be opposed to changing the behavior in that regard if a consensus is reached to support it. I think it would warrant a larger discussion though. Feel free to open a new issue regarding the proposed behavior if it's something you feel strongly about, so it has more visibility to the other devs.

@vaughnkoch
Author

What about having the default be to raise (with an informative error message), but to add an additional optional parameter to pd.cut which would then ignore infs? That way, if someone uses the option, they're explicitly saying infs can be ignored, and if they get the error otherwise, it'll be an easy change instead of having to put an extra line to drop infs.
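A rough sketch of what that opt-in behavior could look like from user code today. The wrapper name cut_finite and the ignore_inf parameter are hypothetical and are not part of pandas; the sketch only assumes pd.cut and Series.replace behave as documented.

```python
import numpy as np
import pandas as pd


def cut_finite(x, bins, ignore_inf=False, **kwargs):
    """Hypothetical wrapper around pd.cut; `ignore_inf` is not a pandas option."""
    x = pd.Series(x, dtype="float64")
    has_inf = np.isinf(x).any()
    if has_inf and isinstance(bins, (int, np.integer)):
        if not ignore_inf:
            # Default: fail loudly with a message that names the real problem.
            raise ValueError(
                "cannot use an integer number of bins when the data contains "
                "inf; pass explicit bin edges or set ignore_inf=True"
            )
        # Opt-in: treat +/-inf like missing values so they drop out of binning.
        x = x.replace([np.inf, -np.inf], np.nan)
    return pd.cut(x, bins, **kwargs)


baz = pd.Series([1, 2, 3]) / pd.Series([1, 2, 0])  # [1.0, 1.0, inf]
result = cut_finite(baz, 8, ignore_inf=True)
print(result)  # the inf entry comes back as NaN instead of raising
```

Explicit bin edges pass straight through to pd.cut, so the existing handling of infinity for that case is left alone.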

@jschendel
Member

Yes, that would be a more viable alternative. Would still like a new issue opened regarding it for more discussion. To be honest, I don't think this would be a high priority item to be filled, unless it's something you'd be willing to contribute, given that it's a rather specific request that I don't think would be frequently used. Also keep in mind that there's a bit of maintenance burden associated with adding keyword arguments like this; the user workaround is a single line of code, but implementing this in the codebase requires quite a bit more work, and would need to be maintained going forward.

@HansBambel

@jschendel Sadly using qcut also results in an error:
pd.qcut([1,2,3,4,5,-np.inf, np.inf], q=3, duplicates="drop")
results in ValueError: missing values must be missing in the same location both left and right sides

I was expecting the first and last bin to contain np.inf. This was working in pandas 1.1.5.
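One possible way to get the first and last bins to hold the infinite values, sketched under the assumption that quantiles of the finite values are acceptable as bin edges. This is not equivalent to qcut on the full data, which ranks the infinite values along with everything else, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, -np.inf, np.inf])
finite = values[np.isfinite(values)].to_numpy()

# Quantile edges computed on the finite values only (three quantile bins).
edges = np.quantile(finite, [0, 1 / 3, 2 / 3, 1])

# Open up the outer bins so that -inf and +inf land in the first and last bin.
edges[0], edges[-1] = -np.inf, np.inf

# include_lowest=True is needed because the first interval is open on the
# left, which would otherwise exclude a value equal to the -inf edge.
binned = pd.cut(values, bins=edges, include_lowest=True)
print(binned)
```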

Similar to #11113

3 participants