Numpy 'inf' values cause pandas.cut to fail #24314

vaughnkoch · 2018-12-16T23:02:14Z

Code Sample, a copy-pastable example if possible

import pandas as pd

foo = pd.Series([1, 2, 3])
bar = pd.Series([1, 2, 0])
baz = foo / bar  # 3 / 0 yields inf, so baz is [1.0, 1.0, inf]
cut = pd.cut(baz, 8, duplicates='drop')  # raises the ValueError below on pandas 0.23.4

*** ValueError: missing values must be missing in the same location both left and right sides

Problem description

Having an 'inf' value in a Series seems to cause pandas.cut to fail with this error:
*** ValueError: missing values must be missing in the same location both left and right sides

I saw bug #19768 already, but that was fixed by PR #19833 in February, and I'm using 0.23.4, which was released on August 3, 2018. There's also #5483, which was fixed a long time ago.

Expected Output

'inf' should probably be handled like NA values currently are: at the very least it should not raise an exception, and it should simply be dropped as a usable value.

Output of `pd.show_versions()`

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.10.0
pip: 18.1
setuptools: 40.2.0
Cython: 0.29.2
numpy: 1.15.1
scipy: 1.1.0
pyarrow: 0.11.0
xarray: None
IPython: 6.5.0
sphinx: 1.8.2
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.10
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: 4.2.4
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


jschendel · 2018-12-18T01:34:48Z

This should raise, as bins should not be specified as an integer when the input data contains infinity. The error message could certainly be improved though.

A more concise example of the error in question:

In [2]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=3)
---------------------------------------------------------------------------
ValueError: Bin edges must be unique: array([nan, inf, inf, inf]).
You can drop duplicate edges by setting the 'duplicates' kwarg

In [3]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=3, duplicates='drop')
---------------------------------------------------------------------------
ValueError: missing values must be missing in the same location both left and right sides

Note that duplicates='drop' is just delaying the error from [2].

As to why this is invalid, from the documentation for cut we have the following description for bins:

bins : int, sequence of scalars, or pandas.IntervalIndex
The criteria to bin by.

int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

Specifying bins as an integer means getting that number of equal-width bins that span the range of the input data. When the input data contains infinity, the range is infinite, so each bin would also need to be infinitely wide, which doesn't make sense.
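
To make the arithmetic concrete, here is a minimal sketch of equal-width edge computation. It is a simplification for illustration, not pandas' actual implementation (which, per the docs quoted above, also extends the range by .1% on each side), and `equal_width_edges` is a hypothetical name. With an infinite maximum the bin width becomes inf, and 0 * inf is nan, which reproduces the shape of the edges reported in [2].

```python
import numpy as np


def equal_width_edges(values, bins):
    """Hypothetical helper: naive equal-width bin edges over the data range."""
    lo, hi = np.nanmin(values), np.nanmax(values)
    width = (hi - lo) / bins  # becomes inf when hi is inf
    # 0 * inf is nan under IEEE 754; silence the resulting warning.
    with np.errstate(invalid="ignore"):
        return lo + width * np.arange(bins + 1)


# Finite data: four evenly spaced edges between 0 and 4.
finite_edges = equal_width_edges(np.array([0, 1, 2, 3, 4.0]), 3)

# Data containing inf: the first edge is nan and the rest are inf.
inf_edges = equal_width_edges(np.array([0, 1, 2, 3, 4, np.inf]), 3)

print(finite_edges)
print(inf_edges)
```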

The way to handle this would be to specify bins using one of the alternative options, or to use qcut:

In [4]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=[-1, 2, np.inf])
Out[4]:
[(-1.0, 2.0], (-1.0, 2.0], (-1.0, 2.0], (2.0, inf], (2.0, inf], (2.0, inf]]
Categories (2, interval[float64]): [(-1.0, 2.0] < (2.0, inf]]

In [5]: pd.qcut([0, 1, 2, 3, 4, np.inf], q=3)
Out[5]:
[(-0.001, 1.667], (-0.001, 1.667], (1.667, 3.333], (1.667, 3.333], (3.333, inf], (3.333, inf]]
Categories (3, interval[float64]): [(-0.001, 1.667] < (1.667, 3.333] < (3.333, inf]]

vaughnkoch · 2018-12-18T01:39:18Z

That makes sense, thanks for sharing.

What do you think of having it just ignore inf values (and raise if there are no binnable values)?
If you have inf values in the source data, the caller has to do extra work to remove them anyway, which pandas could do as well. The docs could then state that inf is converted to NA.
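
For reference, the extra caller-side work mentioned here can look like the sketch below. It uses jschendel's example data and assumes that treating both +inf and -inf as missing is acceptable for the use case; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([0, 1, 2, 3, 4, np.inf])

# Convert +/-inf to NaN so the data range used for integer bins is finite.
finite = values.replace([np.inf, -np.inf], np.nan)

# NaN inputs are left unbinned, so the former inf entry comes back as NaN.
binned = pd.cut(finite, bins=3)
print(binned)
```

The infinite entry ends up as a missing value in the result rather than in a bin, so it is effectively dropped from the binning.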

jschendel · 2018-12-18T02:03:44Z

What do you think of having it just ignore inf values (and raise if there are no binnable values)?

I'd be a bit hesitant to silently ignore infinite values. This would make the handling of infinity a bit inconsistent within cut, since infinity is valid for other ways of using cut. It would also make it harder for users to diagnose whether they're using the wrong combination of parameters with cut, or whether something is wrong with their assumptions about their data (infinity being present when it shouldn't be). Additionally, I'm not aware of any other methods that handle infinity in a similar way (could be forgetting something), so it'd be setting a new precedent within the codebase with regard to infinity.

All that being said, I wouldn't necessarily be opposed to changing the behavior in that regard if a consensus is reached to support it. I think it would warrant a larger discussion though. Feel free to open a new issue regarding the proposed behavior if it's something you feel strongly about, so it has more visibility to the other devs.

vaughnkoch · 2018-12-18T02:09:45Z

What about having the default be to raise (with an informative error message), but to add an additional optional parameter to pd.cut which would then ignore infs? That way, if someone uses the option, they're explicitly saying infs can be ignored, and if they get the error otherwise, it'll be an easy change instead of having to put an extra line to drop infs.
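
A rough user-side sketch of this proposal is shown below. `cut_finite` and its `ignore_inf` flag are hypothetical names made up for illustration; neither is part of pandas, and this is not how pandas implements cut. The check only applies when bins is an integer, since explicit bin edges can legitimately include infinity.

```python
import numpy as np
import pandas as pd


def cut_finite(values, bins, ignore_inf=False, **kwargs):
    """Hypothetical wrapper around pd.cut; not a pandas API."""
    s = pd.Series(values, dtype="float64")
    inf_mask = np.isinf(s)
    # Only integer bins depend on the data range being finite.
    if np.isscalar(bins) and inf_mask.any():
        if not ignore_inf:
            raise ValueError(
                "cannot use an integer number of bins when the input "
                "contains infinity; pass explicit bin edges or ignore_inf=True"
            )
        s = s.mask(inf_mask)  # explicit opt-in: treat inf as missing
    return pd.cut(s, bins, **kwargs)


# Opting in: the inf entry is returned as NaN instead of raising.
result = cut_finite([0, 1, 2, 3, 4, np.inf], 3, ignore_inf=True)
print(result)
```

Calling `cut_finite([0, 1, 2, 3, 4, np.inf], 3)` without the flag raises the ValueError, which keeps the default strict.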

jschendel · 2018-12-19T01:49:50Z

Yes, that would be a more viable alternative. Would still like a new issue opened regarding it for more discussion. To be honest, I don't think this would be a high priority item to be filled, unless it's something you'd be willing to contribute, given that it's a rather specific request that I don't think would be frequently used. Also keep in mind that there's a bit of maintenance burden associated with adding keyword arguments like this; the user workaround is a single line of code, but implementing this in the codebase requires quite a bit more work, and would need to be maintained going forward.

HansBambel · 2023-01-30T16:44:31Z

@jschendel Sadly using qcut also results in an error:
pd.qcut([1,2,3,4,5,-np.inf, np.inf], q=3, duplicates="drop")
results in ValueError: missing values must be missing in the same location both left and right sides

I was expecting the first and last bin to contain np.inf. This was working in pandas 1.1.5.

Similar to #11113

jschendel added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Error Reporting (Incorrect or improved errors from pandas) labels Dec 18, 2018

jschendel added this to the 0.24.0 milestone Dec 18, 2018

jschendel mentioned this issue Dec 18, 2018

ERR: Improve error message for cut with infinity in input and integer bins #24327

Merged

jreback closed this as completed in #24327 Dec 18, 2018

HansBambel mentioned this issue Jan 31, 2023

BUG: qcut does not create bins when values contain np.inf #51085

Open
