Numpy 'inf' values cause pandas.cut to fail #24314
Comments
This should raise, as
A more concise example of the error in question:
In [2]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=3)
---------------------------------------------------------------------------
ValueError: Bin edges must be unique: array([nan, inf, inf, inf]).
You can drop duplicate edges by setting the 'duplicates' kwarg
In [3]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=3, duplicates='drop')
---------------------------------------------------------------------------
ValueError: missing values must be missing in the same location both left and right sides
Note that
As to why this is invalid, from the documentation for
Specifying
The way to handle this would be to specify
In [4]: pd.cut([0, 1, 2, 3, 4, np.inf], bins=[-1, 2, np.inf])
Out[4]:
[(-1.0, 2.0], (-1.0, 2.0], (-1.0, 2.0], (2.0, inf], (2.0, inf], (2.0, inf]]
Categories (2, interval[float64]): [(-1.0, 2.0] < (2.0, inf]]
In [5]: pd.qcut([0, 1, 2, 3, 4, np.inf], q=3)
Out[5]:
[(-0.001, 1.667], (-0.001, 1.667], (1.667, 3.333], (1.667, 3.333], (3.333, inf], (3.333, inf]]
Categories (3, interval[float64]): [(-0.001, 1.667] < (1.667, 3.333] < (3.333, inf]]
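For anyone who wants to run the comparison above as a script, here is a minimal sketch. It assumes only that an integer `bins` raises a `ValueError` when the data contains `np.inf` (the message text differs between pandas versions, so it is not checked) and that explicit bin edges ending in `np.inf` are accepted, as in `In [4]`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = [0, 1, 2, 3, 4, np.inf]

# An integer `bins` asks for equal-width bins over the range of the data.
# With an infinite maximum that range is not finite, so pd.cut raises.
try:
    pd.cut(values, bins=3)
    raised = False
except ValueError:
    raised = True

# Explicit edges sidestep the problem: the last bin is simply (2, inf].
explicit = pd.cut(values, bins=[-1, 2, np.inf])
codes = explicit.codes.tolist()  # bin index for each input value

print(raised)
print(codes)
print(explicit.categories)
```

The explicit-edge call puts `np.inf` in the last bin because bins are right-closed by default.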
That makes sense, thanks for sharing. What do you think of having it just ignore inf values (and raise if there are no binnable values)?
I'd be a bit hesitant to silently ignore infinite values. This would make the handling of infinity a bit inconsistent within
All that being said, I wouldn't necessarily be opposed to changing the behavior in that regard if a consensus is reached to support it. I think it would warrant a larger discussion though. Feel free to open a new issue regarding the proposed behavior if it's something you feel strongly about, so it has more visibility to the other devs.
What about having the default be to raise (with an informative error message), but adding an optional parameter to pd.cut which would then ignore infs? That way, anyone using the option is explicitly saying infs can be ignored, and anyone who hits the error otherwise has an easy change to make instead of having to add an extra line to drop infs.
Yes, that would be a more viable alternative. Would still like a new issue opened regarding it for more discussion. To be honest, I don't think this would be a high priority item to be filled, unless it's something you'd be willing to contribute, given that it's a rather specific request that I don't think would be frequently used. Also keep in mind that there's a bit of maintenance burden associated with adding keyword arguments like this; the user workaround is a single line of code, but implementing this in the codebase requires quite a bit more work, and would need to be maintained going forward.
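The single-line user workaround is not spelled out in the thread. One plausible form is to turn infinite values into NaN before binning, since `pd.cut` already passes missing values through as NaN. The sketch below wraps that in a helper; `cut_ignoring_inf` is a hypothetical name used for illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd


def cut_ignoring_inf(values, bins, **kwargs):
    """Hypothetical helper (not a pandas API): treat +/-inf as missing, then cut."""
    s = pd.Series(values, dtype="float64")
    # Keep finite entries; inf, -inf and existing NaN all become NaN.
    finite_only = s.where(np.isfinite(s))
    return pd.cut(finite_only, bins=bins, **kwargs)


binned = cut_ignoring_inf([0, 1, 2, 3, 4, np.inf], bins=3)
print(binned)
```

The three equal-width bins are computed from the finite values only, and the infinite entry comes back as NaN rather than raising.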
@jschendel Sadly using qcut also results in an error:
I was expecting the first and last bin to contain np.inf. This was working in pandas 1.1.5. Similar to #11113
Code Sample, a copy-pastable example if possible
Problem description
Having an 'inf' value in a Series seems to cause pandas.cut to fail with this error:
*** ValueError: missing values must be missing in the same location both left and right sides
I saw bug #19768 already, but that was fixed by PR 19833 in Feb, and I'm using 0.23.4 which was released on August 3, 2018. Also there's #5483, which was fixed a long time ago.
Expected Output
'inf' should probably be similar to the current NA-handling behavior: it should at least not raise an exception, and just drop that as a usable value.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.10.0
pip: 18.1
setuptools: 40.2.0
Cython: 0.29.2
numpy: 1.15.1
scipy: 1.1.0
pyarrow: 0.11.0
xarray: None
IPython: 6.5.0
sphinx: 1.8.2
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.10
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: 4.2.4
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None