BUG: pd.cut(df, bins=N) current behavior is incorrect, returns float where int is the limit #47996
Comments
For each item, the same interval close form is expected. So now we set the lowest edge as |
I agree with you. Although the docs suggest the following for bins:
I think this is unnecessary and, as you put it, annoying. I also think this function is probably quite useful (for categorising continuous data in an ML context) and likely to see more use. It could do with a revamp. It was also added to #40245 without yet being addressed. For
Additionally
For example
>>> import numpy as np
>>> import pandas as pd
>>> s = np.random.rand(100)
>>> s[0], s[99] = 0, 1
>>> df = pd.DataFrame({"samples": s})
>>> pd.cut(df["samples"], bins=10, retbins=True)
...
Categories (10, interval[float64, right]): [(-0.001, 0.1] < (0.1, 0.2] < (0.2, 0.3] < (0.3, 0.4] ... (0.6, 0.7] < (0.7, 0.8] < (0.8, 0.9] < (0.9, 1.0]]
>>> np.histogram(df["samples"])
(array([ 7, 4, 11, 14, 11, 9, 8, 14, 11, 11]),
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]))
If doing: |
Indeed,
That is precisely my main objection. The current behavior of this function makes stuff up. It's just a very, very poor hack. |
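The comparison with np.histogram made above can be checked directly. The following is only a sketch: the seeded generator and the names cut_edges, hist_edges, cut_counts, and hist_counts are choices made for illustration, not part of the thread. It puts the edges returned by pd.cut(..., retbins=True) next to the ones np.histogram computes for the same data.

```python
import numpy as np
import pandas as pd

# Seeded random data so the sketch is repeatable (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
s = rng.random(100)
s[0], s[99] = 0.0, 1.0  # force the data range to be exactly [0, 1]
samples = pd.Series(s, name="samples")

# Edges as pd.cut computes them for an integer bin count.
binned, cut_edges = pd.cut(samples, bins=10, retbins=True)

# Edges and counts as np.histogram computes them.
hist_counts, hist_edges = np.histogram(samples, bins=10)

# Per-bin counts from pd.cut, in bin order, via the integer category codes.
cut_counts = np.bincount(binned.cat.codes, minlength=10)

print("lowest edge from pd.cut:      ", cut_edges[0])
print("lowest edge from np.histogram:", hist_edges[0])
print("remaining edges agree:", np.allclose(cut_edges[1:], hist_edges[1:]))
print("per-bin counts agree: ", np.array_equal(cut_counts, hist_counts))
```

Only the lowest edge should differ: pd.cut moves it slightly below the data minimum so that the minimum falls inside a right-closed interval, while np.histogram keeps the true minimum and treats the last bin as closed on both sides.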
I am facing the same issue right now of
import numpy as np
import pandas as pd
s = pd.Series([1, 1, 2, 2, 3, 4, 5])
pd.cut(s, bins=[-np.inf, 2, np.inf], right=False)
returns
|
PRs are welcome. Supporting software that is free and built by unpaid volunteers is always encouraged. |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
https://github.com/FlorinAndrei/misc/blob/master/HeartDisease.csv
Issue Description
The lowest of the three bins created is (28.951, 45.0]. This is incorrect in several ways.
First off, I expect a left-inclusive bin there. That bin is not left-inclusive.
Secondly, the minimum value in that column is 29. It is not 28.951 - that float is an artifact of the library and does not exist in the data.
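To illustrate where such an edge comes from, here is a small sketch. The ages values are made up for illustration and are not taken from HeartDisease.csv. The pandas documentation says that, for an integer bins argument, the range of the data is extended by 0.1% so that the minimum and maximum are included. With right-closed intervals that extension shows up on the lowest edge, and with right=False it shows up on the highest edge instead.

```python
import pandas as pd

# Made-up integer data with minimum 29 and maximum 77 (an assumption for this sketch).
ages = pd.Series([29, 35, 45, 52, 61, 70, 77])

# Default: right-closed intervals, so the lowest edge is pushed below the minimum.
right_closed = pd.cut(ages, bins=3)

# right=False: left-closed intervals, so the highest edge is pushed above the maximum.
left_closed = pd.cut(ages, bins=3, right=False)

first_right = right_closed.cat.categories[0]
first_left = left_closed.cat.categories[0]
last_left = left_closed.cat.categories[-1]

print("right=True, first bin: ", first_right)
print("right=False, first bin:", first_left)
print("right=False, last bin: ", last_left)
```

In this sketch right=False gives a first bin that starts exactly at 29 and includes it, but the artifact does not disappear: it moves to the upper edge of the last bin.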
One workaround you can find online is this:
But this is pointless and annoying. The library should simply return the true minimum value.
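For readers who need labels anchored at the true minimum in the meantime, one possible approach is sketched below. It is not the workaround referred to above and not a pandas feature: cut_with_true_edges is a hypothetical helper that computes equal-width edges itself, asks pd.cut only for integer bin codes, and writes the labels by hand. It assumes the input has no missing values and that its minimum and maximum differ.

```python
import numpy as np
import pandas as pd


def cut_with_true_edges(values, n_bins):
    """Hypothetical helper: equal-width bins labelled with the true min and max.

    Assumes `values` is a numeric Series with no NaN and max > min.
    """
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # labels=False returns integer bin codes instead of Interval labels;
    # include_lowest=True keeps the minimum inside the first bin.
    codes = pd.cut(values, bins=edges, include_lowest=True, labels=False)
    # The first label is closed on both sides, the rest are right-closed.
    labels = [f"[{edges[0]:g}, {edges[1]:g}]"] + [
        f"({lo:g}, {hi:g}]" for lo, hi in zip(edges[1:-1], edges[2:])
    ]
    return pd.Categorical.from_codes(codes.to_numpy(), categories=labels, ordered=True)


# Made-up data for illustration only.
ages = pd.Series([29, 35, 45, 52, 61, 70, 77])
result = cut_with_true_edges(ages, 3)
print(result)
```

The trade-off is that the categories are plain strings rather than Interval objects, so interval-aware operations on the result are lost.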
Expected Behavior
The bin I expect is [29.0, 45.0]. I would also settle for [29, 45].
Installed Versions
I do have the latest Pandas installed (1.4.3), but there's another bug now with show_versions() that prevents me from printing that info.
Python 3.10.6
Numpy 1.22.4
Windows 10
Jupyter Notebook