Unexpected behavior in cut() with nullable Int64 dtype #30787

Closed
@sdmccabe

Code Sample

import numpy as np
import pandas as pd

# Nullable integer Series with one missing value at index 5
series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype='Int64')
breaks = [0, 2, 4, 6, 8]

breaks_cut = pd.cut(series, breaks)
breaks_cut
0           NaN
1    (0.0, 2.0]
2    (0.0, 2.0]
3    (2.0, 4.0]
4    (2.0, 4.0]
5           NaN
6    (0.0, 2.0]
7    (6.0, 8.0]
dtype: category
Categories (4, interval[int64]): [(0, 2] < (2, 4] < (4, 6] < (6, 8]]

Problem Description

When the input uses the nullable integer dtype Int64, pd.cut() unexpectedly bins the first non-missing value after a missing value into the lowest interval. In the example above, the value 6 at index 6, which directly follows the missing value at index 5, is binned into (0.0, 2.0] instead of (4.0, 6.0].
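
One possible workaround, sketched here on the assumption that the problem is specific to the nullable dtype, is to cast the Series to float64 before binning. The helper name cut_via_float is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd


def cut_via_float(values, bins):
    """Hypothetical helper: cast nullable integers to float64, then bin."""
    # astype("float64") converts missing entries to np.nan
    as_float = values.astype("float64")
    return pd.cut(as_float, bins)


series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype="Int64")
breaks = [0, 2, 4, 6, 8]

workaround_cut = cut_via_float(series, breaks)
print(workaround_cut)
```

The cast gives up the nullable integer dtype on the input, but the binned result is categorical either way.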

Expected Output

0           NaN
1    (0.0, 2.0]
2    (0.0, 2.0]
3    (2.0, 4.0]
4    (2.0, 4.0]
5           NaN
6    (4.0, 6.0]
7    (6.0, 8.0]
dtype: category
Categories (4, interval[int64]): [(0, 2] < (2, 4] < (4, 6] < (6, 8]]
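
The expected output can also be written as a check on category codes, which might serve as a starting point for a regression test. This is only a sketch: cut_codes is a hypothetical helper, and on an affected version such as the 0.25.3 reported here the comparison should come out False.

```python
import numpy as np
import pandas as pd

series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype="Int64")
breaks = [0, 2, 4, 6, 8]

# Codes implied by the expected output: -1 marks a value with no bin,
# 0..3 index the intervals (0, 2], (2, 4], (4, 6], (6, 8].
expected_codes = [-1, 0, 0, 1, 1, -1, 2, 3]


def cut_codes(values, bins):
    """Hypothetical helper: integer category codes produced by pd.cut."""
    return pd.cut(values, bins).cat.codes.tolist()


actual_codes = cut_codes(series, breaks)
matches_expected = actual_codes == expected_codes
print(actual_codes, matches_expected)
```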

Note that using an IntervalIndex produces the expected output.

import numpy as np
import pandas as pd
series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype='Int64')
breaks = [0, 2, 4, 6, 8]
intervals = [pd.Interval(x, y) for x, y in zip(breaks[:-1], breaks[1:])]
interval_index = pd.IntervalIndex(intervals)

interval_cut = pd.cut(series, interval_index)
interval_cut
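
For reference, the same IntervalIndex can be built in one call with pd.IntervalIndex.from_breaks, which produces right-closed intervals by default. The sketch below assumes that default matches the intervals constructed by hand above.

```python
import numpy as np
import pandas as pd

series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype="Int64")
breaks = [0, 2, 4, 6, 8]

# Right-closed intervals (0, 2], (2, 4], (4, 6], (6, 8] built from the break points
interval_index = pd.IntervalIndex.from_breaks(breaks)

interval_cut = pd.cut(series, interval_index)
print(interval_cut)
```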

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.0.0-37-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 0.25.3
numpy            : 1.17.3
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 19.3.1
setuptools       : 44.0.0.post20200102
Cython           : None
pytest           : 5.3.2
hypothesis       : None
sphinx           : 2.3.1
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.4.2
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.11.1
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.4.2
matplotlib       : 3.1.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None

Metadata

Labels

ExtensionArray: Extending pandas with custom dtypes or arrays.
Missing-data: np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
NA - MaskedArrays: Related to pd.NA and nullable extension arrays
Needs Tests: Unit test(s) needed to prevent regressions
cut: cut, qcut
good first issue
