Handle ExtensionArrays in cut #31389
Comments
Another option is some masked version (with the option for the EA to return either a mask or an na_value that sorts correctly). For this specific case, I think we know that This might be solved together with making |
AFAICT, yes. We can pass nonsense values for NA values into |
I also noticed that
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: arr = pd.array(np.arange(100), dtype="Int64")
In [4]: pd.qcut(arr, 5)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-4-24061a2bd8ff> in <module>
----> 1 pd.qcut(arr, 5)
~/pandas/pandas/core/reshape/tile.py in qcut(x, q, labels, retbins, precision, duplicates)
353 include_lowest=True,
354 dtype=dtype,
--> 355 duplicates=duplicates,
356 )
357
~/pandas/pandas/core/reshape/tile.py in _bins_to_cuts(x, bins, right, labels, precision, include_lowest, dtype, duplicates)
395
396 if include_lowest:
--> 397 ids[x == bins[0]] = 1
398
399 na_mask = isna(x) | (ids == len(bins)) | (ids == 0)
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
In [5]: pd.__version__
Out[5]: '1.0.0rc0+240.g92745fc78'
Seems the problem there is with indexing into a NumPy array using a BooleanArray. I can open up a PR that I think fixes it.
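To illustrate the indexing problem described above, here is a minimal sketch of one possible workaround: converting the BooleanArray to a plain NumPy boolean mask before using it as an index. The names `ids` and `np_mask` are illustrative stand-ins, not the actual variables or the fix that landed in pandas, and treating missing comparisons as `False` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Nullable integer input, as in the qcut report above.
arr = pd.array(np.arange(10), dtype="Int64")

# Stand-in for the integer bin ids that searchsorted would produce (hypothetical values).
ids = np.full(len(arr), 2, dtype=np.int64)

# Comparing an Int64 array yields a pandas BooleanArray, not a NumPy bool ndarray.
mask = arr == 0

# Convert explicitly; na_value=False means "a missing comparison is not a match"
# (an assumption for this sketch).
np_mask = mask.to_numpy(dtype=bool, na_value=False)

# A plain boolean ndarray is always a valid NumPy index.
ids[np_mask] = 1
print(ids)
```

Whether missing values should count as non-matches depends on how the caller masks NA afterwards; in `_bins_to_cuts` the traceback shows a separate `na_mask` is built on the following line.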
I ran into a problem today that resulted in incorrectly binned integers. I am working with Unix timestamps and trying to bin them into intervals. My minimal working example is the following.
import pandas as pd
series = pd.Series([
1690658391162578177,
1690658391162578182,
])
cutoffs = pd.Series([0, 1690658391162578182]).astype(pd.UInt64Dtype())
print(pd.cut(series, bins=pd.IntervalIndex.from_breaks(cutoffs, closed='left')))
series = series.astype(pd.UInt64Dtype())
print(pd.cut(series, bins=pd.IntervalIndex.from_breaks(cutoffs, closed='left')))
The two calls to |
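One thing worth checking with values this large, offered as an assumption rather than a diagnosis confirmed in this thread: both timestamps exceed 2**53, so any code path that converts them to float64 can no longer tell them apart. The sketch below uses only the two integers from the example and plain Python floats (which are float64).

```python
# The two timestamps from the example above.
a = 1690658391162578177
b = 1690658391162578182

# As integers they are distinct.
distinct_as_int = a != b

# float64 has a 53-bit significand; both values are far above 2**53,
# so converting them to float rounds them to the same representable number.
above_exact_range = a > 2**53
equal_as_float = float(a) == float(b)

print(distinct_as_int, above_exact_range, equal_as_float)
```

If the two `pd.cut` calls disagree, comparing the dtypes of the intermediate arrays for an unexpected float64 would be a reasonable first step.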
Followup to #31290. Currently pd.cut doesn't play nicely with all extension arrays. To support them, I think we'll need one addition to the interface.
We need an array of integers to pass to searchsorted in
pandas/pandas/core/reshape/tile.py
Line 394 in 4edcc55
It doesn't matter what value is used for missing values, as long as it's distinct.
We can't quite use factorize(arr)[0] since it doesn't have the ordering requirement.
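The ordering point can be seen in a small sketch. By default `pd.factorize` assigns codes in order of first appearance, so the codes do not preserve the ordering of the values. Passing `sort=True` is shown only for contrast; whether it would satisfy the interface addition proposed here is not something this thread settles.

```python
import numpy as np
import pandas as pd

values = np.array([30, 10, 20, 10])

# Default: codes follow order of first appearance, not value order.
codes_unsorted, uniques_unsorted = pd.factorize(values)
print(codes_unsorted, uniques_unsorted)

# sort=True: uniques are sorted, so a smaller value always gets a smaller code.
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)
print(codes_sorted, uniques_sorted)
```

With the default codes, 30 maps to 0 and 10 maps to 1, so comparing codes would rank 30 below 10; that is the property searchsorted cannot tolerate.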