Handle ExtensionArrays in cut #31389

Open
TomAugspurger opened this issue Jan 28, 2020 · 4 comments
Labels: cut, Enhancement, ExtensionArray

Comments

@TomAugspurger
Contributor

Followup to #31290. Currently pd.cut doesn't play nicely with all extension arrays. To support them, I think we'll need one addition to the interface.

We need an array of integers to pass to searchsorted in

ids = ensure_int64(bins.searchsorted(x, side=side))

I think the only requirement is that the integer-encoded values need to have the same ordering as the original values. (I forget the math term for this type of mapping.)

It doesn't matter what value is used for missing values, as long as it's distinct.

We can't quite use factorize(arr)[0] since it doesn't have the ordering requirement.
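To make the ordering requirement concrete, here is a minimal sketch on a plain NumPy array (the sample values are made up for illustration). It contrasts the default `pd.factorize` codes, which follow order of first appearance, with the inverse indices from `np.unique`, which follow sorted order. It only illustrates the property being asked for and is not a proposal for the actual interface addition.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen so that appearance order != sorted order.
values = np.array([30, 10, 20, 10])

# Default factorize assigns codes by order of first appearance.
codes, uniques = pd.factorize(values)
# codes -> [0, 1, 2, 1]: 30 gets the smallest code although it is the largest value.

# np.unique returns sorted uniques, so the inverse indices preserve ordering:
# values[i] < values[j] implies ordered_codes[i] < ordered_codes[j].
sorted_uniques, ordered_codes = np.unique(values, return_inverse=True)
# ordered_codes -> [2, 0, 1, 0]
```

Only the second encoding can be compared against bin edges that were encoded with the same mapping.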

@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Jan 28, 2020
@jschendel jschendel added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff ExtensionArray Extending pandas with custom dtypes or arrays. labels Jan 28, 2020
@jorisvandenbossche
Member

Another option is some masked version (and then the option for the EA to either return a mask or return an na_value that sorts correctly).

For this specific case, I think we know that bins will not have any NAs? And so the missing values in x just need to be propagated? So they don't actually need to be searchsorted correctly, we could mask them out.

This might be solved together with making searchsorted work with arrays that use pd.NA (#30944)

@TomAugspurger
Contributor Author

> For this specific case, I think we know that bins will not have any NAs? And so the missing values in x just need to be propagated? So they don't actually need to be searchsorted correctly, we could mask them out.

AFAICT, yes. We can pass nonsense values for NA values into np.searchsorted(bins, x) since they're masked later on. The only requirement is that it's an integer.
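A rough sketch of that masking idea, assuming a nullable `Int64` input and plain integer bins (the data and the `-1` marker are illustrative choices, not what `pd.cut` does internally): fill the missing slots with an arbitrary integer, run `searchsorted`, then overwrite the results at the masked positions.

```python
import numpy as np
import pandas as pd

# Hypothetical input with one missing value, and NA-free bin edges.
x = pd.array([1, None, 7, 4], dtype="Int64")
bins = np.array([0, 3, 6, 9])

# Remember where the missing values are.
na_mask = np.asarray(x.isna())

# Fill NAs with an arbitrary integer; the result at those slots is discarded.
filled = x.to_numpy(dtype="int64", na_value=0)

ids = bins.searchsorted(filled, side="left")

# Mark the masked positions so they can be propagated as missing later.
ids = np.where(na_mask, -1, ids)
# ids -> [1, -1, 3, 2]
```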

@dsaxton
Member

dsaxton commented Jan 30, 2020

I also noticed that qcut is acting funny:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: arr = pd.array(np.arange(100), dtype="Int64")

In [4]: pd.qcut(arr, 5)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-4-24061a2bd8ff> in <module>
----> 1 pd.qcut(arr, 5)

~/pandas/pandas/core/reshape/tile.py in qcut(x, q, labels, retbins, precision, duplicates)
    353         include_lowest=True,
    354         dtype=dtype,
--> 355         duplicates=duplicates,
    356     )
    357 

~/pandas/pandas/core/reshape/tile.py in _bins_to_cuts(x, bins, right, labels, precision, include_lowest, dtype, duplicates)
    395 
    396     if include_lowest:
--> 397         ids[x == bins[0]] = 1
    398 
    399     na_mask = isna(x) | (ids == len(bins)) | (ids == 0)

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

In [5]: pd.__version__
Out[5]: '1.0.0rc0+240.g92745fc78'

Seems the problem there is with indexing into a numpy array using a BooleanArray. I can open up a PR that I think fixes it.
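One way to sidestep that kind of failure, sketched here with made-up data and not necessarily what the eventual PR does, is to convert the nullable boolean mask to a plain NumPy boolean array before using it as an index. Treating `<NA>` as `False` is an assumption of this sketch; it fits here because missing values are masked out separately afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the variables in _bins_to_cuts.
x = pd.array([0, 5, None, 0], dtype="Int64")
ids = np.array([0, 2, 0, 0], dtype="int64")
first_edge = 0  # plays the role of bins[0]

# Comparing a nullable integer array yields a nullable BooleanArray.
nullable_mask = x == first_edge

# Convert to a plain bool ndarray, mapping <NA> to False.
np_mask = nullable_mask.to_numpy(dtype=bool, na_value=False)

ids[np_mask] = 1
# ids -> [1, 2, 0, 1]
```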

@jbrockmendel jbrockmendel added the cut cut, qcut label Feb 25, 2020
@mroeschke mroeschke removed the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label May 13, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@bernwaldl

bernwaldl commented Aug 9, 2023

I ran into a problem today that resulted in incorrectly binned integers. I am working with Unix timestamps and trying to bin them into intervals. My minimal working example is the following.

import pandas as pd
series = pd.Series([
                    1690658391162578177,
                    1690658391162578182,
                    ])
cutoffs = pd.Series([0, 1690658391162578182]).astype(pd.UInt64Dtype())
print(pd.cut(series, bins=pd.IntervalIndex.from_breaks(cutoffs, closed='left')))
series = series.astype(pd.UInt64Dtype())
print(pd.cut(series, bins=pd.IntervalIndex.from_breaks(cutoffs, closed='left')))

The two calls to cut produce different outputs depending on whether the series has dtype UInt64Dtype or a regular integer dtype, because in the former case the values are cast to float64 internally. I am on pandas 2.0.3.
Although I do not fully understand the problem and suspect it is more complex, I think the code should at least emit a warning, or the function's documentation should warn that it does not work correctly for all data types.
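The precision loss behind this can be reproduced without `pd.cut`. The sketch below uses only NumPy and the two timestamps from the example above: float64 has a 53-bit significand, so integers of this magnitude (between 2**60 and 2**61) are representable only in steps of 256, and both values round to the same float.

```python
import numpy as np

a = 1690658391162578177
b = 1690658391162578182  # also the upper bin edge in the example above

as_uint = np.array([a, b], dtype="uint64")
as_float = as_uint.astype("float64")

# The integers differ, but their float64 representations are identical.
distinct_as_int = bool(as_uint[0] != as_uint[1])      # True
same_as_float = bool(as_float[0] == as_float[1])      # True

# Both collapse to the nearest multiple of 256.
rounded = int(as_float[0])                            # 1690658391162578176
```

Once the values and the bin edge are indistinguishable, a left-closed interval ending at `b` can no longer separate `a` from `b`.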
