API: generalized check_array_indexer for validating array-like getitem indexers #31150

jorisvandenbossche · 2020-01-20T13:00:52Z

Closes #30738. Also fixes the performance issue for other arrays from #30744, and related to #30308 (comment)

This generalizes the check_bool_array_indexer helper method that we added for 1.0.0 to not be specific to boolean arrays, but any array-like input, and ensures that the output is a proper numpy array that can be used to index into numpy arrays.

I think such a more general "check" is useful, so that not all (external + internal) EAs need to do what the test EAs are already doing (checking for integer arrays as well) at https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/decimal/array.py#L118-L126, and also to fix the performance issue in a general way (so far this was only done for Categorical in https://github.com/pandas-dev/pandas/pull/30747/files)

If we agree on the general idea, I still need to clean up this PR (eg remove the existing check_bool_array_indexer, update the extending docs, etc)

cc @TomAugspurger

jorisvandenbossche · 2020-01-20T13:07:25Z

In general, the __getitem__ of an EA now needs to do a

    ...
    if is_list_like(key):
        key = check_array_indexer(self, key)
    ...

In principle, we could also move the is_list_like check into check_array_indexer to make the usage even simpler. That only means that the function then needs to pass through integers and slices/None/Ellipsis.
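For readers following along, here is a minimal sketch of that pattern. It assumes the helper is importable from pandas.api.indexers (as in released pandas 1.0+); ToyArray is a hypothetical container used only for illustration, not a real ExtensionArray subclass.

```python
import numpy as np
from pandas.api.indexers import check_array_indexer
from pandas.api.types import is_list_like


class ToyArray:
    """Hypothetical container backed by a NumPy array (not a full EA)."""

    def __init__(self, values):
        self._data = np.asarray(values)

    def __len__(self):
        # check_array_indexer only needs the length of the indexed array.
        return len(self._data)

    def __getitem__(self, key):
        if is_list_like(key):
            # Validate list-like keys and get back a plain ndarray indexer.
            key = check_array_indexer(self, key)
        result = self._data[key]
        if np.ndim(result) == 0:
            return result  # scalar lookup
        return type(self)(result)


arr = ToyArray([10, 20, 30])
picked = arr[[0, 2]]               # integer list-like
masked = arr[[True, False, True]]  # boolean list-like
scalar = arr[1]                    # not list-like, so not checked
```

The scalar branch is kept separate because only array results need to be wrapped in the container class again.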

TomAugspurger · 2020-01-20T13:13:02Z

pandas/core/indexers.py

+    ----------
+    array : array
+        The array that's being indexed (only used for the length).
+    indexer : array-like


In a few places above, you've done is_list_like, but here we require an array (with a dtype).

Thoughts on what we want? Requiring an array is certainly easier, so that we don't have to infer the types. But users may be passing arbitrary objects to __getitem__.

We actually don't require an array with a dtype. The first thing that this function does is:

    if not is_array_like(indexer):
        indexer = pd.array(indexer)

to deal with eg lists.

So I probably meant to update the docstring to say "list-like" instead of "array-like".

pandas/api/indexers/__init__.py

jreback

i really hate adding extension array only code that is not used in pandas proper at all. this seems like a good candidate to use internally.

pandas/core/arrays/categorical.py

jreback · 2020-01-20T15:07:16Z

pandas/core/indexers.py

@@ -307,3 +312,62 @@ def check_bool_array_indexer(array: AnyArrayLike, mask: AnyArrayLike) -> np.ndar
    if len(result) != len(array):
        raise IndexError(f"Item wrong length {len(result)} instead of {len(array)}.")
    return result
+
+
+def check_array_indexer(array, indexer) -> np.ndarray:


can you type these at all, shouldn't indexer -> key and be Label (or maybe something more sophisticated); not looking to solve this in this PR necessarily

pandas/core/indexers.py

jbrockmendel · 2020-01-21T00:08:34Z

pandas/core/indexers.py

+    checked if there are missing values present, and it is converted to
+    the appropriate numpy array.
+
+    .. versionadded:: 1.0.0


1.0 or 1.1?

1.0 if we're planning to subsume check_bool_array_indexer.

Yes, this is replacing check_bool_array_indexer which is already in 1.0.0, so we should do the replacement also for 1.0.0

jbrockmendel · 2020-01-21T00:09:02Z

pandas/core/indexers.py

+
+    Parameters
+    ----------
+    array : array


can this be made more specific, e.g. "np.ndarray or EA"?

It's only used to get the length, so made it "array-like" (can in principle also be a Series)

jbrockmendel · 2020-01-21T00:10:43Z

pandas/core/indexers.py

+
+    elif is_integer_dtype(dtype):
+        try:
+            indexer = np.asarray(indexer, dtype=int)


does int vs np.int64 vs np.intp matter here? are there failure modes other than the presence of NAs?

this does matter; indexers are intp

Yes, that was on my todo to fix up. Need to figure out the easiest way to convert to numpy array preserving the bit-ness of the dtype (or can we always convert to intp?)

Will update tomorrow

OK, went with np.intp. From a quick test, when you pass non-intp integers to index with numpy, it's not slower to do the conversion to intp yourself beforehand (although while writing this, what happens if you try to index with a too large int64 that doesn't fit into int32 on a 32-bit platform?)

ensure_platform_int is a well established pattern

Do you prefer to update ensure_platform_int to handle extension arrays so I can use it here? (it's basically the same as np.asarray(.., dtype=np.intp), not really sure why the code in ensure_platform_int jumps through more hoops, performance I suppose)

either way - but should be consistent and use only 1 pattern; ensure_platform_int is used extensively already
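To make the dtype question concrete, here is a small NumPy-only sketch of casting an integer indexer to the platform index type before indexing. It assumes the indexer holds no missing values, and the variable names are illustrative only.

```python
import numpy as np

values = np.array([10, 20, 30, 40])

# An integer indexer whose dtype is not the platform's index dtype.
indexer = np.array([3, 0], dtype=np.int16)

# Cast once to intp, the integer type NumPy uses for indexing.
indexer_intp = np.asarray(indexer, dtype=np.intp)

result = values[indexer_intp]

# Indexing with the original dtype selects the same elements.
same = np.array_equal(result, values[indexer])
```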

jreback

this needs some thorough checking

there is a lot added here that seems duplicative or needs more test coverage

jreback · 2020-01-23T04:32:58Z

pandas/core/arrays/categorical.py


        result = self._codes[key]
        if result.ndim > 1:
+            from pandas.core.indexes.base import deprecate_ndim_indexing


can this be imported at the top?

It seems this is possible yes. But I don't really like array code importing code from the Index classes (the dependence should be the other way around). I can maybe also move the deprecate_ndim_indexing helper function to pd.core.indexers (instead of pd.core.indexes) to have this separation cleaner.

hmm yeah I agree with this as it's not specific to index. in fact I would move to pandas.compat.numpy_

pandas/core/arrays/datetimelike.py

jreback · 2020-01-23T04:35:21Z

pandas/core/arrays/sparse/array.py

@@ -768,6 +770,9 @@ def __getitem__(self, key):
                else:
                    key = np.asarray(key)

+            if is_list_like(key):


is this repeated on purpose?

repeated from where?

the next check is_bool_indexer is duplicative

It's not fully duplicative, see my long explanation at #31150 (comment). It's mainly for dealing with object dtype.

pandas/core/indexers.py

jreback · 2020-01-23T04:38:06Z

pandas/core/indexing.py

@@ -2232,7 +2232,7 @@ def check_bool_indexer(index: Index, key) -> np.ndarray:
    else:
        if is_sparse(result):
            result = result.to_dense()
-        result = check_bool_array_indexer(index, result)
+        result = np.asarray(check_array_indexer(index, result), dtype=bool)


is this not already guaranteed in the output?

can you comment here

jorisvandenbossche · 2020-01-23T13:22:32Z

So there are two remaining discussion items I think (also relating to some of the inline comments):

1) Do we want to allow object dtype masks/indexers?

Currently, some indexing routines allow object dtyped masks, and some not:

In [1]: mask = np.array([True, False, True], dtype=object) 

In [2]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])       

In [3]: s[mask]     
Out[3]: 
a    1
c    3
dtype: int64

In [4]: s.index[mask]  
Out[4]: Index(['a', 'c'], dtype='object')

In [5]: pd.Categorical(["a", "b", "c"])[mask]  
...
IndexError: arrays used as indices must be of integer (or boolean) type

In [6]: pd.array([1, 2, 3], dtype="Int64")[mask]       
...
IndexError: arrays used as indices must be of integer (or boolean) type

I assume a good reason that we allowed this was because, as Index object, booleans are always object dtype (we don't have a BooleanIndex).
But in general, I don't really like this support (at least for new functionality like this PR), since I don't see a need to support object dtyped booleans. This support also means that you need to infer for object whether it is boolean or not.

So the question is here: are we fine with the new function being more strict on the dtype?
I ensured that in all cases where it was converted before in pandas 0.25, it is still converted now (eg in DatetimeArray, there is still an is_bool_indexer that does such inference and then ensures the object dtype is converted to a bool dtype at https://github.com/pandas-dev/pandas/pull/31150/files#diff-34bddc5b2c962f58d24f594d80a8b520R523-L521. For categorical, this didn't work in 0.25.0 / before #30308, so I didn't bother supporting object dtype there).
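The inference step described here can be sketched with NumPy alone. The helper looks_boolean below is a hypothetical stand-in, not pandas' is_bool_indexer; it only shows why accepting object-dtype masks forces an element-by-element check.

```python
import numpy as np

values = np.array([1, 2, 3])
mask = np.array([True, False, True], dtype=object)

# NumPy itself refuses object-dtype arrays as indexers.
try:
    values[mask]
    rejected = False
except IndexError:
    rejected = True


def looks_boolean(arr):
    """Hypothetical inference: object array whose elements are all bools."""
    return arr.dtype == object and all(
        isinstance(x, (bool, np.bool_)) for x in arr
    )


# Only after inference can the mask be converted to a real bool dtype.
if looks_boolean(mask):
    mask = mask.astype(bool)

selected = values[mask]
```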

BTW, also integers as object seems to be somewhat inconsistent (here getitem vs iloc):

In [9]: s = pd.Series([1, 2, 3], dtype="Int64", index=['a', 'b', 'c']) 

In [10]: s[np.array([0, 2], dtype=object)]   
...
IndexError: arrays used as indices must be of integer (or boolean) type

In [11]: s.iloc[np.array([0, 2], dtype=object)] 
Out[11]: 
a    1
c    3
dtype: Int64

2) What arguments to accept to check_array_indexer (and potentially pass through)?

Currently, the check_array_indexer in this PR accepts a list-like, ensures it is converted to an array-like (ndarray or EA), and then validates integer and boolean arrays. Other array dtypes are passed through, in the idea that numpy will error on other dtypes anyway (but we could also raise the error already in check_array_indexer if that is preferred)

This means that, for correct use, you need to do something like (if you have an EA that has a _data with a numpy array):

    def __getitem__(self, item):
        if is_integer(item):
            return self._data[item]

        elif is_list_like(item) and not isinstance(item, tuple):
            item = check_array_indexer(self, item)

        return type(self)(self._data[item])

The is_list_like(item) is needed because currently check_array_indexer does not handle eg slices, and the and not isinstance(item, tuple) is needed because otherwise the current check_array_indexer would convert the tuple into an array, which gives a different meaning (a tuple means indexing multiple dimensions, which will typically error for 1D numpy arrays).

So it was already mentioned somewhere above that we could move this is_list_like(item) and not isinstance(item, tuple): check into the check_array_indexer function. But that also means that we don't just accept a list-like in the function, but will need to pass other things through untouched (like integer scalars, None, slice, ellipsis, which all can be valid indexers).
Are we OK with that? It makes the typing of the function less strict (otherwise we could ensure the output is an ndarray) and thus the code less explicit, but it makes the usage a bit easier (no need to repeat the list and tuple check everywhere this is used).
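A NumPy-only sketch of why tuples must not be converted, and of the non-list-like indexers that would have to pass through untouched. Variable names are illustrative.

```python
import numpy as np

data = np.array([10, 20, 30])

# A list is an array-like indexer: it selects positions 0 and 2.
from_list = data[[0, 2]]

# A tuple means "one entry per dimension", so on a 1D array this fails.
try:
    data[(0, 2)]
    tuple_failed = False
except IndexError:
    tuple_failed = True

# Slices, None and Ellipsis are valid indexers that are not list-like.
sliced = data[1:]
with_new_axis = data[None]
everything = data[...]
```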

TomAugspurger · 2020-01-23T13:49:35Z

So the question is here: are we fine with the new function being more strict on the dtype?

Yes, I'm happy with that. We should at least have some functionality that can completely avoid dtype inference.

So it was already mentioned somewhere above that we could move this is_list_like(item) and not isinstance(item, tuple): check into the check_array_indexer function.

I think my preference is to not handle that inside check_array_indexer. I'd like to keep that focused specifically on arrays (lists too I suppose, though that's debatable I think)

If we want to make this even easier for our own / 3rd party EAs, we can add another helper like

def check_indexer(item):
    """
    Check the kind of indexer provided.

    Returns
    -------
    indexer : Any
        Integers will be passed through as is. List-likes will be converted to arrays.
    kind : str
        One of 'integer-scalar', 'integer-array', 'boolean', 'slice'
    """

That would collect all the inference and type checking in a single method. But I think that's less important than getting this array version finished up.

jorisvandenbossche · 2020-01-23T14:00:48Z

If we want to make this even easier for our own / 3rd party EAs, we can add another helper like

But wouldn't such a check_indexer helper function basically look like:

def check_indexer(array, indexer):
    if is_list_like(indexer) and not isinstance(indexer, tuple):
        indexer = check_array_indexer(array, indexer)
    return indexer

We could also decide to replace check_bool_array_indexer with such a check_indexer instead of the check_array_indexer of this PR.

I am also not sure if the "kind" is very important. In practice (when your underlying data are numpy arrays at least), you only care about the distinction between a scalar vs not a scalar (integer/boolean array, slice), as for the second group you need to wrap it again in your array class (and for the first case potentially wrap it in a scalar class).

jreback · 2020-01-24T03:40:48Z

pandas/core/arrays/categorical.py


        result = self._codes[key]
        if result.ndim > 1:
+            from pandas.core.indexes.base import deprecate_ndim_indexing


hmm yeah I agree with this as it's not specific to index. in fact I would move to pandas.compat.numpy_

pandas/core/arrays/datetimelike.py

jreback · 2020-01-24T03:43:23Z

pandas/core/arrays/sparse/array.py

@@ -768,6 +770,9 @@ def __getitem__(self, key):
                else:
                    key = np.asarray(key)

+            if is_list_like(key):


the next check is_bool_indexer is duplicative

TomAugspurger · 2020-01-24T15:56:22Z

But wouldn't such a check_indexer helper function basically look like:

Mmm, fair point. Happy to defer to your judgement here :)

jorisvandenbossche · 2020-01-24T16:00:12Z

We could also have both check_indexer and the more specialized check_array_indexer ?

jorisvandenbossche · 2020-01-24T16:43:55Z

pandas/tests/indexes/categorical/test_category.py

-        # GH#30588 multi-dim indexing is deprecated, but raising is also acceptable
-        idx = self.create_index()
-        with pytest.raises(ValueError, match="cannot mask with array containing NA"):
-            idx[:, None]


This was removed because the base class version (which checks for the deprecation) now passes (since I added the deprecation warning)

TomAugspurger · 2020-01-28T14:05:28Z

Either is fine by me.

If this is exclusively or primarily used in __getitem__, then check_array_indexer should handle that for the user, since the type in __getitem__ is so broad.

If we're using it elsewhere where we know we already have an array, then skipping that check is nice.

In this PR, it looks like __getitem__ is the primary user, so we can go ahead and include the check I think.

jorisvandenbossche · 2020-01-28T15:15:42Z

OK, added a commit moving the check inside. Now it only validates array-likes (list-likes which are not tuples) and passes through everything else.
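From the caller's side, the behaviour after this commit can be sketched as follows. This assumes a pandas version (1.0 or later) in which pandas.api.indexers.check_array_indexer is available; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import check_array_indexer

arr = pd.array([1, 2, 3])

# List-likes are validated and converted to NumPy arrays.
int_indexer = check_array_indexer(arr, [0, 2])
bool_indexer = check_array_indexer(arr, [True, False, True])

# Everything else, including tuples, comes back untouched.
a_slice = slice(0, 2)
passed_slice = check_array_indexer(arr, a_slice)
passed_scalar = check_array_indexer(arr, 1)
passed_tuple = check_array_indexer(arr, (0, 1))

# A boolean mask of the wrong length is rejected.
try:
    check_array_indexer(arr, [True, False])
    wrong_length_rejected = False
except IndexError:
    wrong_length_rejected = True
```

Because non-list-like keys pass straight through, a __getitem__ can call the helper unconditionally and then index the underlying data with the result.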

Maybe remaining question is if we want to rename this to check_indexer instead of check_array_indexer.
I am fine with keeping the name, to have the "array" point to the fact that it's checking (array) input to index arrays (eg input to index a Series or DataFrame would have other checks)

jbrockmendel · 2020-01-28T17:18:06Z

I'm a little late weighing in, but I'd prefer to have the function require as tight an argument type as possible (arraylike maybe?). In trying to optimize our lookups, avoidable type checks are some of the lowest-hanging fruit

jorisvandenbossche · 2020-01-28T17:22:26Z

It's as simple as undoing the last commit, but you will need to find an agreement with @jreback

In the end, I think how it is now (the list-like + tuple check inside the function) makes it easier to use this (otherwise you need that check every time the function is called).
I would also be fine with two functions: check_array_indexer (which is strict) and check_indexer (which then has the additional type check and otherwise calls check_array_indexer), if that keeps things cleaner. See proposal above #31150 (comment)

In trying to optimize our lookups, avoidable type checks are some of the lowest-hanging fruit

Not sure I understand this point?

jreback · 2020-01-28T17:25:50Z

@jorisvandenbossche

actually I was ok with the original

It's as simple as undoing the last commit, but you will need to find an agreement with @jreback

I agree that this should have a tight type check. So it should require an array-like (rather than coercing).

jorisvandenbossche · 2020-01-28T17:31:34Z

actually I was ok with the original

I interpreted "personally I would actually relax this and avoid the need to have is_list_like checks before calling this." differently :)
Now, if you check the last commit (1ca35d1), I think it actually made things simpler in the __getitem__ functions.

I agree that this should have a tight type check. So it should require an array-like (rather than coercing).

No, that's not possible (and that's not what the last commit did). It needs to accept things like lists, as that is a valid array-like indexer.
The "tighter typing" that I mentioned above is about having a return value that is always ndarray, the input always has been list-like (and with the latest commit that changed to Any)

jreback · 2020-01-28T17:36:55Z

actually I was ok with the original

I interpreted "personally I would actually relax this and avoid the need to have is_list_like checks before calling this." differently :)
Now, if you check the last commit (1ca35d1), I think it actually made things simpler in the __getitem__ functions.

I agree that this should have a tight type check. So it should require an array-like (rather than coercing).

No, that's not possible (and that's not what the last commit did). It needs to accept things like lists, as that is a valid array-like indexer.
The "tighter typing" that I mentioned above is about having a return value that is always ndarray, the input always has been list-like (and with the latest commit that changed to Any)

ok i see it. ok then.

jreback · 2020-01-28T17:37:24Z

what about the issue of object array? e.g. does this eliminate the need for is_bool_indexer? or deferring that?

jorisvandenbossche · 2020-01-28T17:40:58Z

what about the issue of object array? e.g. does this eliminate the need for is_bool_indexer? or deferring that?

So the new check_array_indexer doesn't allow object dtype as boolean indexer. Therefore, everywhere we supported this before, is/check_bool_indexer is still being used for backwards compatibility in addition to check_array_indexer.
We should deprecate that some day, but right now with boolean index being object this is still too early I think.
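A hedged sketch of the two-step arrangement described here, using only public pandas functions. The infer_dtype call is a stand-in for the legacy inference path and is not the internal is_bool_indexer; this assumes a pandas version in which check_array_indexer rejects object dtype, as stated above.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import check_array_indexer
from pandas.api.types import infer_dtype

arr = pd.array([1, 2, 3])
mask = np.array([True, False, True], dtype=object)

# The strict helper does not infer: object dtype is rejected outright.
try:
    check_array_indexer(arr, mask)
    strict_rejected = False
except IndexError:
    strict_rejected = True

# Stand-in for the backwards-compatible path: infer the content of the
# object array, convert to a real bool dtype, then validate.
if mask.dtype == object and infer_dtype(mask) == "boolean":
    mask = mask.astype(bool)

validated = check_array_indexer(arr, mask)
```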

jreback · 2020-01-28T17:42:53Z

what about the issue of object array? e.g. does this eliminate the need for is_bool_indexer? or deferring that?

So the new check_array_indexer doesn't allow object dtype as boolean indexer. Therefore, everywhere we supported this before, is/check_bool_indexer is still being used for backwards compatibility in addition to check_array_indexer.
We should deprecate that some day, but right now with boolean index being object this is still too early I think.

ok fair enough, maybe just rename is_bool_indexer then (doesn't have to be in this PR), to is_bool_indexer_for_object_array (too long, but that's the idea)

TomAugspurger · 2020-01-28T22:46:04Z

Fixed the merge conflict.

TomAugspurger · 2020-01-29T12:04:43Z

@jbrockmendel I think your comments in #31150 (comment) can be resolved in some places, but for a general __getitem__ we have to handle the fact that key can have any type / dtype.

Thanks @jorisvandenbossche!

TomAugspurger · 2020-01-29T12:14:46Z

@meeseeksdev backport to 1.0.x

lumberbot-app · 2020-01-29T12:15:23Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

$ git checkout 1.0.x
$ git pull

Cherry pick the first parent branch of this PR on top of the older branch:

$ git cherry-pick -m1 b1214af3528319e31334ff098021c5a1a6d9ffcd

You will likely have some merge/cherry-pick conflicts here, fix them and commit:

$ git commit -am 'Backport PR #31150: API: generalized check_array_indexer for validating array-like getitem indexers'

Push to a named branch:

$ git push YOURFORK 1.0.x:auto-backport-of-pr-31150-on-1.0.x

Create a PR against branch 1.0.x, I would have named this PR:

"Backport PR #31150 on branch 1.0.x"

And apply the correct labels and milestones.

Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

If these instructions are inaccurate, feel free to suggest an improvement.

API: generalized check_array_indexer for validating array-like indexers

e8f539a

jorisvandenbossche added the Indexing and ExtensionArray labels Jan 20, 2020

jorisvandenbossche added this to the 1.0.0 milestone Jan 20, 2020

test boolean message as well

4fa9f5a

TomAugspurger reviewed Jan 20, 2020

fixes for failing tests

b55dfd2

jreback requested changes Jan 20, 2020

jbrockmendel reviewed Jan 21, 2020

jorisvandenbossche added 5 commits January 22, 2020 10:31

Merge remote-tracking branch 'upstream/master' into EA-check-indexer

095b741

remove previous check_bool_array_indexer

58bfe78

don't convert tuples to avoid warning from numpy

5ce8d85

ensure check_bool_indexer returns numpy array

ebc2150

raise warning for categorical

4a51d97

TomAugspurger mentioned this pull request Jan 22, 2020

RLS: 1.0.0 #27492

Closed

jreback requested changes Jan 23, 2020

jreback requested changes Jan 24, 2020

jorisvandenbossche added 4 commits January 24, 2020 17:09

Merge remote-tracking branch 'upstream/master' into EA-check-indexer

50490aa

move deprecate_ndim_indexing

c979df8

cleanup; ensure output of check_array_indexer is always an ndarray

ce2e042

clean-up black reformatting

4d447bf

jorisvandenbossche commented Jan 24, 2020

allow list-length-1-with-slice corner case

3c5e4c6

move list-like check inside

1ca35d1

TomAugspurger approved these changes Jan 28, 2020

jreback approved these changes Jan 28, 2020

Merge remote-tracking branch 'upstream/master' into EA-check-indexer

e5ea9b4

TomAugspurger merged commit b1214af into pandas-dev:master Jan 29, 2020

lumberbot-app bot added the Still Needs Manual Backport label Jan 29, 2020

jorisvandenbossche deleted the EA-check-indexer branch January 29, 2020 13:46

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Jan 29, 2020

API: generalized check_array_indexer for validating array-like getite…

22f9d9f

…m indexers (pandas-dev#31150)

jorisvandenbossche mentioned this pull request Jan 29, 2020

Backport PR #31150: API: generalized check_array_indexer for validating array-like getitem indexers #31419

Merged

jorisvandenbossche removed the Still Needs Manual Backport label Jan 29, 2020

jorisvandenbossche mentioned this pull request Jan 29, 2020

Update GeometryArray.__getitem__ for pandas 1.0 changes geopandas/geopandas#1272

Merged

jorisvandenbossche added a commit that referenced this pull request Jan 29, 2020

API: generalized check_array_indexer for validating array-like getite…

d633915

…m indexers (#31150) (#31419)

jorisvandenbossche mentioned this pull request Jan 30, 2020

REGR: Array.__setitem__ failing with nullable boolean mask #31446

Closed

ShaharNaveh mentioned this pull request Feb 29, 2020

API: DatetimeIndex / TimedeltaIndex 2d slicing should result in Index #10774

Closed

rohitkg98 mentioned this pull request May 2, 2020

Performance regression in DataFrame[bool_indexer] #33924

Closed
