-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: generalized check_array_indexer for validating array-like getitem indexers #31150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: generalized check_array_indexer for validating array-like getitem indexers #31150
Conversation
In general, the
In principle, we could also move the |
pandas/core/indexers.py
Outdated
---------- | ||
array : array | ||
The array that's being indexed (only used for the length). | ||
indexer : array-like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a few places above, you've done is_list_like
, but here we require an array (with a dtype).
Thoughts on what we want? Requiring an array is certainly easier, so that we don't have to infer the types. But users may be passing arbitrary objects to __getitem__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually don't require an array with a dtype. The first thing that this function does is:
if not is_array_like(indexer):
indexer = pd.array(indexer)
to deal with eg lists.
So I probably meant to update the array into "list-like" instead of "array-like"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i really hate adding extension array only code that is not used in pandas proper at all. this seems like a good candidate to use internally.
pandas/core/indexers.py
Outdated
@@ -307,3 +312,62 @@ def check_bool_array_indexer(array: AnyArrayLike, mask: AnyArrayLike) -> np.ndar | |||
if len(result) != len(array): | |||
raise IndexError(f"Item wrong length {len(result)} instead of {len(array)}.") | |||
return result | |||
|
|||
|
|||
def check_array_indexer(array, indexer) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type these at all, shouldn't indexer -> key and be Label (or maybe something more sophisticated); not looking to solve this in this PR necessarily
pandas/core/indexers.py
Outdated
checked if there are missing values present, and it is converted to | ||
the appropriate numpy array. | ||
|
||
.. versionadded:: 1.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.0 or 1.1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.0 if we're planning to subsume check_bool_array_indexer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.0 if we're planning to subsume check_bool_array_indexer.
Yes, this is replacing check_bool_array_indexer
which is already in 1.0.0, so we should do the replacement also for 1.0.0
pandas/core/indexers.py
Outdated
|
||
Parameters | ||
---------- | ||
array : array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be made more specific, e.g. "np.ndarray or EA"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's only used to get the length, so made it "array-like" (can in principle also be a Series)
pandas/core/indexers.py
Outdated
|
||
elif is_integer_dtype(dtype): | ||
try: | ||
indexer = np.asarray(indexer, dtype=int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does int vs np.int64 vs np.intp matter here? are there failure modes other than the presence of NAs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this does matter; indexers are intp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was on my todo to fix up. Need to figure out the easiest way to convert to numpy array preserving the bit-ness of the dtype (or can we always convert to intp?)
Will update tomorrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, went with np.intp. From a quick test, when you pass non-intp integers to index with numpy, it's not slower to do the conversion to intp yourself beforehand (although while writing this, what happens if you try to index with a too large int64 that doesn't fit into int32 on a 32-bit platform?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ensure_platform_int is a well established pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you prefer to update ensure_platform_int
to handle extension arrays so I can use it here? (it's basically the same as np.asarray(.., dtype=np.intp)
, not really sure why the code in ensure_platform_int
takes more hoops, performance I suppose)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either way - but should be consistent and use only 1 pattern; ensure_platform_int is used extensively already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs some thorough checking
there is a lot added here that seems duplicative or needs more test coverage
pandas/core/arrays/categorical.py
Outdated
|
||
result = self._codes[key] | ||
if result.ndim > 1: | ||
from pandas.core.indexes.base import deprecate_ndim_indexing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be too imported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this is possible yes. But I don't really like array code importing code from the Index classes (the dependence should be the other way around). I can maybe also move the deprecate_ndim_indexing
helper function to pd.core.indexers
(instead of pd.core.indexes
) to have this separation cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm yeah I agree with this as its not specific to index. in factor I would move to pandas.compat.numpy_
pandas/core/arrays/sparse/array.py
Outdated
@@ -768,6 +770,9 @@ def __getitem__(self, key): | |||
else: | |||
key = np.asarray(key) | |||
|
|||
if is_list_like(key): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this repeated non purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
repeated from where?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the next check is_bool_indexer is duplicative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not fully duplicative, see my long explanation at #31150 (comment). It's mainly for dealing with object dtype.
pandas/core/indexing.py
Outdated
@@ -2232,7 +2232,7 @@ def check_bool_indexer(index: Index, key) -> np.ndarray: | |||
else: | |||
if is_sparse(result): | |||
result = result.to_dense() | |||
result = check_bool_array_indexer(index, result) | |||
result = np.asarray(check_array_indexer(index, result), dtype=bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this not already guaranteed in the output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you comment here
So there are two remaining discussion items I think (also relating to some of the inline comments): 1) Do we want to allow object dtype masks/indexers? Currently, some indexing routines allow object dtyped masks, and some not:
I assume a good reason that we allowed this was because, as Index object, booleans are always object dtype (we don't have a BooleanIndex). So the question is here: are we fine with the new function being more strict on the dtype? BTW, also integers as object seems to be somewhat inconsistent (here getitem vs iloc):
2) What arguments to accept to Currently, the This means that, for correct use, you need to do something like (if you have an EA that has a
The So it was already mentioned somewhere above that we could move this |
Yes, I'm happy with that. We should at least have some functionality that can completely avoid dtype inference.
I think my preference is to not handle that inside If we want to make this even easier for our own / 3rd party EAs, we can add another helper like def check_indexer(item):
"""
Check the kind of indexer provided.
Returns
--------
indexer : Any
Integers will be passed through as is. list-likes will be converted to arrays
kind : str
One of 'integer-scalar', 'integer-array', 'boolean', 'slice' That would collect all the inference and type checking in a single method. But I think that's less important than getting this array version finished up. |
But wouldn't such a
We could also decide to replace I am also not sure if the "kind" is very important. In practice (when your underlying data are numpy arrays at least), you only care about the distinction between a scalar vs not a scalar (integer/boolean array, slice, ), as for the second group you need to wrap it again in your array class (and for the first case potentially wrap it in a scalar class). |
pandas/core/arrays/categorical.py
Outdated
|
||
result = self._codes[key] | ||
if result.ndim > 1: | ||
from pandas.core.indexes.base import deprecate_ndim_indexing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm yeah I agree with this as its not specific to index. in factor I would move to pandas.compat.numpy_
pandas/core/arrays/sparse/array.py
Outdated
@@ -768,6 +770,9 @@ def __getitem__(self, key): | |||
else: | |||
key = np.asarray(key) | |||
|
|||
if is_list_like(key): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the next check is_bool_indexer is duplicative
Mmm, fair point. Happy to defer to your judgement here :) |
We could also have both |
# GH#30588 multi-dim indexing is deprecated, but raising is also acceptable | ||
idx = self.create_index() | ||
with pytest.raises(ValueError, match="cannot mask with array containing NA"): | ||
idx[:, None] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was removed because the base class version (which checks for the deprecation) now passes (since I added the deprecation warning)
Either is fine by me. If this is exclusively or primarily used in If we're using it elsewhere where we know we already have an array, then skipping that check is nice. In this PR, it looks like |
OK, added a commit moving the check inside. Now it only validates array-likes (list-likes which are not tuples) and passes through everything else. Maybe remaining question is if we want to rename this to |
I'm a little late weighing in, but I'd prefer to have the function require as tight an argument type as possible (arraylike maybe?). In trying to optimize our lookups, avoidable type checks are some of the lowest-hanging fruit |
It's as simple as undoing the last commit, but you will need to find an agreement with @jreback In the end, I think how it is now (the list-like + tuple check inside the function) makes it easier to use this (otherwise you need that check every time the function is called).
Not sure I understand this point? |
actually I was ok with the original
I agree that this should have a tight type check. So it should require a array-like (rather than coercing). |
I interpreted "personally I would actually relax this and avoid the need to have is_list_like checks before calling this." differently :)
No, that's not possible (and that's not what the last commit did). It needs to accept things like lists, as that is a valid array-like indexer. |
ok i see it. ok then. |
what about the issue of object array? e.g. does this eliminate the need for is_bool_indexer? or deferring that? |
So the new |
ok fair enough, maybe just rename is_bool_indexer then (doesn't have to be in this PR), to is_bool_indexer_for_object_array (too long, but that's the idea) |
Fixed the merge conflict. |
@jbrockmendel I think your comments in #31150 (comment) can be resolved in some places, but for a general Thanks @jorisvandenbossche! |
@meeseeksdev backport to 1.0.x |
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulation you did some good work ! Hopefully your backport PR will be tested by the continuous integration and merged soon! If these instruction are inaccurate, feel free to suggest an improvement. |
Closes #30738. Also fixes the performance issue for other arrays from #30744, and related to #30308 (comment)
This generalizes the
check_bool_array_indexer
helper method that we added for 1.0.0 to not be specific to boolean arrays, but any array-like input, and ensures that the output is a proper numpy array that can be used to index into numpy arrays.I think such a more general "check" is useful, to avoid that all (external+internal) EAs need to do what the test EAs are already doing (checking for integer arrays as well) at https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/decimal/array.py#L118-L126, and also to fix the performance issue in a general way (as was now only done for Categorical in https://github.com/pandas-dev/pandas/pull/30747/files)
If we agree on the general idea, I still need to clean up this PR (eg remove the existing
check_bool_array_indexer
, update the extending docs, etc)cc @TomAugspurger