-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
String dtype: fix isin() values handling for python storage #59759
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String dtype: fix isin() values handling for python storage #59759
Conversation
if not lib.is_string_array(np.asarray(values), skipna=True): | ||
values = np.array( | ||
[val for val in values if isinstance(val, str) or isna(val)], | ||
dtype=object, | ||
) | ||
if not len(values): | ||
return np.zeros(self.shape, dtype=bool) | ||
|
||
values = self._from_sequence(values, dtype=self.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ensures that we convert the input values
to a StringArray first, to make use of the construction validation (coercing to one missing value sentinel).
But, this is only done if we actually get strings passed, and if not, we filter out non-strings. This is also what is done in the ArrowStringArray
implementation, and this is to ensure we accept something like isin(["string", 10])
with mixed-type values (I do wonder if we should deprecate that in general, but that's for a later, separate discussion).
@@ -731,6 +733,22 @@ def _putmask(self, mask: npt.NDArray[np.bool_], value) -> None: | |||
# base class implementation that uses __setitem__ | |||
ExtensionArray._putmask(self, mask, value) | |||
|
|||
def isin(self, values: ArrayLike) -> npt.NDArray[np.bool_]: | |||
if not isinstance(values, BaseStringArray): | |||
if not lib.is_string_array(np.asarray(values), skipna=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth short-circuiting for the case where values
is all NA
and self._hasna
is False
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is it worth short-circuiting this check when the values.dtype
is ArrowDtype(pa.string())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth short-circuiting for the case where
values
is allNA
andself._hasna
isFalse
?
I would say that is probably not worth the added complexity for something that is just speeding up an uncommon corner case? (also, checking self._hasna
has a cost)
Also, is it worth short-circuiting this check when the
values.dtype
isArrowDtype(pa.string())
?
I wanted to do that, but then I would need to import ArrowDtype and pyarrow inline (because this file is for the object string dtype, that works without pyarrow), so at that point I was again unsure if we should add a special case here
(I think a typical use case is passing a list, and that will at the moment never be coerced to ArrowDtype before it gets here)
Now, what we should maybe do to essentially speed up this case is to ensure lib.is_string_array
shortcuts on pyarrow strings and directly returns True, and ensure that StringArray._from_sequence
shortcuts on pyarrow strings to ensure the most efficient conversion (if it doesn't do that already).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I can probably check for ArrowDtype(pa.string())
without importing it by comparing for equality with "string[pyarrow]"
Thanks @jorisvandenbossche |
…ev#59759) * String dtype: fix isin() values handling for python storage * address feedback
* String dtype: fix isin() values handling for python storage * address feedback
This PR adds a
StringArray.isin()
implementation (ArrowStringArray
already has a customisin()
), so we can add custom processing of the passedvalues
set compared to the base class implementation.See also the discussion at #58451 (comment) about the expected behaviour.
xref #54792