-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
String dtype: fix isin() values handling for python storage #59759
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -46,6 +46,7 @@ | |
nanops, | ||
ops, | ||
) | ||
from pandas.core.algorithms import isin | ||
from pandas.core.array_algos import masked_reductions | ||
from pandas.core.arrays.base import ExtensionArray | ||
from pandas.core.arrays.floating import ( | ||
|
@@ -65,6 +66,7 @@ | |
import pyarrow | ||
|
||
from pandas._typing import ( | ||
ArrayLike, | ||
AxisInt, | ||
Dtype, | ||
DtypeObj, | ||
|
@@ -735,6 +737,24 @@ def _putmask(self, mask: npt.NDArray[np.bool_], value) -> None: | |
# base class implementation that uses __setitem__ | ||
ExtensionArray._putmask(self, mask, value) | ||
|
||
def isin(self, values: ArrayLike) -> npt.NDArray[np.bool_]: | ||
if isinstance(values, BaseStringArray) or ( | ||
isinstance(values, ExtensionArray) and is_string_dtype(values.dtype) | ||
): | ||
values = values.astype(self.dtype, copy=False) | ||
else: | ||
if not lib.is_string_array(np.asarray(values), skipna=True): | ||
values = np.array( | ||
[val for val in values if isinstance(val, str) or isna(val)], | ||
dtype=object, | ||
) | ||
if not len(values): | ||
return np.zeros(self.shape, dtype=bool) | ||
|
||
values = self._from_sequence(values, dtype=self.dtype) | ||
Comment on lines
+746
to
+754
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This ensures that we convert the input |
||
|
||
return isin(np.asarray(self), np.asarray(values)) | ||
|
||
def astype(self, dtype, copy: bool = True): | ||
dtype = pandas_dtype(dtype) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth short-circuiting for the case where
values
is allNA
andself._hasna
isFalse
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is it worth short-circuiting this check when the
values.dtype
isArrowDtype(pa.string())
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that is probably not worth the added complexity for something that is just speeding up an uncommon corner case? (also, checking
self._hasna
has a cost)I wanted to do that, but then I would need to import ArrowDtype and pyarrow inline (because this file is for the object string dtype, that works without pyarrow), so at that point I was again unsure if we should add a special case here
(I think a typical use case is passing a list, and that will at the moment never be coerced to ArrowDtype before it gets here)
Now, what we should maybe do to essentially speed up this case is to ensure
lib.is_string_array
shortcuts on pyarrow strings and directly returns True, and ensure thatStringArray._from_sequence
shortcuts on pyarrow strings to ensure the most efficient conversion (if it doesn't do that already).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I can probably check for
ArrowDtype(pa.string())
without importing it by comparing for equality with"string[pyarrow]"