-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: Dispatch string methods to ExtensionArray #36357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: Dispatch string methods to ExtensionArray #36357
Conversation
Making sure I understand: core.strings.accessor is mostly a cut/paste of the existing core.strings? strings.object_array corresponds to the ndarray[object]-backed StringArray we use now? Is strings.categorical for a Categorical backed by a StringArray or a StringArray backed by a Categorical? Something else? Is the accessor central to what you're doing here, or could in principle be separate? I expect to have an opinion about this, but am not there yet. |
Not quite. Anywhere you see things like def method(self, ...):
result = self._array._str.method(...)
return self._wrap_result(result) that The core methods
It's for a supporting
I assume you mean the |
In the past we've been very conservative about adding things to EA. I'd find this easier to think about if that could be considered orthogonally. |
Just to clarify, I wouldn't consider this isn't part of the official EA interface (I'm not adding it to the docs, for example). If someone comes along with another EA for storing string arrays and want access to the pandas' We just need some way to translate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! This will nicely allow to start overriding a subset of the methods with arrow implementations where available for the ArrowStringArray.
I suppose one possible alternative to the "proxy" class holding the methods that is added as _str
on the ExtensionArrays would be to add those methods directly on the EA. That's one layer of indirection less, but on the other hand also adds a whole bunch of methods to those classes of course (but it's similar to eg DatetimeArray holding the .dt methods/attributes, and the Methods classes could be turned into a mixin)
So I think our two options are either
class BaseStringArrayMethods(abc.ABC):
def __init__(self, array: ExtensionArray):
self._array = array
class ObjectProxy(ExtensionArray):
_str = CachedAccessor("str", ObjectArrayMethods)
class StringMethodsMixin:
def _str_isnumber(self):
...
def _str_repeat(self, n):
...
class StringArray(ExtensionArray, StringMethodsMixin):
... Sounds like people are roughly prefer the second version? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you do a precursor PR which just moves things. I have no idea what really changed here.
Not easily, since it's more of a split than a move. Right now the
This PR moves the |
@ivanovmg your suggestions to change some of the implementations are fine, but probably not appropriate for this PR. Right now I'm trying to limit the changes to what's necessary for allowing EAs to controls the implementation, so not changing any of the implementations. Those would be good as followup work. |
pandas/core/arrays/categorical.py
Outdated
@@ -2316,6 +2316,9 @@ def replace(self, to_replace, value, inplace: bool = False): | |||
# ------------------------------------------------------------------------ | |||
# String methods interface | |||
def _str_map(self, f, na_value=np.nan, dtype=np.dtype(object)): | |||
# Optimization to apply the callable `f` to the categories once | |||
# and rebuild the result by `take`ing from the result with the codes. | |||
# Returns the same type as the object-dtype impelmentation though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Returns the same type as the object-dtype impelmentation though. | |
# Returns the same type as the object-dtype implementation though. |
pandas/core/strings/__init__.py
Outdated
Most methods on the StringMethods accessor follow the pattern: | ||
|
||
1. extract the array from the series (or index) | ||
2. Call that array's impelmentation of the string method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. Call that array's impelmentation of the string method | |
2. Call that array's implementation of the string method |
@TomAugspurger needs a merge of master Otherwise remaining comments? |
Fixed the merge conflicts. This should be good on my end. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can pandas/core/strings.py be deleted?
pandas/core/arrays/string_.py
Outdated
na_value = self.dtype.na_value | ||
|
||
mask = isna(self) | ||
arr = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this needed?
Yeah, apparently it was restored while I was fixing merge conflicts. |
flags: int = 0, | ||
na: Scalar = np.nan, | ||
): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could # pragma: no cover
all of these
Just the ARM timeout. Anything else here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. some comments to create issues for followon
these are prefixed with `_str_`. So ``Series.str.upper()`` calls | ||
``Series.array._str_upper()``. The interface isn't currently public | ||
to other string extension arrays. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
much more clear, thanks
return None | ||
|
||
|
||
def _str_extract_noexpand(arr, pat, flags=0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok fair enough, can you create an issue to separate these other methods (maybe with check boxes) & align witht he strutcure)
@@ -728,10 +728,6 @@ def test_count(self): | |||
["foo", "foofoo", np.nan, "foooofooofommmfoo"], dtype=np.object_ | |||
) | |||
|
|||
result = strings.str_count(values, "f[o]+") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also create an issue to either make pandas/tests/strings/test_object.py or add a fixture to support th emultiple implementations (maybe better)
thanks @TomAugspurger really nice. sorry for the long review time :-> |
Thanks, no worries. This wasn't especially easy to review since it was a
large diff that didn't just move stuff around.
…On Wed, Sep 30, 2020 at 7:22 AM Jeff Reback ***@***.***> wrote:
thanks @TomAugspurger <https://github.com/TomAugspurger> really nice.
sorry for the long review time :->
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#36357 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOIUPLTPQTZCQFFTSW5LSIMPKFANCNFSM4RLS5ONA>
.
|
FYI, I used this PR in |
@xhochy Cool! If fletcher is using it then we should probably add it to the public API. Probably just importing it in |
Unless we only want fletcher to use it for now .. (which still keeps it easier to change the implementation, given the good ties with the fletcher developers?) |
I'm going to use it a bit more and see if I run into troubles while overloading it with the Arrow methods. If there are issues, we can easier to breaking changes as long as it is a private API. |
This is a large refactor of our string methods. The goal is to have the
Series.str
accessor dispatch the actual compute down to the ExtensionArray. The motivation is to allow a Series backed by Arrow memory to use the Arrow string kernel (e.g.Series.str.lower()
on an Arrow-backed StringArray should eventually callpyarrow.compute.utf8_lower()
.)To facilitate this, I made the following changes
core/strings.py
into a sub package:core/strings/accessor.py
: Implements theSeries.str
accessor, which (mostly) just delegates to the EA and wrapscore/strings/object_array.py
: Implements the string methods for object-type ndarray.core/strings/categorical.py
,core/strings/string_array.py
, implements categorical & StringArray-specific methodsExtensionArray._str
extension point. This is where EAs get to take over and use their computeNote that there are a few methods like
cat
,extract
, andextractall
that don't yet dispatch. I need to look into them a bit more, there implementation is a bit confusing.Closes #36216