Skip to content

REF: Dispatch string methods to ExtensionArray #36357

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 35 commits into from
Sep 30, 2020

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Sep 14, 2020

This is a large refactor of our string methods. The goal is to have the Series.str accessor dispatch the actual compute down to the ExtensionArray. The motivation is to allow a Series backed by Arrow memory to use the Arrow string kernel (e.g. Series.str.lower() on an Arrow-backed StringArray should eventually call pyarrow.compute.utf8_lower().)

To facilitate this, I made the following changes

  1. Split core/strings.py into a sub package:
  • core/strings/accessor.py: Implements the Series.str accessor, which (mostly) just delegates to the EA and wraps
  • core/strings/object_array.py: Implements the string methods for object-type ndarray.
  • core/strings/categorical.py, core/strings/string_array.py, implements categorical & StringArray-specific methods
  1. Defines a new ExtensionArray._str extension point. This is where EAs get to take over and use their compute

Note that there are a few methods like cat, extract, and extractall that don't yet dispatch. I need to look into them a bit more, there implementation is a bit confusing.

Closes #36216

@TomAugspurger TomAugspurger added ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data labels Sep 14, 2020
@jbrockmendel
Copy link
Member

Making sure I understand: core.strings.accessor is mostly a cut/paste of the existing core.strings? strings.object_array corresponds to the ndarray[object]-backed StringArray we use now?

Is strings.categorical for a Categorical backed by a StringArray or a StringArray backed by a Categorical? Something else?

Is the accessor central to what you're doing here, or could in principle be separate? I expect to have an opinion about this, but am not there yet.

@TomAugspurger
Copy link
Contributor Author

core.strings.accessor is mostly a cut/paste of the existing core.strings

Not quite. Anywhere you see things like

def method(self, ...):
    result = self._array._str.method(...)
    return self._wrap_result(result)

that result = self._array._str.method was previously something like map(self._array, func), or something like return str_method.

The core methods map_object and map_na which were previously in strings.py were moved to ObjectArrayMethods._map and StringArrayMethods._map.

Is strings.categorical for a Categorical backed by a StringArray or a StringArray backed by a Categorical? Something else?

It's for a supporting Series[Categorical].str and CategoricalIndex.str. Which I think is just a Categorical containing strings (object-dtype index for the categories).

Is the accessor central to what you're doing here, or could in principle be separate?

I assume you mean the _str accessor on ExtensionArrays, since that's what's new? It's central in the sense that it and inheritance is the mechanism I've used to dispatch. We could do it other ways.

@jbrockmendel
Copy link
Member

I assume you mean the _str accessor on ExtensionArrays, since that's what's new? It's central in the sense that it and inheritance is the mechanism I've used to dispatch. We could do it other ways.

In the past we've been very conservative about adding things to EA. I'd find this easier to think about if that could be considered orthogonally.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Sep 17, 2020

In the past we've been very conservative about adding things to EA. I'd find this easier to think about if that could be considered orthogonally.

Just to clarify, I wouldn't consider this isn't part of the official EA interface (I'm not adding it to the docs, for example). If someone comes along with another EA for storing string arrays and want access to the pandas' .str namespace, then I'm happy to discuss adding it to the interface. Right now, we just have the 3-4ish internal users in pandas (object-dtype, string dtype, categorical, and soon ArrowStringArray).

We just need some way to translate Series.str.<method> to "call the right for a Series backed by this extension array". If you feel strongly about not adding this private attribute to StringArray and Categorical, then we can make a proxy similar to the ObjectProxy defined in object_array.py. It's just an unnecessary layer for the other cases, since we already have an extension array.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! This will nicely allow to start overriding a subset of the methods with arrow implementations where available for the ArrowStringArray.

I suppose one possible alternative to the "proxy" class holding the methods that is added as _str on the ExtensionArrays would be to add those methods directly on the EA. That's one layer of indirection less, but on the other hand also adds a whole bunch of methods to those classes of course (but it's similar to eg DatetimeArray holding the .dt methods/attributes, and the Methods classes could be turned into a mixin)

@TomAugspurger
Copy link
Contributor Author

So I think our two options are either

  1. Attach an accessor to string EAs
class BaseStringArrayMethods(abc.ABC):
     def __init__(self, array: ExtensionArray):
        self._array = array

class ObjectProxy(ExtensionArray):
   _str = CachedAccessor("str", ObjectArrayMethods)
  1. A mixin, where we prefex each method with some string (to disambiguate from regular methods of the same name).
class StringMethodsMixin:
    def _str_isnumber(self):
        ...

    def _str_repeat(self, n):
        ...

class StringArray(ExtensionArray, StringMethodsMixin):
    ...

Sounds like people are roughly prefer the second version?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do a precursor PR which just moves things. I have no idea what really changed here.

@TomAugspurger
Copy link
Contributor Author

Not easily, since it's more of a split than a move. Right now the pandas/core/strings.py contains

  1. The Series.str accessor (the class StringMethods and its function signatures)
  2. The implementations of these methods on arrays (the bodies of StringMethod's methods, all the top-level str_<method> methods, na_map, map_object, map_na.

This PR moves the StringMethods accessor to pandas/core/strings/accessor.py and the implementations to pandas/core/strings/{base,object_array,string_array,categorical}.py

@TomAugspurger
Copy link
Contributor Author

@ivanovmg your suggestions to change some of the implementations are fine, but probably not appropriate for this PR. Right now I'm trying to limit the changes to what's necessary for allowing EAs to controls the implementation, so not changing any of the implementations. Those would be good as followup work.

@@ -2316,6 +2316,9 @@ def replace(self, to_replace, value, inplace: bool = False):
# ------------------------------------------------------------------------
# String methods interface
def _str_map(self, f, na_value=np.nan, dtype=np.dtype(object)):
# Optimization to apply the callable `f` to the categories once
# and rebuild the result by `take`ing from the result with the codes.
# Returns the same type as the object-dtype impelmentation though.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Returns the same type as the object-dtype impelmentation though.
# Returns the same type as the object-dtype implementation though.

Most methods on the StringMethods accessor follow the pattern:

1. extract the array from the series (or index)
2. Call that array's impelmentation of the string method
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. Call that array's impelmentation of the string method
2. Call that array's implementation of the string method

@jorisvandenbossche
Copy link
Member

@TomAugspurger needs a merge of master

Otherwise remaining comments?

@TomAugspurger
Copy link
Contributor Author

Fixed the merge conflicts. This should be good on my end.

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can pandas/core/strings.py be deleted?

na_value = self.dtype.na_value

mask = isna(self)
arr = self
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed?

@TomAugspurger
Copy link
Contributor Author

can pandas/core/strings.py be deleted?

Yeah, apparently it was restored while I was fixing merge conflicts.

flags: int = 0,
na: Scalar = np.nan,
):
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could # pragma: no cover all of these

@TomAugspurger
Copy link
Contributor Author

Just the ARM timeout. Anything else here?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. some comments to create issues for followon

these are prefixed with `_str_`. So ``Series.str.upper()`` calls
``Series.array._str_upper()``. The interface isn't currently public
to other string extension arrays.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much more clear, thanks

return None


def _str_extract_noexpand(arr, pat, flags=0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok fair enough, can you create an issue to separate these other methods (maybe with check boxes) & align witht he strutcure)

@@ -728,10 +728,6 @@ def test_count(self):
["foo", "foofoo", np.nan, "foooofooofommmfoo"], dtype=np.object_
)

result = strings.str_count(values, "f[o]+")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also create an issue to either make pandas/tests/strings/test_object.py or add a fixture to support th emultiple implementations (maybe better)

@jreback jreback added this to the 1.2 milestone Sep 30, 2020
@jreback jreback merged commit fbd13a9 into pandas-dev:master Sep 30, 2020
@jreback
Copy link
Contributor

jreback commented Sep 30, 2020

thanks @TomAugspurger really nice. sorry for the long review time :->

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Sep 30, 2020 via email

@TomAugspurger TomAugspurger deleted the dispatch-string-methods branch October 14, 2020 14:42
@xhochy
Copy link
Contributor

xhochy commented Oct 15, 2020

FYI, I used this PR in fletcher xhochy/fletcher#196 and it works quite nicely. Will now use that to add string algorithms that are now in Arrow bit-by-bit.

@TomAugspurger
Copy link
Contributor Author

@xhochy Cool! If fletcher is using it then we should probably add it to the public API. Probably just importing it in pandas.api.extensions.

@jorisvandenbossche
Copy link
Member

Unless we only want fletcher to use it for now .. (which still keeps it easier to change the implementation, given the good ties with the fletcher developers?)

@xhochy
Copy link
Contributor

xhochy commented Oct 15, 2020

I'm going to use it a bit more and see if I run into troubles while overloading it with the Arrow methods. If there are issues, we can easier to breaking changes as long as it is a private API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor StringMethods for extension arrays
7 participants