Skip to content

ENH: [ArrowStringArray] Enable the string methods for the arrow-backed StringArray #40708

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Apr 15, 2021

Conversation

simonjayhawkins
Copy link
Member

from #35169 (comment)

Enable the string methods for the arrow-backed StringArray. This might also need some additional changes in the string accessor code (eg to dispatch extract to the underlying array as well)

@simonjayhawkins simonjayhawkins added Enhancement Strings String extension data type and string data labels Mar 31, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3 milestone Mar 31, 2021
@@ -1110,6 +1110,7 @@ _TYPE_MAP = {
"complex128": "complex",
"c": "complex",
"string": "string",
"arrow_string": "string",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR as draft. This will be resolved in a follow-up to #40679. #40679 (comment)

the inference is needed so that the str accessor works.

@@ -43,6 +47,7 @@
check_array_indexer,
validate_indices,
)
from pandas.core.strings.object_array import ObjectStringArrayMixin
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the goal is to inherit from BaseStringArrayMethods. (I recall this being mentioned somewhere). For now use ObjectStringArrayMixin similar to fletcher xhochy/fletcher#196

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because you want to have the object-dtype based methods as fallback for the ones that pyarrow doesn't yet support, I suppose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep. This PR just gets the string methods working for the existing tests we have StringArray. So just converting to object and not using native pyarrow functions yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a small comment about that (eg below where the class is created) about why ObjectStringArrayMixin is mixed in?


_str_na_value = ArrowStringDtype.na_value

def _str_map(self, f, na_value=None, dtype: Dtype | None = None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moreless cut and paste from StringArray.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be shared? (eg move it a common helper function or mixin?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed. I think the de-duplication is better as an immediate follow-up to keep changes to existing code paths (i.e. StringArray) in a separate PR and keep this one scoped to just additions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either is fine for me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eg move it a common helper function or mixin

or have a common base class. #35169 (comment)

so better as a follow-on to allow for more discussion

@@ -316,7 +317,7 @@ def cons_row(x):
# This is a mess.
dtype: Optional[str]
if self._is_string and returns_string:
dtype = "string"
dtype = self._orig.dtype
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

special case here for partition/rpartition/split where the array method returns an object array

if nullable_string_dtype == "arrow_string" and method_name in {
"extract",
"extractall",
}:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could either special case, refactor to dispatch to array or leave as follow-up and xfail for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the xfail is fine for now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. keeps the scope of this PR limited.

@@ -73,33 +81,38 @@ def test_string_array_numeric_integer_array(method, expected):
("isdigit", [False, None, True]),
("isalpha", [True, None, False]),
("isalnum", [True, None, True]),
("isdigit", [False, None, True]),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another dup

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe can add "isnumeric" instead, which doesn't seem to be tested

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. "isnumeric" and other "is" functions are tested to some degree by _any_string_method fixture used in test_string_array

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me, a few comments

@@ -43,6 +47,7 @@
check_array_indexer,
validate_indices,
)
from pandas.core.strings.object_array import ObjectStringArrayMixin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because you want to have the object-dtype based methods as fallback for the ones that pyarrow doesn't yet support, I suppose?


_str_na_value = ArrowStringDtype.na_value

def _str_map(self, f, na_value=None, dtype: Dtype | None = None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be shared? (eg move it a common helper function or mixin?)

if nullable_string_dtype == "arrow_string" and method_name in {
"extract",
"extractall",
}:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the xfail is fine for now

@@ -73,33 +81,38 @@ def test_string_array_numeric_integer_array(method, expected):
("isdigit", [False, None, True]),
("isalpha", [True, None, False]),
("isalnum", [True, None, True]),
("isdigit", [False, None, True]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe can add "isnumeric" instead, which doesn't seem to be tested

@jorisvandenbossche
Copy link
Member

@simonjayhawkins is this still draft? (it seems ready to me)

@simonjayhawkins
Copy link
Member Author

this one goes in after #40725, #40708 (comment), over 80% complete on the benchmarks.

@jorisvandenbossche
Copy link
Member

To be able to move forward with the different PRs, I think it is perfectly fine to merge this with the "arrow_string": "string", in infer_dtype (which is the name of the dtype on current master), and then this can be updated when the name gets changed or in #40725

@jorisvandenbossche
Copy link
Member

@simonjayhawkins this is ready for review/merge now?

@simonjayhawkins
Copy link
Member Author

hopefully, will mark as ready for review once ci is green.

@simonjayhawkins simonjayhawkins marked this pull request as ready for review April 9, 2021 12:22
@simonjayhawkins
Copy link
Member Author

@jorisvandenbossche green

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do one more merge of master, and then can merge on green?

You are planning to do a follow-up then to reduce the duplication?

@@ -43,6 +47,7 @@
check_array_indexer,
validate_indices,
)
from pandas.core.strings.object_array import ObjectStringArrayMixin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a small comment about that (eg below where the class is created) about why ObjectStringArrayMixin is mixed in?

@simonjayhawkins
Copy link
Member Author

@jorisvandenbossche green. ok to merge?

@jorisvandenbossche jorisvandenbossche merged commit 1b48287 into pandas-dev:master Apr 15, 2021
@jorisvandenbossche
Copy link
Member

Yep, thanks!

@simonjayhawkins simonjayhawkins deleted the str-accessor branch April 15, 2021 08:51
yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request Apr 21, 2021
yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants