-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: [ArrowStringArray] Enable the string methods for the arrow-backed StringArray #40708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: [ArrowStringArray] Enable the string methods for the arrow-backed StringArray #40708
Conversation
pandas/_libs/lib.pyx
Outdated
@@ -1110,6 +1110,7 @@ _TYPE_MAP = { | |||
"complex128": "complex", | |||
"c": "complex", | |||
"string": "string", | |||
"arrow_string": "string", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR as draft. This will be resolved in a follow-up to #40679. #40679 (comment)
the inference is needed so that the str accessor works.
@@ -43,6 +47,7 @@ | |||
check_array_indexer, | |||
validate_indices, | |||
) | |||
from pandas.core.strings.object_array import ObjectStringArrayMixin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the goal is to inherit from BaseStringArrayMethods. (I recall this being mentioned somewhere). For now use ObjectStringArrayMixin similar to fletcher xhochy/fletcher#196
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because you want to have the object-dtype based methods as fallback for the ones that pyarrow doesn't yet support, I suppose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep. This PR just gets the string methods working for the existing tests we have StringArray. So just converting to object and not using native pyarrow functions yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a small comment about that (eg below where the class is created) about why ObjectStringArrayMixin
is mixed in?
|
||
_str_na_value = ArrowStringDtype.na_value | ||
|
||
def _str_map(self, f, na_value=None, dtype: Dtype | None = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moreless cut and paste from StringArray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be shared? (eg move it a common helper function or mixin?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed. I think the de-duplication is better as an immediate follow-up to keep changes to existing code paths (i.e. StringArray) in a separate PR and keep this one scoped to just additions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either is fine for me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eg move it a common helper function or mixin
or have a common base class. #35169 (comment)
so better as a follow-on to allow for more discussion
@@ -316,7 +317,7 @@ def cons_row(x): | |||
# This is a mess. | |||
dtype: Optional[str] | |||
if self._is_string and returns_string: | |||
dtype = "string" | |||
dtype = self._orig.dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
special case here for partition/rpartition/split where the array method returns an object array
if nullable_string_dtype == "arrow_string" and method_name in { | ||
"extract", | ||
"extractall", | ||
}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could either special case, refactor to dispatch to array or leave as follow-up and xfail for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the xfail is fine for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed. keeps the scope of this PR limited.
@@ -73,33 +81,38 @@ def test_string_array_numeric_integer_array(method, expected): | |||
("isdigit", [False, None, True]), | |||
("isalpha", [True, None, False]), | |||
("isalnum", [True, None, True]), | |||
("isdigit", [False, None, True]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another dup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe can add "isnumeric" instead, which doesn't seem to be tested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. "isnumeric" and other "is" functions are tested to some degree by _any_string_method fixture used in test_string_array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good to me, a few comments
@@ -43,6 +47,7 @@ | |||
check_array_indexer, | |||
validate_indices, | |||
) | |||
from pandas.core.strings.object_array import ObjectStringArrayMixin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because you want to have the object-dtype based methods as fallback for the ones that pyarrow doesn't yet support, I suppose?
|
||
_str_na_value = ArrowStringDtype.na_value | ||
|
||
def _str_map(self, f, na_value=None, dtype: Dtype | None = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be shared? (eg move it a common helper function or mixin?)
if nullable_string_dtype == "arrow_string" and method_name in { | ||
"extract", | ||
"extractall", | ||
}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the xfail is fine for now
@@ -73,33 +81,38 @@ def test_string_array_numeric_integer_array(method, expected): | |||
("isdigit", [False, None, True]), | |||
("isalpha", [True, None, False]), | |||
("isalnum", [True, None, True]), | |||
("isdigit", [False, None, True]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe can add "isnumeric" instead, which doesn't seem to be tested
@simonjayhawkins is this still draft? (it seems ready to me) |
this one goes in after #40725, #40708 (comment), over 80% complete on the benchmarks. |
To be able to move forward with the different PRs, I think it is perfectly fine to merge this with the |
@simonjayhawkins this is ready for review/merge now? |
hopefully, will mark as ready for review once ci is green. |
@jorisvandenbossche green |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do one more merge of master, and then can merge on green?
You are planning to do a follow-up then to reduce the duplication?
@@ -43,6 +47,7 @@ | |||
check_array_indexer, | |||
validate_indices, | |||
) | |||
from pandas.core.strings.object_array import ObjectStringArrayMixin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a small comment about that (eg below where the class is created) about why ObjectStringArrayMixin
is mixed in?
@jorisvandenbossche green. ok to merge? |
Yep, thanks! |
from #35169 (comment)