ENH: [ArrowStringArray] Enable the string methods for the arrow-backed StringArray #40708

simonjayhawkins · 2021-03-31T17:47:30Z

Enable the string methods for the arrow-backed StringArray. This might also need some additional changes in the string accessor code (e.g. to dispatch extract to the underlying array as well).

simonjayhawkins · 2021-03-31T17:49:13Z

pandas/_libs/lib.pyx

@@ -1110,6 +1110,7 @@ _TYPE_MAP = {
    "complex128": "complex",
    "c": "complex",
    "string": "string",
+    "arrow_string": "string",


PR as draft. This will be resolved in a follow-up to #40679. #40679 (comment)

the inference is needed so that the str accessor works.
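
To illustrate why the extra map entry matters, here is a minimal pure-Python sketch. The names `TYPE_MAP`, `ALLOWED_KINDS`, `infer_kind` and `supports_str_accessor` are hypothetical stand-ins and not the pandas internals; the sketch assumes the accessor only accepts data whose inferred kind is in an allowed set.

```python
# Hypothetical, simplified stand-in for a dtype-name -> inferred-kind lookup.
TYPE_MAP = {
    "complex128": "complex",
    "c": "complex",
    "string": "string",
    "arrow_string": "string",  # the entry added in this PR
}

# Assumption: the accessor accepts only these inferred kinds.
ALLOWED_KINDS = {"string", "empty", "bytes", "mixed", "mixed-integer"}


def infer_kind(dtype_name: str) -> str:
    """Return the inferred kind for a dtype name, or "unknown" if unmapped."""
    return TYPE_MAP.get(dtype_name, "unknown")


def supports_str_accessor(dtype_name: str) -> bool:
    """True when the inferred kind is one the (hypothetical) accessor accepts."""
    return infer_kind(dtype_name) in ALLOWED_KINDS
```

Without the `"arrow_string"` entry, `infer_kind` would fall through to `"unknown"` in this sketch and the accessor check would reject the array.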

pandas/conftest.py

simonjayhawkins · 2021-03-31T17:54:32Z

pandas/core/arrays/string_arrow.py

@@ -43,6 +47,7 @@
    check_array_indexer,
    validate_indices,
 )
+from pandas.core.strings.object_array import ObjectStringArrayMixin


I think the goal is to inherit from BaseStringArrayMethods. (I recall this being mentioned somewhere). For now use ObjectStringArrayMixin similar to fletcher xhochy/fletcher#196

This is because you want to have the object-dtype based methods as fallback for the ones that pyarrow doesn't yet support, I suppose?

Yep. This PR just gets the string methods working for the existing tests we have for StringArray, so it just converts to object and does not use native pyarrow functions yet.

Can you add a small comment (eg below where the class is created) about why ObjectStringArrayMixin is mixed in?
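
The fallback idea discussed in this thread can be sketched in plain Python. This is only an illustration under stated assumptions: `ObjectFallbackMixin` and `ToyArrowStringArray` are hypothetical toy classes, not the pandas `ObjectStringArrayMixin` or `ArrowStringArray`, and the "native" override just stands in for a future pyarrow kernel.

```python
class ObjectFallbackMixin:
    """Hypothetical mixin: string methods implemented on plain Python objects."""

    def _to_objects(self):
        # Subclasses must expose their values as Python objects (None = missing).
        raise NotImplementedError

    def _str_upper(self):
        return [None if x is None else x.upper() for x in self._to_objects()]

    def _str_len(self):
        return [None if x is None else len(x) for x in self._to_objects()]


class ToyArrowStringArray(ObjectFallbackMixin):
    """Toy array that pretends its storage is an opaque, arrow-like buffer."""

    def __init__(self, values):
        self._storage = tuple(values)

    def _to_objects(self):
        # Convert to Python objects so the mixin's fallback methods can run.
        return list(self._storage)

    def _str_len(self):
        # Override point: stands in for a native kernel replacing the fallback.
        return [None if x is None else len(x) for x in self._storage]
```

Here `_str_upper` comes from the object-based fallback, while `_str_len` shows where a native implementation could be swapped in later without touching callers.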

simonjayhawkins · 2021-03-31T17:55:55Z

pandas/core/arrays/string_arrow.py

+
+    _str_na_value = ArrowStringDtype.na_value
+
+    def _str_map(self, f, na_value=None, dtype: Dtype | None = None):


More or less cut and paste from StringArray.

This could be shared? (eg move it to a common helper function or mixin?)

indeed. I think the de-duplication is better as an immediate follow-up to keep changes to existing code paths (i.e. StringArray) in a separate PR and keep this one scoped to just additions.

either is fine for me

eg move it to a common helper function or mixin

or have a common base class. #35169 (comment)

so better as a follow-on to allow for more discussion
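
One possible shape for the de-duplication, sketched with toy classes. Everything here is hypothetical: `StrMapMixin`, `ToyStringArray` and `ToyArrowStringArray` are illustrative names, and this `_str_map` is a simplified take on the idea (map a function over non-missing values), not the pandas implementation.

```python
from typing import Any, Callable, List


class StrMapMixin:
    """Hypothetical shared home for a simplified _str_map."""

    _str_na_value: Any = None

    def _values_as_objects(self) -> List[Any]:
        raise NotImplementedError

    def _str_map(self, f: Callable[[str], Any], na_value: Any = None) -> List[Any]:
        # Fall back to the class-level missing value when none is given.
        if na_value is None:
            na_value = self._str_na_value
        return [na_value if v is None else f(v) for v in self._values_as_objects()]


class ToyStringArray(StrMapMixin):
    def __init__(self, values):
        self._data = list(values)

    def _values_as_objects(self):
        return self._data


class ToyArrowStringArray(StrMapMixin):
    def __init__(self, values):
        self._storage = tuple(values)  # pretend this is an arrow buffer

    def _values_as_objects(self):
        return list(self._storage)
```

Both toy arrays get identical mapping behaviour from a single definition, which is the point of the proposed follow-up.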

simonjayhawkins · 2021-03-31T17:57:00Z

pandas/core/strings/accessor.py

@@ -316,7 +317,7 @@ def cons_row(x):
            # This is a mess.
            dtype: Optional[str]
            if self._is_string and returns_string:
-                dtype = "string"
+                dtype = self._orig.dtype


special case here for partition/rpartition/split where the array method returns an object array
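
A minimal sketch of the dtype choice in the hunk above. `result_dtype` is a hypothetical helper, and the `None` fallback for the non-string case is an assumption about the surrounding code, which the diff does not show.

```python
from typing import Optional


def result_dtype(orig_dtype: str, is_string: bool, returns_string: bool) -> Optional[str]:
    """Pick the dtype used when wrapping a string-method result."""
    if is_string and returns_string:
        # Reuse the caller's dtype rather than hard-coding "string", so an
        # arrow-backed input is wrapped back into an arrow-backed result.
        return orig_dtype
    # Assumption: otherwise the dtype is left for later inference.
    return None
```

With the hard-coded `"string"`, an `"arrow_string"` input would have been wrapped as the object-backed string dtype; passing the original dtype through preserves the backing.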

pandas/tests/arrays/string_/test_string.py

simonjayhawkins · 2021-03-31T17:58:34Z

pandas/tests/strings/test_string_array.py

+    if nullable_string_dtype == "arrow_string" and method_name in {
+        "extract",
+        "extractall",
+    }:


could either special case, refactor to dispatch to array or leave as follow-up and xfail for now.

I think the xfail is fine for now

agreed. keeps the scope of this PR limited.
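
The condition in the hunk above can be isolated as a small predicate. `XFAIL_METHODS` and `needs_xfail` are hypothetical names used only for this sketch.

```python
# Methods not yet dispatched to the array, per the discussion above.
XFAIL_METHODS = {"extract", "extractall"}


def needs_xfail(dtype_name: str, method_name: str) -> bool:
    """True for the dtype/method combinations expected to fail for now."""
    return dtype_name == "arrow_string" and method_name in XFAIL_METHODS
```

In a pytest test, one way to use such a predicate would be to call `request.node.add_marker(pytest.mark.xfail(reason="..."))` when it returns true, so the remaining dtype/method combinations still run normally.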

simonjayhawkins · 2021-03-31T17:59:28Z

pandas/tests/strings/test_string_array.py

@@ -73,33 +81,38 @@ def test_string_array_numeric_integer_array(method, expected):
        ("isdigit", [False, None, True]),
        ("isalpha", [True, None, False]),
        ("isalnum", [True, None, True]),
-        ("isdigit", [False, None, True]),


another dup

Maybe can add "isnumeric" instead, which doesn't seem to be tested

sure. "isnumeric" and other "is" functions are tested to some degree by _any_string_method fixture used in test_string_array
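
For reference, the expected values in the parametrization can be reproduced with plain Python `str` methods. This sketch assumes the test input is `["a", None, "1"]`, which is consistent with the expected lists shown in the diff but is not itself visible there; `apply_bool_method` is a hypothetical helper.

```python
def apply_bool_method(values, method_name):
    """Call a str predicate on each value, propagating None for missing entries."""
    return [None if v is None else getattr(v, method_name)() for v in values]


values = ["a", None, "1"]  # assumed test input

expected = {
    "isdigit": [False, None, True],
    "isalpha": [True, None, False],
    "isalnum": [True, None, True],
    "isnumeric": [False, None, True],  # the case suggested in this thread
}
```

On this input `isnumeric` gives the same result as `isdigit`, so it adds coverage for the method itself rather than for a new value pattern.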

jorisvandenbossche

Generally looks good to me, a few comments

jorisvandenbossche · 2021-04-01T13:48:22Z

pandas/core/arrays/string_arrow.py

@@ -43,6 +47,7 @@
    check_array_indexer,
    validate_indices,
 )
+from pandas.core.strings.object_array import ObjectStringArrayMixin


This is because you want to have the object-dtype based methods as fallback for the ones that pyarrow doesn't yet support, I suppose?

jorisvandenbossche · 2021-04-01T13:51:13Z

pandas/core/arrays/string_arrow.py

+
+    _str_na_value = ArrowStringDtype.na_value
+
+    def _str_map(self, f, na_value=None, dtype: Dtype | None = None):


This could be shared? (eg move it to a common helper function or mixin?)

pandas/tests/arrays/string_/test_string.py

jorisvandenbossche · 2021-04-01T14:04:35Z

pandas/tests/strings/test_string_array.py

+    if nullable_string_dtype == "arrow_string" and method_name in {
+        "extract",
+        "extractall",
+    }:


I think the xfail is fine for now

jorisvandenbossche · 2021-04-01T14:09:22Z

pandas/tests/strings/test_string_array.py

@@ -73,33 +81,38 @@ def test_string_array_numeric_integer_array(method, expected):
        ("isdigit", [False, None, True]),
        ("isalpha", [True, None, False]),
        ("isalnum", [True, None, True]),
-        ("isdigit", [False, None, True]),


Maybe can add "isnumeric" instead, which doesn't seem to be tested

jorisvandenbossche · 2021-04-02T13:15:35Z

@simonjayhawkins is this still draft? (it seems ready to me)

simonjayhawkins · 2021-04-02T14:00:30Z

this one goes in after #40725, #40708 (comment), over 80% complete on the benchmarks.

jorisvandenbossche · 2021-04-08T08:48:29Z

To be able to move forward with the different PRs, I think it is perfectly fine to merge this with the "arrow_string": "string", in infer_dtype (which is the name of the dtype on current master), and then this can be updated when the name gets changed or in #40725

jorisvandenbossche · 2021-04-09T11:56:17Z

@simonjayhawkins this is ready for review/merge now?

simonjayhawkins · 2021-04-09T12:06:56Z

hopefully, will mark as ready for review once ci is green.

simonjayhawkins · 2021-04-09T12:22:31Z

@jorisvandenbossche green

jorisvandenbossche

Can you do one more merge of master, and then can merge on green?

You are planning to do a follow-up then to reduce the duplication?

jorisvandenbossche · 2021-04-15T07:09:26Z

pandas/core/arrays/string_arrow.py

@@ -43,6 +47,7 @@
    check_array_indexer,
    validate_indices,
 )
+from pandas.core.strings.object_array import ObjectStringArrayMixin


Can you add a small comment (eg below where the class is created) about why ObjectStringArrayMixin is mixed in?

simonjayhawkins · 2021-04-15T08:44:47Z

@jorisvandenbossche green. ok to merge?

jorisvandenbossche · 2021-04-15T08:50:35Z

Yep, thanks!

simonjayhawkins added 8 commits March 31, 2021 10:44

ArrowStringArrayMixin

9b6c054

move _str_map to ArrowStringArray

c0fedcd

test_str_get_stringarray_multiple_nans

e603321

flake code

37035b4

test assertions

c823953

remove xfail from test_string_methods

76292ec

xfail extract/extractall tests - out of scope for this PR

50db876

special case in _wrap_result

374924b

simonjayhawkins added Enhancement Strings String extension data type and string data labels Mar 31, 2021

simonjayhawkins added this to the 1.3 milestone Mar 31, 2021

simonjayhawkins commented Mar 31, 2021

pandas/conftest.py (resolved)

pandas/tests/arrays/string_/test_string.py (resolved)

jorisvandenbossche reviewed Apr 1, 2021

simonjayhawkins added 2 commits April 1, 2021 16:10

Merge remote-tracking branch 'upstream/master' into str-accessor

23b40e3

add isnumeric to test_string_array_boolean_array

4b86f67

simonjayhawkins mentioned this pull request Apr 2, 2021

DOC: [ArrowStringArray] release note and other documentation #40747

Closed

simonjayhawkins added 2 commits April 9, 2021 11:35

Merge remote-tracking branch 'upstream/master' into str-accessor

7bd82f9

mypy fixup

aaf54ca

simonjayhawkins marked this pull request as ready for review April 9, 2021 12:22

simonjayhawkins added 2 commits April 13, 2021 14:11

Merge remote-tracking branch 'upstream/master' into str-accessor

b1cf83d

Merge remote-tracking branch 'upstream/master' into str-accessor

36d0034

jorisvandenbossche approved these changes Apr 15, 2021

simonjayhawkins added 2 commits April 15, 2021 08:26

Merge remote-tracking branch 'upstream/master' into str-accessor

f479819

add comments

d1c8a3e

jorisvandenbossche merged commit 1b48287 into pandas-dev:master Apr 15, 2021

simonjayhawkins deleted the str-accessor branch April 15, 2021 08:51

This was referenced Apr 15, 2021

[ArrowStringArray] CLN: move and rename test_string_methods #40960

Merged

[ArrowStringArray] API: StringArray -> ObjectStringArray #40962

Closed

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request Apr 21, 2021

ENH: [ArrowStringArray] Enable the string methods for the arrow-backed StringArray (pandas-dev#40708)

428ae90

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021

ENH: [ArrowStringArray] Enable the string methods for the arrow-backed StringArray (pandas-dev#40708)

07d8ee3

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

ENH: [ArrowStringArray] Enable the string methods for the arrow-backed StringArray (pandas-dev#40708)

d55790c

rhshadrach mentioned this pull request May 18, 2025

BUG: documented usage of of str.split(...).str.get fails on dtype large_string[pyarrow] #61431

Open

3 tasks

