[ArrowStringArray] Use `utf8_is_*` functions from Apache Arrow if available #41041

simonjayhawkins · 2021-04-19T13:47:43Z

Marked as draft since AFAICT there are performance issues with `BooleanDtype().from_arrow`.


jorisvandenbossche · 2021-04-20T07:55:09Z

Cool!

> marked as draft since AFAICT there is performance issues with BooleanDtype().from_arrow

I opened #41051. In the end it's only a modest speed-up (3x), but I don't think we can improve further ourselves. Also compared to the actual string operation, I assume this conversion will be fast enough.


simonjayhawkins · 2021-04-20T11:50:32Z

With the changes in #41051, I now actually see an improvement.

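import pandas as pd

# Imported for its side effect, as in the test fixture in this PR:
# it makes the "arrow_string" dtype name available.
from pandas.core.arrays.string_arrow import ArrowStringDtype  # noqa: F401

s = pd.Series(["a", None, "1"] * 100_000, dtype="string")
s2 = pd.Series(["a", None, "1"] * 100_000, dtype="arrow_string")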

%timeit s.str.isalnum()
# 19.6 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- PR/master


%timeit s2.str.isalnum()
# 22.9 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- master
# 1.9 ms ± 6.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <-- PR

simonjayhawkins · 2021-04-20T11:53:38Z

@jorisvandenbossche I'll leave this as a draft until #41051 is merged.

jorisvandenbossche · 2021-04-21T16:10:15Z

I merged #41051, so you can update this now.


simonjayhawkins · 2021-04-22T14:05:29Z

> I merged #41051, so you can update this now

Thanks @jorisvandenbossche

I have also parametrised some more tests for the is_* functions. A side effect is that, to avoid xfailing or breaking up a test, I have also "fixed" the xfailed extract/extractall tests. I could break off the test changes into a precursor PR if needed.

simonjayhawkins · 2021-04-22T21:09:31Z

[  5.56%] ··· strings.Methods.time_isalnum                                                                                                                ok
[  5.56%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        19.3±0ms 
                  string      12.2±0ms 
               arrow_string   2.55±0ms 
              ============== ==========

[ 11.11%] ··· strings.Methods.time_isalpha                                                                                                                ok
[ 11.11%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        15.9±0ms 
                  string      10.9±0ms 
               arrow_string   3.00±0ms 
              ============== ==========

[ 16.67%] ··· strings.Methods.time_isdecimal                                                                                                              ok
[ 16.67%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        15.3±0ms 
                  string      8.88±0ms 
               arrow_string   1.76±0ms 
              ============== ==========

[ 22.22%] ··· strings.Methods.time_isdigit                                                                                                                ok
[ 22.22%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        15.3±0ms 
                  string      8.84±0ms 
               arrow_string   1.82±0ms 
              ============== ==========

[ 27.78%] ··· strings.Methods.time_islower                                                                                                                ok
[ 27.78%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        17.6±0ms 
                  string      10.8±0ms 
               arrow_string   3.67±0ms 
              ============== ==========

[ 33.33%] ··· strings.Methods.time_isnumeric                                                                                                              ok
[ 33.33%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        15.2±0ms 
                  string      8.84±0ms 
               arrow_string   1.78±0ms 
              ============== ==========

[ 38.89%] ··· strings.Methods.time_isspace                                                                                                                ok
[ 38.89%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        14.1±0ms 
                  string      8.41±0ms 
               arrow_string   1.74±0ms 
              ============== ==========

[ 44.44%] ··· strings.Methods.time_istitle                                                                                                                ok
[ 44.44%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        17.3±0ms 
                  string      10.8±0ms 
               arrow_string   3.38±0ms 
              ============== ==========

[ 50.00%] ··· strings.Methods.time_isupper                                                                                                                ok
[ 50.00%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        16.0±0ms 
                  string      10.2±0ms 
               arrow_string   3.55±0ms 
              ============== ==========

jorisvandenbossche · 2021-04-22T21:23:29Z

pandas/core/arrays/string_arrow.py

@@ -758,6 +759,69 @@ def _str_map(self, f, na_value=None, dtype: Dtype | None = None):
            # -> We don't know the result type. E.g. `.get` can return anything.
            return lib.map_infer_mask(arr, f, mask.view("uint8"))

+    def _str_isalnum(self):
+        if hasattr(pc, "utf8_is_alnum"):
+            result = pc.utf8_is_alnum(self._data)


At some point (not necessarily in this PR), it might be worth benchmarking whether calling `pc.string_is_ascii` first, and then using `pc.ascii_is_alnum` instead of `pc.utf8_is_alnum` when everything is ASCII, pays off. That assumes the all-ASCII check takes much less time than what the faster ASCII algorithm saves over the UTF-8 one.

jorisvandenbossche · 2021-04-22T21:24:20Z

pandas/tests/strings/test_strings.py

+    """
+    from pandas.core.arrays.string_arrow import ArrowStringDtype  # noqa: F401
+
+    return request.param


Does a similar fixture not exist yet?

Not yet. This is being added in a couple of PRs and can be promoted to a higher-level conftest once one of them has been merged.

OK, sounds good

jorisvandenbossche · 2021-04-22T21:25:11Z

Looking good, thanks for the benchmarks!

jorisvandenbossche · 2021-04-23T08:11:15Z

pandas/tests/strings/test_string_array.py

-    }:
-        reason = "extract/extractall does not yet dispatch to array"
-        mark = pytest.mark.xfail(reason=reason)
-        request.node.add_marker(mark)


Is extract fixed now? (but not in this PR?)

The tests no longer fail. There is a change in this PR that "fixes" them by special-casing ArrowStringArray (like StringArray). extract/extractall will still need to be updated to dispatch to the array, but not in this PR. See #41041 (comment).

The change to pandas/core/strings/accessor.py makes ArrowStringArray work like StringArray. Without it, I would need to xfail test_empty_str_methods, which totally defeats the purpose of parametrising the tests to get extra coverage for the is_* methods. The alternative is to split test_empty_str_methods and xfail the extract/extractall tests.

OK, thanks for the explanation. No need to split off, I was just wondering how the changes in this PR "fixed" it ;)

jorisvandenbossche · 2021-04-23T10:40:59Z

I don't expect any problem (since you check the presence of the attribute), but can you merge latest master (I just merged a PR to fix CI build with pyarrow 0.15-
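The attribute check mentioned here (the `hasattr(pc, "utf8_is_alnum")` guard in the diff) follows a common feature-detection pattern. A minimal standalone sketch, assuming a hypothetical function `isalnum_if_available` that works on plain Python lists rather than on `ArrowStringArray`:

```python
try:
    import pyarrow as pa
    import pyarrow.compute as pc
except ImportError:  # pyarrow is treated as optional in this sketch
    pa = pc = None


def isalnum_if_available(values):
    """Hypothetical helper: use the Arrow kernel if present, else a Python loop."""
    if pc is not None and hasattr(pc, "utf8_is_alnum"):
        # Fast path: vectorised Arrow kernel, nulls propagate as None.
        arr = pa.array(values, type=pa.string())
        return pc.utf8_is_alnum(arr).to_pylist()
    # Fallback for missing or older pyarrow: element-wise str.isalnum.
    return [None if v is None else v.isalnum() for v in values]


flags = isalnum_if_available(["a", None, "1", "a b"])
```

The two paths are not guaranteed to agree on every Unicode code point, so a real implementation would want tests covering both.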


jorisvandenbossche · 2021-04-25T13:17:21Z

Thanks!


[ArrowStringArray] Use utf8_is_* functions from Apache Arrow if ava…

2e26503

…ilable

simonjayhawkins added the Strings String extension data type and string data label Apr 19, 2021

PERF: optimize conversion from boolean Arrow array to masked BooleanA…

0987b0e

…rray

simonjayhawkins added the Performance Memory or execution speed performance label Apr 20, 2021

simonjayhawkins added 2 commits April 22, 2021 10:28

Merge remote-tracking branch 'upstream/master' into Use-is--functions…

68ec0ba

…-from-Arrow

more testing

1842826

simonjayhawkins marked this pull request as ready for review April 22, 2021 13:57

simonjayhawkins added this to the 1.3 milestone Apr 22, 2021

add benchmarks

9e2c11b

jorisvandenbossche reviewed Apr 22, 2021


simonjayhawkins mentioned this pull request Apr 23, 2021

[ArrowStringArray] Use utf8_upper and utf8_lower functions from Apache Arrow #41056

Merged

jorisvandenbossche reviewed Apr 23, 2021


jorisvandenbossche approved these changes Apr 23, 2021


Merge remote-tracking branch 'upstream/master' into Use-is--functions…

7fb72f5

…-from-Arrow

jorisvandenbossche merged commit 44de181 into pandas-dev:master Apr 25, 2021

simonjayhawkins deleted the Use-is--functions-from-Arrow branch April 25, 2021 15:48

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021

[ArrowStringArray] Use utf8_is_* functions from Apache Arrow if ava…

0740197

…ilable (pandas-dev#41041)

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

[ArrowStringArray] Use utf8_is_* functions from Apache Arrow if ava…

66310d0

…ilable (pandas-dev#41041)
