Skip to content

[ArrowStringArray] Use utf8_is_* functions from Apache Arrow if available #41041

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

simonjayhawkins
Copy link
Member

xref xhochy/fletcher#203

marked as draft since AFAICT there is performance issues with BooleanDtype().from_arrow

@simonjayhawkins simonjayhawkins added the Strings String extension data type and string data label Apr 19, 2021
@jorisvandenbossche
Copy link
Member

Cool!

marked as draft since AFAICT there is performance issues with BooleanDtype().from_arrow

I opened #41051. In the end it's only a modest speed-up (3x), but I don't think we can improve further ourselves. Also compared to the actual string operation, I assume this conversion will be fast enough.

@simonjayhawkins
Copy link
Member Author

simonjayhawkins commented Apr 20, 2021

with the changes in #41051 now actually see an improvement.

s = pd.Series(["a", None, "1"] * 100_000, dtype="string")
s2 = pd.Series(["a", None, "1"] * 100_000, dtype="arrow_string")

%timeit s.str.isalnum()
# 19.6 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- PR/master


%timeit s2.str.isalnum()
# 22.9 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- master
# 1.9 ms ± 6.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <-- PR

@simonjayhawkins
Copy link
Member Author

@jorisvandenbossche i'll leave this as draft till #41051 is merged.

@simonjayhawkins simonjayhawkins added the Performance Memory or execution speed performance label Apr 20, 2021
@jorisvandenbossche
Copy link
Member

I merged #41051, so you can update this now

@simonjayhawkins simonjayhawkins marked this pull request as ready for review April 22, 2021 13:57
@simonjayhawkins simonjayhawkins added this to the 1.3 milestone Apr 22, 2021
@simonjayhawkins
Copy link
Member Author

I merged #41051, so you can update this now

Thanks @jorisvandenbossche

Have also paramaterised some more tests for the is_* functions. side effect is that, to avoid xfailing or breaking up a test, have also "fixed" extract/extractall xfailed tests. could break-off the test changes into a precursor if needed.

@simonjayhawkins
Copy link
Member Author

[  5.56%] ··· strings.Methods.time_isalnum                                                                                                                ok
[  5.56%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        19.3±0ms 
                  string      12.2±0ms 
               arrow_string   2.55±0ms 
              ============== ==========

[ 11.11%] ··· strings.Methods.time_isalpha                                                                                                                ok
[ 11.11%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        15.9±0ms 
                  string      10.9±0ms 
               arrow_string   3.00±0ms 
              ============== ==========

[ 16.67%] ··· strings.Methods.time_isdecimal                                                                                                              ok
[ 16.67%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        15.3±0ms 
                  string      8.88±0ms 
               arrow_string   1.76±0ms 
              ============== ==========

[ 22.22%] ··· strings.Methods.time_isdigit                                                                                                                ok
[ 22.22%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        15.3±0ms 
                  string      8.84±0ms 
               arrow_string   1.82±0ms 
              ============== ==========

[ 27.78%] ··· strings.Methods.time_islower                                                                                                                ok
[ 27.78%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        17.6±0ms 
                  string      10.8±0ms 
               arrow_string   3.67±0ms 
              ============== ==========

[ 33.33%] ··· strings.Methods.time_isnumeric                                                                                                              ok
[ 33.33%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        15.2±0ms 
                  string      8.84±0ms 
               arrow_string   1.78±0ms 
              ============== ==========

[ 38.89%] ··· strings.Methods.time_isspace                                                                                                                ok
[ 38.89%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        14.1±0ms 
                  string      8.41±0ms 
               arrow_string   1.74±0ms 
              ============== ==========

[ 44.44%] ··· strings.Methods.time_istitle                                                                                                                ok
[ 44.44%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        17.3±0ms 
                  string      10.8±0ms 
               arrow_string   3.38±0ms 
              ============== ==========

[ 50.00%] ··· strings.Methods.time_isupper                                                                                                                ok
[ 50.00%] ··· ============== ==========
                  dtype                
              -------------- ----------
                   str        16.0±0ms 
                  string      10.2±0ms 
               arrow_string   3.55±0ms 
              ============== ==========

@@ -758,6 +759,69 @@ def _str_map(self, f, na_value=None, dtype: Dtype | None = None):
# -> We don't know the result type. E.g. `.get` can return anything.
return lib.map_infer_mask(arr, f, mask.view("uint8"))

def _str_isalnum(self):
if hasattr(pc, "utf8_is_alnum"):
result = pc.utf8_is_alnum(self._data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point (not necessarily this PR), it might be worth benchmarking to see if calling pc.string_is_ascii first to then potentially use pc.ascii_is_alnum instead of pc.utf8_is_alnum could be worth it (which would be assuming that testing whether it's all ascii takes much less time than the benefit from using the faster ascii algorithm vs the utf8 one)

"""
from pandas.core.arrays.string_arrow import ArrowStringDtype # noqa: F401

return request.param
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A similar fixture does not yet exist?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not yet. this is being added to a couple of PRs and can be promoted to a higher level conftest once one of the PRs has been merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sounds good

@jorisvandenbossche
Copy link
Member

Looking good, thanks for the benchmarks!

}:
reason = "extract/extractall does not yet dispatch to array"
mark = pytest.mark.xfail(reason=reason)
request.node.add_marker(mark)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is extract fixed now? (but not in this PR?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the tests no longer fail. there is a change in this PR that "fixes" by special casing ArrowStringArray (like StringArray). extract/extractall will still need to be updated to dispatch to the array. not in this PR. see #41041 (comment)

the change to pandas/core/strings/accessor.py makes ArrowStringArray work like StringArray. If we don't add the change I would need to xfail test_empty_str_methods which totally defeats the purpose of parameterising the tests to get extra test coverage for the is_methods. The alternative is to split test_empty_str_methods and xfail the extract/extractall tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks for the explanation. No need to split off, I was just wondering how the changes in this PR "fixed" it ;)

@jorisvandenbossche
Copy link
Member

I don't expect any problem (since you check the presence of the attribute), but can you merge latest master (I just merged a PR to fix CI build with pyarrow 0.15-

@jorisvandenbossche jorisvandenbossche merged commit 44de181 into pandas-dev:master Apr 25, 2021
@jorisvandenbossche
Copy link
Member

Thanks!

@simonjayhawkins simonjayhawkins deleted the Use-is--functions-from-Arrow branch April 25, 2021 15:48
yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants