Skip to content

[ArrowStringArray] REF: str.extract dispatch to array #41372

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 14 commits into from

Conversation

simonjayhawkins
Copy link
Member

       before           after         ratio
     [170b4391]       [62cc9c38]
     <master>         <_str_extract>
-       138±0.4ms          117±2ms     0.85  strings.Methods.time_extract('string')
-      14.4±0.6ms      11.9±0.06ms     0.82  strings.Methods.time_isdigit('str')
-       109±0.4ms       81.1±0.9ms     0.74  strings.Split.time_split(True)
-      82.3±0.3ms       46.3±0.2ms     0.56  strings.Split.time_rsplit(True)
-      50.0±0.4ms       27.3±0.4ms     0.55  strings.Methods.time_partition('string')
-      49.0±0.2ms       26.4±0.4ms     0.54  strings.Methods.time_rpartition('string')
-      55.3±0.5ms       28.8±0.3ms     0.52  strings.Methods.time_partition('str')
-      54.8±0.4ms       27.8±0.2ms     0.51  strings.Methods.time_rpartition('str')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

marked as draft since it may be better to split the perf improvements from the dispatch to array refactor and also may be better to finish #41085 first.

also need to complete the parameterisation of the extract tests.

also maybe able to reduce the _wrap_result changes by passing a 2d array without expand kwarg so that expand is inferred and the code removed here will be bypassed. so will look into that before marking as ready for review.

if I can reduce the diff here (even if only by doing the test parameterisation as a precursor), then may be worth adding the pyarrow native implementation here also.

@simonjayhawkins simonjayhawkins added the Strings String extension data type and string data label May 7, 2021
@simonjayhawkins simonjayhawkins marked this pull request as draft May 7, 2021 16:35
@simonjayhawkins
Copy link
Member Author

also maybe able to reduce the _wrap_result changes by passing a 2d array without expand kwarg so that expand is inferred and the code removed here will be bypassed. so will look into that before marking as ready for review.

doing this affects performance, will keep trying.

       before           after         ratio
     [40482273]       [75f9429f]
     <master>         <_str_extract>
+         142±2ms        250±0.7ms     1.76  strings.Methods.time_extract('string')
+       118±0.9ms        134±0.1ms     1.14  strings.Methods.time_extract('str')
-      13.6±0.7ms      11.1±0.08ms     0.82  strings.Methods.time_isspace('str')
-       358±0.7ms          251±1ms     0.70  strings.Methods.time_extract('arrow_string')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@simonjayhawkins
Copy link
Member Author

doing this affects performance, will keep trying.

IIRC, the latest commit showed improvements, but since there have been perf improvements on master #41490, we are seeing worse perf.

I doesn't seem clean, but the best way forward maybe to use python list comps for object dtypes instead of _str_map (which wraps map_infer_mask) or update _str_map to use list comps for object dtype.

       before           after         ratio
     [671cf86a]       [ffd7ee3b]
     <master>         <_str_extract>
+         117±2ms          201±2ms     1.72  strings.Methods.time_extract('string')
+         122±2ms          206±2ms     1.69  strings.Methods.time_extract('arrow_string')
+         112±2ms        135±0.3ms     1.20  strings.Methods.time_extract('str')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@simonjayhawkins
Copy link
Member Author

IIRC, the latest commit showed improvements

just to be clear, improvements over the previous commit, not master.

@jorisvandenbossche
Copy link
Member

@simonjayhawkins this PR doesn't add the use of a pyarrow.compute function? (I think extract_regex is the relevant one?)

@jorisvandenbossche
Copy link
Member

I haven't been fully following this PR up to now and the precursors you did, but can you try to explain again what the concern is here / why some benchmarks become slower (by refactoring part to dispatch to the array method)

@simonjayhawkins
Copy link
Member Author

@simonjayhawkins this PR doesn't add the use of a pyarrow.compute function? (I think extract_regex is the relevant one?)

see op. will maybe add here if diff is reduced.

I haven't been fully following this PR up to now and the precursors you did, but can you try to explain again what the concern is here / why some benchmarks become slower (by refactoring part to dispatch to the array method)

it's not the dispatch to array as such, but trying to use _str_map and _wrap_result cleanly and avoid union return types in the array function.

_str_map doesn't allow assigning a list as the scalar and is slower than a list comp
_wrap_result has some padding code that is needed for split that we need to bypass and also unlike other methods when expand is True, extract returns a DataFrame and not a MultiIndex, so we would probably need to special case _wrap_result anyway.

these DO NOT stop us dispatching to the array. probably the next PR now that #41539 is merged will do that.

@simonjayhawkins
Copy link
Member Author

I haven't been fully following this PR up to now and the precursors you did, but can you try to explain again what the concern is here / why some benchmarks become slower (by refactoring part to dispatch to the array method)

once #41541 is merged and _str_extract is moved to the array (following #41539 it's not much more than a move and add convert parameter to _str_map... unlike other string methods str.extract always returns a string-like array) the diff here should reduce considerably and the nits should be more obvious. As much as the union return type bugs me, a perf regression bugs me more.

@simonjayhawkins
Copy link
Member Author

these DO NOT stop us dispatching to the array. probably the next PR now that #41539 is merged will do that.

after #41541 is merged, to avoid conflicts.

@simonjayhawkins
Copy link
Member Author

after #41541 is merged, to avoid conflicts.

will close till then to help clear the queue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants