[ArrowStringArray] PERF: bypass some padding code in _wrap_result #41567

simonjayhawkins · 2021-05-19T12:07:18Z

The padding of np.nan across all columns is not needed for StringArray/ArrowStringArray

       before           after         ratio
     [896256ee]       [1c883e0d]
     <master>         <bypass>  
-      99.3±0.6ms         87.4±1ms     0.88  strings.Split.time_split('string', True)
-      89.6±0.3ms         78.7±1ms     0.88  strings.Split.time_rsplit('arrow_string', True)
-      71.3±0.5ms       58.7±0.8ms     0.82  strings.Split.time_rsplit('string', True)
-      60.0±0.4ms         49.1±1ms     0.82  strings.Methods.time_partition('arrow_string')
-      59.0±0.7ms         47.5±1ms     0.80  strings.Methods.time_rpartition('arrow_string')
-     50.5±0.08ms       38.3±0.4ms     0.76  strings.Methods.time_partition('string')
-      49.5±0.3ms       37.2±0.2ms     0.75  strings.Methods.time_rpartition('string')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

This is a small change for perf.

The padding logic does not work for str.extract, since the first value of a list may be np.nan followed by other values. I made that change here and then reverted it, as it is more or less perf neutral, so I will do it in another PR that uses _wrap_result more for str_extract. (str.extract returns a list of lists where any value in a list could be np.nan; str.split returns an array of lists of different lengths that contain no np.nan values, with missing values returned as scalar np.nan.)
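To make the distinction above concrete, here is a minimal pure-Python sketch of this kind of padding step. It is not the pandas implementation: `NAN`, `is_nan`, `cons_row` and `pad_rows` are hypothetical names chosen for illustration, and the rule "repeat a row whose first value is NaN" is an assumption based on the description above.

```python
import math

NAN = float("nan")  # stand-in for np.nan (assumption: no NumPy needed here)


def is_nan(value):
    """Return True only for a float NaN."""
    return isinstance(value, float) and math.isnan(value)


def cons_row(value):
    """Wrap a scalar in a list so that every row is a sequence."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def pad_rows(rows):
    """Repeat any row that starts with NaN so it spans the longest row."""
    rows = [cons_row(row) for row in rows]
    max_len = max((len(row) for row in rows), default=0)
    return [
        row * max_len if len(row) == 0 or is_nan(row[0]) else row
        for row in rows
    ]


# split-like input: lists without NaN inside, missing values as scalar NaN
split_like = [["a", "b", "c"], NAN, ["d"]]
print([len(row) for row in pad_rows(split_like)])  # [3, 3, 1]

# extract-like input: a row may start with NaN and still hold other values
extract_like = [[NAN, "b"], ["a", "b"]]
print([len(row) for row in pad_rows(extract_like)])  # [4, 2]
```

In the split-like case the scalar NaN row is widened to three NaN values, which is the intended effect. In the extract-like case the row `[NAN, "b"]` is duplicated to four elements, which shows why a check on the first value alone is not safe for that shape of input.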

Some further gains can also be achieved by bypassing the conversion of np.nan to [np.nan] for partition (for StringArray/ArrowStringArray), since the DataFrame constructor does not need this for a list of tuples. (str.partition returns an array of tuples; str.split returns an array of lists.) #41507


simonjayhawkins added 3 commits May 19, 2021 11:26

[ArrowStringArray] PERF: bypass some padding code in _wrap_result

c700144

use mask

8e584a4

Revert "use mask"

1c883e0

This reverts commit 8e584a4.

simonjayhawkins added the Performance (memory or execution speed performance) and Strings (string extension data type and string data) labels May 19, 2021

simonjayhawkins added this to the 1.3 milestone May 19, 2021

jreback merged commit 34b0bdc into pandas-dev:master May 21, 2021

simonjayhawkins deleted the bypass branch May 21, 2021 16:55

TLouf pushed a commit to TLouf/pandas that referenced this pull request Jun 1, 2021

[ArrowStringArray] PERF: bypass some padding code in _wrap_result (pandas-dev#41567)

2d7107c

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

[ArrowStringArray] PERF: bypass some padding code in _wrap_result (pandas-dev#41567)

d3720d0
