
[ArrowStringArray] PERF: bypass some padding code in _wrap_result #41567


Merged
merged 3 commits into from
May 21, 2021


simonjayhawkins
Member

@simonjayhawkins simonjayhawkins commented May 19, 2021

Padding np.nan across all columns is not needed for StringArray/ArrowStringArray.

       before           after         ratio
     [896256ee]       [1c883e0d]
     <master>         <bypass>  
-      99.3±0.6ms         87.4±1ms     0.88  strings.Split.time_split('string', True)
-      89.6±0.3ms         78.7±1ms     0.88  strings.Split.time_rsplit('arrow_string', True)
-      71.3±0.5ms       58.7±0.8ms     0.82  strings.Split.time_rsplit('string', True)
-      60.0±0.4ms         49.1±1ms     0.82  strings.Methods.time_partition('arrow_string')
-      59.0±0.7ms         47.5±1ms     0.80  strings.Methods.time_rpartition('arrow_string')
-     50.5±0.08ms       38.3±0.4ms     0.76  strings.Methods.time_partition('string')
-      49.5±0.3ms       37.2±0.2ms     0.75  strings.Methods.time_rpartition('string')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
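As a reading aid for the table above: the asv `ratio` column is the "after" time divided by the "before" time, so values below 1 mean the branch is faster. The sketch below recomputes the ratios from the rounded timings shown in the table. The dictionary and helper names are illustrative only, and a recomputed ratio can differ from the reported one in the last digit, presumably because asv works from unrounded timings.

```python
# (before_ms, after_ms, reported_ratio), copied from the benchmark output above.
ROWS = {
    "Split.time_split('string', True)": (99.3, 87.4, 0.88),
    "Split.time_rsplit('arrow_string', True)": (89.6, 78.7, 0.88),
    "Split.time_rsplit('string', True)": (71.3, 58.7, 0.82),
    "Methods.time_partition('arrow_string')": (60.0, 49.1, 0.82),
    "Methods.time_rpartition('arrow_string')": (59.0, 47.5, 0.80),
    "Methods.time_partition('string')": (50.5, 38.3, 0.76),
    "Methods.time_rpartition('string')": (49.5, 37.2, 0.75),
}


def ratio(before_ms, after_ms):
    """Return after/before; a value below 1.0 means the new code is faster."""
    return after_ms / before_ms


def time_saved_percent(before_ms, after_ms):
    """Return the share of the original runtime that was removed, in percent."""
    return (1.0 - ratio(before_ms, after_ms)) * 100.0


for name, (before, after, reported) in ROWS.items():
    print(
        f"{name}: ratio {ratio(before, after):.3f} "
        f"(reported {reported:.2f}), "
        f"{time_saved_percent(before, after):.1f}% less time"
    )
```

On these numbers the saving ranges from roughly 12% for split to roughly 25% for rpartition on `string` dtype.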

This is a small change for performance.

The padding logic does not work for str.extract, since the first value of a list may be np.nan followed by other values. I made that change here and then reverted it, as it is more or less perf neutral, so I will do it in another PR that uses _wrap_result more for str_extract. (str.extract returns a list of lists where any value in a list could be np.nan; str.split returns an array of lists of different lengths that contain no np.nans, with missing values returned as np.nan scalars.)
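To make the difference between the two result shapes concrete, here is a pure-Python sketch of a padding step of this kind. It is not the pandas implementation: `pad_rows` is a hypothetical helper, `math.nan` stands in for np.nan, and the rule "a row whose first value is NaN is a missing row" is an assumption inferred from the description above.

```python
import math

NAN = math.nan  # stand-in for np.nan in this sketch


def pad_rows(rows):
    """Hypothetical padding step (assumed, not copied from pandas).

    Scalars are wrapped in a one-element list, then any row that is empty or
    starts with NaN is repeated until it matches the longest row.
    """
    wrapped = [row if isinstance(row, (list, tuple)) else [row] for row in rows]
    if not wrapped:
        return wrapped
    max_len = max(len(row) for row in wrapped)
    return [
        list(row) * max_len if len(row) == 0 or row[0] is NAN else list(row)
        for row in wrapped
    ]


# split-like result: lists without NaN inside, missing rows as NaN scalars.
split_like = [["a", "b", "c"], NAN, ["d", "e"]]
padded_split = pad_rows(split_like)
print(padded_split)  # the NaN scalar becomes a row of three NaNs

# extract-like result: NaN may be the first value of an otherwise valid row.
extract_like = [["a", "1"], [NAN, "2"]]
padded_extract = pad_rows(extract_like)
print(padded_extract)  # the second row is wrongly repeated to four values
```

Under these assumptions the heuristic is safe for split-like input, where NaN only ever appears as a whole-row scalar, and it corrupts extract-like input, where a leading NaN does not mean the whole row is missing.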

Some further gains can also be achieved for partition by bypassing the conversion of np.nan to [np.nan] (for StringArray/ArrowStringArray), since the DataFrame constructor does not need this for a list of tuples. (str.partition returns an array of tuples; str.split returns an array of lists.) #41507
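The element types mentioned above can be seen with Python's built-in string methods, which the sketch below applies element by element. `map_with_missing` and `wrap_scalars` are hypothetical helpers for illustration, `None` marks a missing input, and `math.nan` stands in for np.nan. The sketch only shows the shapes involved and does not exercise the DataFrame constructor.

```python
import math

NAN = math.nan  # stand-in for np.nan in this sketch


def map_with_missing(func, values):
    """Apply func to each string; missing inputs (None) become NaN scalars."""
    return [NAN if value is None else func(value) for value in values]


def wrap_scalars(rows):
    """The NaN -> [NaN] conversion discussed above, as a hypothetical helper."""
    return [row if isinstance(row, (list, tuple)) else [row] for row in rows]


values = ["a_b", None, "c"]

# str.partition always yields a 3-tuple, even when the separator is absent.
partitioned = map_with_missing(lambda s: s.partition("_"), values)

# str.split yields lists whose length depends on the input.
split = map_with_missing(lambda s: s.split("_"), values)

print(partitioned)
print(split)
print(wrap_scalars(partitioned))
```

Because every non-missing partition result already has exactly three elements, only the missing rows differ in shape, which is the step the comment above proposes to skip.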

@simonjayhawkins simonjayhawkins added the Performance (Memory or execution speed performance) and Strings (String extension data type and string data) labels May 19, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3 milestone May 19, 2021
@jreback jreback merged commit 34b0bdc into pandas-dev:master May 21, 2021
@simonjayhawkins simonjayhawkins deleted the bypass branch May 21, 2021 16:55