[ArrowStringArray] PERF: str.extract object fallback #41490

Merged
merged 4 commits into from
May 17, 2021

Conversation

simonjayhawkins
Member

       before           after         ratio
     [40482273]       [1788ffe6]
     <master>         <extract-single-group>
-         134±2ms          106±1ms     0.79  strings.Extract.time_extract_single_group('string', False)
-         359±5ms        110±0.8ms     0.31  strings.Extract.time_extract_single_group('arrow_string', False)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
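
For reading the table: the ratio column is the `after` time divided by the `before` time, so values below 1 are speedups. A minimal sketch using the rounded means printed above; asv works from the unrounded timings, so recomputing from the displayed values can differ in the last digit.

```python
# Rounded mean timings in milliseconds, copied from the benchmark output above
# as (before, after) pairs.
timings_ms = {
    "string": (134, 106),
    "arrow_string": (359, 110),
}

# ratio = after / before; below 1.0 means the branch is faster than master.
ratios = {
    dtype: round(after / before, 2)
    for dtype, (before, after) in timings_ms.items()
}
print(ratios)
```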

simonjayhawkins added the Performance (memory or execution speed performance) and Strings (string extension data type and string data) labels on May 15, 2021
Comment on lines 3040 to 3041
if not isinstance(x, str):
    return np.nan
Member Author

Once we dispatch to the array, we can remove this isinstance check, but that will probably make only a small difference, maybe 2-3%.

Comment on lines 3116 to 3123
mask = isna(arr)
result = lib.map_infer_mask(
    np.asarray(arr), groups_or_na, mask.view(np.uint8), convert=False
)
# special case since extract on an object array converts None to np.nan
# will be handled by _str_map array method
if is_object_dtype(result_dtype):
    result[mask] = np.nan
Member Author

@simonjayhawkins simonjayhawkins May 15, 2021

This is effectively the _str_map array method.

_str_map is not part of the BaseStringArrayMethods interface, so we shouldn't call it from here, though that may be no big deal since it is internal.

The _str_map method calls maybe_convert_objects, so we would need to add an extra keyword so that we always return object (or string) arrays.
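
The masked-map pattern discussed here can be sketched in plain Python. This is only a stand-in for the idea (compute the missing-value mask once, apply the function to the remaining elements, fill masked positions with NaN), not pandas' `lib.map_infer_mask` or `_str_map`; the names `masked_map` and `first_group` are hypothetical. It also shows why the per-element `isinstance` guard becomes unnecessary once missing values are handled by the mask.

```python
import math
import re


def masked_map(values, func, mask, na_value=math.nan):
    """Apply func to unmasked elements; masked positions get na_value."""
    return [na_value if is_na else func(v) for v, is_na in zip(values, mask)]


regex = re.compile(r"[a-z](\d)")


def first_group(x):
    # No isinstance(x, str) guard needed: masked_map never passes
    # missing values to this function.
    m = regex.search(x)
    return m.group(1) if m else math.nan


values = ["a1", None, "b2", "zz"]
mask = [v is None for v in values]  # computed once, up front
result = masked_map(values, first_group, mask)
print(result)
```

The result is always a plain list of objects here, which loosely mirrors the requirement above that the fallback return object (or string) arrays rather than an inferred dtype.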

@simonjayhawkins simonjayhawkins marked this pull request as draft May 16, 2021 08:10
@simonjayhawkins
Member Author

I have pared the changes down to just the materialization of the array. Iterating over a pyarrow array is the main bottleneck here.

       before           after         ratio
     [40482273]       [79ff0d3e]
     <master>         <extract-single-group>
-         140±2ms          114±2ms     0.82  strings.Methods.time_extract('string')
-       134±0.9ms          107±2ms     0.80  strings.Extract.time_extract_single_group('string', False)
-         139±2ms        108±0.9ms     0.77  strings.Extract.time_extract_single_group('string', True)
-         356±3ms          118±2ms     0.33  strings.Extract.time_extract_single_group('arrow_string', False)
-        367±10ms          120±1ms     0.33  strings.Methods.time_extract('arrow_string')
-         360±2ms          112±1ms     0.31  strings.Extract.time_extract_single_group('arrow_string', True)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@simonjayhawkins simonjayhawkins changed the title [ArrowStringArray] PERF: single regex group with expand=False [ArrowStringArray] PERF: str.extract object fallback May 16, 2021
@simonjayhawkins simonjayhawkins marked this pull request as ready for review May 16, 2021 19:57
@jreback jreback added this to the 1.3 milestone May 17, 2021
Contributor

@jreback jreback left a comment

can you rebase

@@ -248,6 +248,24 @@ def time_rsplit(self, dtype, expand):
        self.s.str.rsplit("--", expand=expand)


class Extract:

    params = (["str", "string", "arrow_string"], [True, False])
Contributor

We should probably make a base class to hold these params

Contributor

and include setup there

Member Author

good idea.
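
One way the base-class suggestion in this thread could look, as a rough sketch rather than what was eventually committed. It relies on the asv conventions that `params` and `param_names` are class attributes and that `setup` is called with one parameter combination before each `time_*` method. The base class name `Dtypes`, the `param_names` values, and the list-based stand-in data are assumptions made so the sketch runs without pandas; the real benchmarks would build a Series of the given dtype.

```python
import re


class Dtypes:
    # Hypothetical shared base: the parameter grid and setup live here
    # so each benchmark class does not have to repeat them.
    params = (["str", "string", "arrow_string"], [True, False])
    param_names = ["dtype", "expand"]

    def setup(self, dtype, expand):
        # Stand-in data; a real benchmark would create a Series with `dtype`.
        self.data = [f"{i}--{i + 1}" for i in range(1_000)]


class Extract(Dtypes):
    def time_extract_single_group(self, dtype, expand):
        # Stand-in for self.s.str.extract(...) with a single regex group.
        [re.search(r"(\d+)--", x) for x in self.data]


class Split(Dtypes):
    def time_rsplit(self, dtype, expand):
        # Stand-in for self.s.str.rsplit("--", expand=expand).
        [x.rsplit("--") for x in self.data]
```

Subclasses then only define their `time_*` methods and inherit the parameter grid and data construction.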

@jreback jreback merged commit dc48fb6 into pandas-dev:master May 17, 2021
@jreback
Contributor

jreback commented May 17, 2021

very nice
