[ArrowStringArray] PERF: str.extract object fallback #41490
simonjayhawkins
commented
May 15, 2021
pandas/core/strings/accessor.py
Outdated
if not isinstance(x, str):
    return np.nan
Once we dispatch to the array we can remove this isinstance check, but it will probably make only a small difference, maybe 2-3%.
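As a rough illustration of the check under discussion, the per-element function might look like the sketch below. `make_extractor` and its `guard` flag are hypothetical names invented for this example, not pandas API. The guard only matters if non-string values can reach the function; if the caller masks missing values first, it can be dropped.

```python
import math
import re


def make_extractor(pattern, guard=True):
    """Return a function that extracts the first capture group, or NaN.

    Hypothetical helper for illustration only; not part of pandas.
    """
    regex = re.compile(pattern)

    def first_group_or_na(x):
        # Guard against missing values (None, NaN) reaching the regex.
        if guard and not isinstance(x, str):
            return math.nan
        m = regex.search(x)
        return m.group(1) if m else math.nan

    return first_group_or_na


guarded = make_extractor(r"(\w*)A")
unguarded = make_extractor(r"(\w*)A", guard=False)

# On string input both variants agree; only the guarded one tolerates None.
print(guarded("fooA"), unguarded("fooA"))
print(guarded(None))
```

Calling `unguarded(None)` would raise a `TypeError`, which is why the guard can only go away once missing values are filtered out before the function is applied.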
pandas/core/strings/accessor.py
Outdated
mask = isna(arr)
result = lib.map_infer_mask(
    np.asarray(arr), groups_or_na, mask.view(np.uint8), convert=False
)
# special case since extract on an object array converts None to np.nan
# will be handled by _str_map array method
if is_object_dtype(result_dtype):
    result[mask] = np.nan
This is effectively the _str_map array method.
_str_map is not part of the BaseStringArrayMethods interface, so we shouldn't call it from here, though that is maybe no big deal since it is internal.
The _str_map method calls maybe_convert_objects, so we would need to add an extra keyword so that we always return object (or string) arrays.
I have dumbed down the changes to just the materialization of the array. Iteration on a pyarrow array is the main bottleneck here.
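To make the shape of that fallback concrete, here is a pure-Python sketch of a masked map: the input is materialized once into a list, missing slots are skipped, and they are filled with NaN in the result. `masked_map` is a hypothetical stand-in written for illustration. It is not `lib.map_infer_mask` or `_str_map`, and it says nothing about their performance.

```python
import math


def masked_map(values, func, na_value=math.nan):
    """Apply func to non-missing elements; fill missing slots with na_value.

    Hypothetical helper for illustration only; not part of pandas.
    """
    # Materialize once so later passes do not re-iterate the source container.
    materialized = list(values)
    # Treat None and float NaN as missing.
    mask = [
        v is None or (isinstance(v, float) and math.isnan(v))
        for v in materialized
    ]
    # Call func only where the mask is False; masked slots get na_value.
    return [
        na_value if is_na else func(v)
        for v, is_na in zip(materialized, mask)
    ]


result = masked_map(["a1", None, "b2"], str.upper)
```

Because `func` never sees a masked element, it does not need its own check for non-string input.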
can you rebase
asv_bench/benchmarks/strings.py
Outdated
@@ -248,6 +248,24 @@ def time_rsplit(self, dtype, expand):
        self.s.str.rsplit("--", expand=expand)


class Extract:

    params = (["str", "string", "arrow_string"], [True, False])
prob should make a base class to have these params
and include setup there
good idea.
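One way the suggested base class could look is sketched below. This is an assumption about the shape of the refactor, not the code that was merged: the class name `Dtypes`, the sample data, and the benchmark method name are all invented for the example, and the dtype names are copied from the diff above.

```python
class Dtypes:
    # Dtype names as they appear in the diff; "arrow_string" was the name
    # used at the time and may not exist in other pandas versions.
    params = ["str", "string", "arrow_string"]
    param_names = ["dtype"]

    def setup(self, dtype):
        data = [f"{i:06d}A" for i in range(10**4)]
        try:
            # Imported lazily so the class definitions load without pandas.
            import pandas as pd

            self.s = pd.Series(data, dtype=dtype)
        except (ImportError, TypeError):
            # asv skips a benchmark whose setup raises NotImplementedError.
            raise NotImplementedError


class Extract(Dtypes):
    # Reuse the shared dtype list and add the benchmark-specific parameter.
    params = (Dtypes.params, [True, False])
    param_names = ["dtype", "expand"]

    def setup(self, dtype, expand):
        super().setup(dtype)

    def time_extract_single_group(self, dtype, expand):
        self.s.str.extract(r"(\w*)A", expand=expand)
```

With the dtype list and `setup` in one place, each string benchmark class only declares its extra parameters and forwards `dtype` to the shared setup.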
very nice