-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[ArrowStringArray] REF: str.extract dispatch to array #41372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
doing this affects performance, will keep trying.
|
IIRC, the latest commit showed improvements, but since there have been perf improvements on master #41490, we are seeing worse perf. I doesn't seem clean, but the best way forward maybe to use python list comps for object dtypes instead of _str_map (which wraps map_infer_mask) or update _str_map to use list comps for object dtype.
|
just to be clear, improvements over the previous commit, not master. |
@simonjayhawkins this PR doesn't add the use of a pyarrow.compute function? (I think |
I haven't been fully following this PR up to now and the precursors you did, but can you try to explain again what the concern is here / why some benchmarks become slower (by refactoring part to dispatch to the array method) |
see op. will maybe add here if diff is reduced.
it's not the dispatch to array as such, but trying to use _str_map and _wrap_result cleanly and avoid union return types in the array function. _str_map doesn't allow assigning a list as the scalar and is slower than a list comp these DO NOT stop us dispatching to the array. probably the next PR now that #41539 is merged will do that. |
once #41541 is merged and _str_extract is moved to the array (following #41539 it's not much more than a move and add convert parameter to _str_map... unlike other string methods str.extract always returns a string-like array) the diff here should reduce considerably and the nits should be more obvious. As much as the union return type bugs me, a perf regression bugs me more. |
will close till then to help clear the queue |
marked as draft since it may be better to split the perf improvements from the dispatch to array refactor and also may be better to finish #41085 first.
also need to complete the parameterisation of the extract tests.
also maybe able to reduce the _wrap_result changes by passing a 2d array without expand kwarg so that expand is inferred and the code removed here will be bypassed. so will look into that before marking as ready for review.
if I can reduce the diff here (even if only by doing the test parameterisation as a precursor), then may be worth adding the pyarrow native implementation here also.