-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF/PERF: ArrowStringArray.__setitem__ #46400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Does this do anything to help with #45419? |
asv_bench/benchmarks/array.py
Outdated
@@ -1,7 +1,10 @@ | |||
import numpy as np | |||
import pyarrow as pa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if the policy should change, but AFAIK we guard the pyarrow import in the other benchmarks as it's an optional dependency and raise NotImplementedError so that the benchmarks get skipped when pyarrow not installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Thanks for pointing that out.
No, it does not. If it is decided to make these immutable, then this isn't needed. |
pandas/core/arrays/string_arrow.py
Outdated
elif isna(value): | ||
value_is_scalar = is_scalar(value) | ||
|
||
# NA -> None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would create a helper method like _validate_key()
to encapsulate all of this (ok on this class for now, but we likey want to push this to the ArrowExtensionArray (or maybe we need a ArrowIndexingMixin or similar), that can be later (or here if convenient).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored pretty extensively, this logic is now self contained
pandas/core/arrays/string_arrow.py
Outdated
if not value_is_scalar: | ||
value = value[np.argsort(key)] | ||
|
||
# fast path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same thing here would do something along the lines of
if can_fast_path(key):
return self.set_with_fast_path(....)
return self.set_via_chunk_iteration()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
thanks @lukemanley very nice! would take a PR to pushed a lot of these indexing methods to a mixin / int of ArrowExtensionArray as @mroeschke is planning on using chunked arrays for backing a new numeric arrow extension array. |
@jreback - sounds good. Would you suggest a separate mixin for the generic indexing methods or would you just add those to ArrowExtensionArray? |
IMO if this indexing code is generally applicable to a pyarrow chunked array with any dtype I think it would be best located on ArrowExtensionArray |
cc @jbrockmendel @mroeschke for comments on that prob ok to add directly to ArrowExtensionArray i think |
doc/source/whatsnew/v1.5.0.rst
file if fixing a bug or adding a new feature.This PR improves the performance of
ArrowStringArray.__setitem__
and avoids some existing behavior that can lead to "over-chunking" of the underlying pyarrowChunkedArray
as well as exponential performance.ASVs added: