PERF: ArrowExtensionArray.__iter__ #49825
Conversation
Iterate over elements of the array.
"""
na_value = self._dtype.na_value
for value in self._data:
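For context, here is a runnable sketch that mirrors what the loop above is doing, under the assumption that each pyarrow scalar is converted with as_py() and nulls are mapped to the dtype's NA sentinel (iter_arrow and its arguments are hypothetical names, not part of the PR):

import pyarrow as pa

def iter_arrow(chunked: pa.ChunkedArray, na_value=None):
    # mirrors the __iter__ loop above: pyarrow scalars -> Python objects,
    # with nulls mapped to the dtype's NA sentinel
    for value in chunked:
        val = value.as_py()
        yield na_value if val is None else val

list(iter_arrow(pa.chunked_array([[1, None, 3]]), na_value=float("nan")))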
Is there any benefit to combining the chunks of self._data and then iterating?
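For concreteness, the variant being asked about would look roughly like this inside __iter__, assuming ChunkedArray.combine_chunks() is used to flatten to contiguous storage before looping (illustrative only, not code from the PR):

for value in self._data.combine_chunks():
    # same per-element handling as before, just over a single flattened array
    ...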
Maybe a 5-10% improvement testing locally, but not much. It seems to be 5-10% whether there are 10 chunks or 10,000 chunks. I can add it if you want.
Does the 5-10% hold if there is a low number of chunks (1, 2, or 3)? I'd guess that most of the time there shouldn't be a large number of chunks, since most construction goes through pa.chunked_array([scalars]).
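For reference, one way to check how many chunks a pandas Arrow-backed array actually holds, assuming the private _data attribute referenced in this diff is the backing pa.ChunkedArray (the dtype string and values here are arbitrary):

import pandas as pd

arr = pd.array([1, 2, None], dtype="int64[pyarrow]")
# typically a single chunk when constructed from a sequence of scalars
print(arr._data.num_chunks)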
I don't think I'm seeing a consistent improvement at a low number of chunks.
If you're curious, here is the snippet I'm testing with:
import numpy as np
import pyarrow as pa

def time_iter(arr):
    # iterate over every element, discarding the values
    for v in arr:
        pass

num_chunks = 2
chunk_size = 10**5
data = [np.arange(chunk_size)] * num_chunks
arr = pa.chunked_array(data)

%timeit time_iter(arr)
%timeit time_iter(arr.combine_chunks())
IIUC doing combine_chunks will cause a copy, so any speed improvement would be offset by the memory increase. If so, I don't think it's worth it.
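A rough way to see that extra allocation, assuming the default pyarrow memory pool tracks the buffer created by combine_chunks (the numpy-backed chunks themselves are wrapped without copying, so the delta is roughly one full copy of the data):

import numpy as np
import pyarrow as pa

arr = pa.chunked_array([np.arange(10**5)] * 10)

before = pa.total_allocated_bytes()
combined = arr.combine_chunks()  # concatenates all chunks into contiguous storage
after = pa.total_allocated_bytes()

print(arr.nbytes)       # size of the original data
print(after - before)   # roughly the same again, i.e. an extra copy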
Thanks @lukemanley!
* faster ArrowExtensionArray.__iter__
* gh ref
doc/source/whatsnew/v2.0.0.rst
Faster iteration through pyarrow-backed arrays.