
PERF: ArrowExtensionArray.__iter__ #49825


Merged: 3 commits merged into pandas-dev:main from lukemanley:arrow-ea-iter on Nov 22, 2022

Conversation

@lukemanley (Member) commented Nov 22, 2022

Faster iteration through pyarrow-backed arrays.

from asv_bench.benchmarks.strings import Iter

dtype = "string[pyarrow]"
bench = Iter()
bench.setup(dtype)

%timeit bench.time_iter(dtype)

# 266 ms ± 6.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    <- main
# 57.7 ms ± 679 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- PR
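For readers without the asv suite set up, here is a rough standalone equivalent of the timing above. This is a sketch only: it assumes the asv Iter benchmark simply iterates a pyarrow-backed string Series element by element, and the data below is arbitrary.

import pandas as pd

# Hypothetical stand-in for the asv Iter benchmark: build a pyarrow-backed
# string Series and time a plain element-by-element iteration over it.
ser = pd.Series(["some text"] * 100_000, dtype="string[pyarrow]")

%timeit for _ in ser: pass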

@lukemanley added the Performance (Memory or execution speed performance) and Arrow (pyarrow functionality) labels on Nov 22, 2022
Code under review (the new __iter__):

    """
    Iterate over elements of the array.
    """
    na_value = self._dtype.na_value
    for value in self._data:
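The excerpt is cut off at the loop header. A hedged sketch of how the loop body presumably continues, converting each pyarrow scalar with .as_py() and yielding the dtype's NA value for nulls (an approximation, not the verbatim merged diff):

        # continuation sketch (assumption, not the exact merged code):
        val = value.as_py()
        if val is None:
            yield na_value
        else:
            yield val
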
Member
Is there any benefit to combining the chunks of self._data and then iterating?

Member Author

Maybe a 5-10% improvement testing locally, but not much. It seems to be 5-10% whether there are 10 chunks or 10,000 chunks. I can add it if you want?

@mroeschke (Member) Nov 22, 2022

Does the 5-10% hold if there is a low number of chunks (1, 2, or 3)? I'd guess that most of the time there shouldn't be a large number of chunks, since most construction goes through pa.chunked_array([scalars]).

Member Author

I don't think I'm seeing a consistent improvement at a low number of chunks.

If curious, here is the snippet I'm testing with:

import numpy as np
import pyarrow as pa

def time_iter(arr):
    for v in arr:
        pass

num_chunks = 2
chunk_size = 10**5
data = [np.arange(chunk_size)] * num_chunks
arr = pa.chunked_array(data)

%timeit time_iter(arr)
%timeit time_iter(arr.combine_chunks())

Member

IIUC, doing combine_chunks will cause a copy, so any speed improvement would be offset by the memory increase. If so, I think it's not worth it.
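For illustration (not part of the PR), a small pyarrow snippet showing that combine_chunks materializes the chunks into a single contiguous array, i.e. allocates a copy of the data:

import numpy as np
import pyarrow as pa

# Two separately allocated chunks.
arr = pa.chunked_array([np.arange(10**5), np.arange(10**5)])
print(arr.num_chunks)           # 2

# combine_chunks concatenates everything into one contiguous buffer,
# so the data is copied; any iteration speed gain comes at a memory cost.
combined = arr.combine_chunks()
print(len(combined), len(arr))  # same length, but backed by new memory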

@mroeschke mroeschke added this to the 2.0 milestone Nov 22, 2022
@mroeschke mroeschke merged commit a614b7a into pandas-dev:main Nov 22, 2022
@mroeschke (Member)

Thanks @lukemanley!

mliu08 pushed a commit to mliu08/pandas that referenced this pull request Nov 27, 2022
* faster ArrowExtensionArray.__iter__

* gh ref
@lukemanley lukemanley deleted the arrow-ea-iter branch December 20, 2022 00:46
Labels: Arrow (pyarrow functionality), Performance (Memory or execution speed performance)
3 participants