PERF: ArrowExtensionArray.__iter__ #49825

Merged · 3 commits · Nov 22, 2022

1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
@@ -600,6 +600,7 @@ Performance improvements
- Performance improvement in :meth:`DataFrame.join` when joining on a subset of a :class:`MultiIndex` (:issue:`48611`)
- Performance improvement for :meth:`MultiIndex.intersection` (:issue:`48604`)
- Performance improvement in ``var`` for nullable dtypes (:issue:`48379`).
- Performance improvement when iterating over a :class:`~arrays.ArrowExtensionArray` (:issue:`49825`).
- Performance improvements to :func:`read_sas` (:issue:`47403`, :issue:`47405`, :issue:`47656`, :issue:`48502`)
- Memory improvement in :meth:`RangeIndex.sort_values` (:issue:`48801`)
- Performance improvement in :class:`DataFrameGroupBy` and :class:`SeriesGroupBy` when ``by`` is a categorical type and ``sort=False`` (:issue:`48976`)
13 changes: 13 additions & 0 deletions pandas/core/arrays/arrow/array.py
@@ -13,6 +13,7 @@
    ArrayLike,
    Dtype,
    FillnaOptions,
    Iterator,
    PositionalIndexer,
    SortKind,
    TakeIndexer,
@@ -334,6 +335,18 @@ def __getitem__(self, item: PositionalIndexer):
        else:
            return scalar

    def __iter__(self) -> Iterator[Any]:
        """
        Iterate over elements of the array.
        """
        na_value = self._dtype.na_value
        for value in self._data:
            val = value.as_py()
            if val is None:
                yield na_value
            else:
                yield val

Review discussion on the iteration loop:

Member: Is there any benefit to combining the chunks of self._data and then iterating?

Member Author: Maybe a 5-10% improvement testing locally, but not much. It seems to be 5-10% whether there are 10 chunks or 10000 chunks. I can add it if you want.

Member (@mroeschke, Nov 22, 2022): Does the 5-10% hold if there is a low number of chunks (1, 2, or 3)? I'd guess that most of the time there shouldn't be a large number of chunks, since most construction is pa.chunked_array([scalars]).

Member Author: I don't think I'm seeing a consistent improvement at a low number of chunks. If curious, here is the snippet I'm testing with:

import numpy as np
import pyarrow as pa

def time_iter(arr):
    for v in arr:
        pass

num_chunks = 2
chunk_size = 10**5
data = [np.arange(chunk_size)] * num_chunks
arr = pa.chunked_array(data)

%timeit time_iter(arr)
%timeit time_iter(arr.combine_chunks())

Member: IIUC, combine_chunks will make a copy, so any speed improvement would be offset by the increase in memory. If so, I think it's not worth it.
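
To illustrate the new iteration behavior, a minimal sketch (assuming a pandas build with pyarrow available; the dtype string and values are illustrative, not part of this PR):

import pandas as pd

# int64[pyarrow] is backed by ArrowExtensionArray; None becomes a null
arr = pd.array([1, None, 3], dtype="int64[pyarrow]")

# __iter__ converts each pyarrow scalar with as_py() and substitutes the
# dtype's na_value (pd.NA) for nulls rather than yielding None
print(list(arr))  # [1, <NA>, 3]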

    def __arrow_array__(self, type=None):
        """Convert myself to a pyarrow ChunkedArray."""
        return self._data
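
As a quick check of the round trip, a sketch (again assuming pyarrow is installed; this simply calls the method shown above):

import pandas as pd
import pyarrow as pa

arr = pd.array([1, None, 3], dtype="int64[pyarrow]")

# __arrow_array__ returns the array's underlying ChunkedArray itself,
# so no copy is made on conversion back to pyarrow
chunked = arr.__arrow_array__()
print(isinstance(chunked, pa.ChunkedArray))  # True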