Skip to content

REF/PERF: ArrowStringArray.__setitem__ #46400

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Mar 18, 2022

Conversation

lukemanley
Copy link
Member

@lukemanley lukemanley commented Mar 17, 2022

This PR improves the performance of ArrowStringArray.__setitem__ and avoids some existing behavior that can lead to "over-chunking" of the underlying pyarrow ChunkedArray as well as exponential performance.

import pandas as pd
import pandas._testing as tm
import pyarrow as pa

ca = pa.chunked_array([
    tm.rands_array(3, 1_000),
    tm.rands_array(3, 1_000),
    tm.rands_array(3, 1_000),
])

num_chunks = []
for n in [1000, 2000, 3000]:
  arr = pd.arrays.ArrowStringArray(ca)
  %timeit arr[:n] = "foo"
  num_chunks.append(arr._data.num_chunks)

print("num_chunks:", num_chunks)
# main
1.72 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
7.23 s ± 761 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
17.5 s ± 1.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
num_chunks: [1002, 2001, 3001]
# PR
88.5 µs ± 1.12 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
145 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
199 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
num_chunks: [3, 3, 3]

ASVs added:

       before           after         ratio
     <main>           <arrow-setitem>
-      42.6±0.7ms       28.7±0.6ms     0.67  array.ArrowStringArray.time_setitem(True)
-      22.5±0.3ms       9.86±0.2ms     0.44  array.ArrowStringArray.time_setitem(False)
-     6.84±0.06ms          430±6μs     0.06  array.ArrowStringArray.time_setitem_list(False)
-      17.4±0.1ms          403±5μs     0.02  array.ArrowStringArray.time_setitem_list(True)
-      1.10±0.01s      3.51±0.04ms     0.00  array.ArrowStringArray.time_setitem_slice(True)
-      1.05±0.02s          259±3μs     0.00  array.ArrowStringArray.time_setitem_slice(False)

@jbrockmendel
Copy link
Member

Does this do anything to help with #45419?

@@ -1,7 +1,10 @@
import numpy as np
import pyarrow as pa
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if the policy should change, but AFAIK we guard the pyarrow import in the other benchmarks as it's an optional dependency and raise NotImplementedError so that the benchmarks get skipped when pyarrow not installed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Thanks for pointing that out.

@lukemanley
Copy link
Member Author

Does this do anything to help with #45419?

No, it does not. If it is decided to make these immutable, then this isn't needed.

@lukemanley lukemanley added ExtensionArray Extending pandas with custom dtypes or arrays. Arrow pyarrow functionality labels Mar 17, 2022
@jreback jreback added this to the 1.5 milestone Mar 17, 2022
elif isna(value):
value_is_scalar = is_scalar(value)

# NA -> None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would create a helper method like _validate_key() to encapsulate all of this (ok on this class for now, but we likey want to push this to the ArrowExtensionArray (or maybe we need a ArrowIndexingMixin or similar), that can be later (or here if convenient).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored pretty extensively, this logic is now self contained

if not value_is_scalar:
value = value[np.argsort(key)]

# fast path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing here would do something along the lines of

if can_fast_path(key):
      return self.set_with_fast_path(....)

return self.set_via_chunk_iteration()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jreback jreback merged commit ec3eedd into pandas-dev:main Mar 18, 2022
@jreback
Copy link
Contributor

jreback commented Mar 18, 2022

thanks @lukemanley very nice!

would take a PR to pushed a lot of these indexing methods to a mixin / int of ArrowExtensionArray as @mroeschke is planning on using chunked arrays for backing a new numeric arrow extension array.

@lukemanley
Copy link
Member Author

would take a PR to pushed a lot of these indexing methods to a mixin / int of ArrowExtensionArray as @mroeschke is planning on using chunked arrays for backing a new numeric arrow extension array.

@jreback - sounds good. Would you suggest a separate mixin for the generic indexing methods or would you just add those to ArrowExtensionArray?

@mroeschke
Copy link
Member

IMO if this indexing code is generally applicable to a pyarrow chunked array with any dtype I think it would be best located on ArrowExtensionArray

@jreback
Copy link
Contributor

jreback commented Mar 18, 2022

cc @jbrockmendel @mroeschke for comments on that

prob ok to add directly to ArrowExtensionArray i think

@lukemanley lukemanley deleted the arrowstringarray-setitem branch March 20, 2022 23:18
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants