REF/PERF: ArrowStringArray.__setitem__ #46400

lukemanley · 2022-03-17T03:49:06Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/v1.5.0.rst file if fixing a bug or adding a new feature.

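This PR improves the performance of ArrowStringArray.__setitem__ and avoids existing behavior that can lead to "over-chunking" of the underlying pyarrow ChunkedArray (the chunk count growing with the number of elements set), as well as run time that grows much faster than linearly with the number of elements set.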

import pandas as pd
import pandas._testing as tm
import pyarrow as pa

ca = pa.chunked_array([
    tm.rands_array(3, 1_000),
    tm.rands_array(3, 1_000),
    tm.rands_array(3, 1_000),
])

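num_chunks = []
for n in [1000, 2000, 3000]:
    # start from a fresh 3-chunk array for each slice length
    arr = pd.arrays.ArrowStringArray(ca)
    # %timeit is an IPython magic; run this snippet in IPython or Jupyter
    %timeit arr[:n] = "foo"
    # record how many chunks back the array after the assignment
    num_chunks.append(arr._data.num_chunks)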

print("num_chunks:", num_chunks)

# main
1.72 s ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
7.23 s ± 761 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
17.5 s ± 1.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
num_chunks: [1002, 2001, 3001]

# PR
88.5 µs ± 1.12 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
145 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
199 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
num_chunks: [3, 3, 3]
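
To illustrate what "over-chunking" means, here is a minimal pure-Python sketch. It is a toy model only: chunks are plain lists, and the helper names `setitem_by_splitting` and `setitem_slice_rebuild` are hypothetical, not pandas or pyarrow functions. It assumes the old behavior resembles splitting a chunk around every element that is set, which is a guess based on the chunk counts reported above rather than a reading of the actual code.

```python
def setitem_by_splitting(chunks, index, value):
    """Replace one element by splitting its chunk into up to three pieces."""
    out = []
    offset = 0
    for chunk in chunks:
        stop = offset + len(chunk)
        if offset <= index < stop:
            local = index - offset
            pieces = [chunk[:local], [value], chunk[local + 1:]]
            out.extend(p for p in pieces if p)  # drop empty pieces
        else:
            out.append(chunk)
        offset = stop
    return out


def setitem_slice_rebuild(chunks, stop, value):
    """Set positions [0, stop) by rebuilding each chunk exactly once."""
    out = []
    offset = 0
    for chunk in chunks:
        # number of leading elements of this chunk that fall inside [0, stop)
        n_replace = min(max(stop - offset, 0), len(chunk))
        out.append([value] * n_replace + chunk[n_replace:])
        offset += len(chunk)
    return out


chunks = [["a"] * 10, ["b"] * 10, ["c"] * 10]

# Element-by-element: every assignment rescans the chunks and adds a chunk.
split = chunks
for i in range(10):
    split = setitem_by_splitting(split, i, "foo")

# One pass over the chunks: the chunk layout is preserved.
rebuilt = setitem_slice_rebuild(chunks, 10, "foo")

print(len(split), len(rebuilt))
```

In this model the element-by-element approach ends with one chunk per assigned element plus the untouched chunks, and each assignment has to walk an ever longer chunk list. That would be consistent with the superlinear timings above, although the real cause in pandas may differ in detail.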

ASVs added:

       before           after         ratio
     <main>           <arrow-setitem>
-      42.6±0.7ms       28.7±0.6ms     0.67  array.ArrowStringArray.time_setitem(True)
-      22.5±0.3ms       9.86±0.2ms     0.44  array.ArrowStringArray.time_setitem(False)
-     6.84±0.06ms          430±6μs     0.06  array.ArrowStringArray.time_setitem_list(False)
-      17.4±0.1ms          403±5μs     0.02  array.ArrowStringArray.time_setitem_list(True)
-      1.10±0.01s      3.51±0.04ms     0.00  array.ArrowStringArray.time_setitem_slice(True)
-      1.05±0.02s          259±3μs     0.00  array.ArrowStringArray.time_setitem_slice(False)

jbrockmendel · 2022-03-17T04:07:23Z

Does this do anything to help with #45419?

simonjayhawkins · 2022-03-17T11:21:14Z

asv_bench/benchmarks/array.py

@@ -1,7 +1,10 @@
 import numpy as np
+import pyarrow as pa


not sure if the policy should change, but AFAIK we guard the pyarrow import in the other benchmarks, as it's an optional dependency, and raise NotImplementedError so that the benchmarks get skipped when pyarrow is not installed.

Updated. Thanks for pointing that out.
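
For readers unfamiliar with the pattern, a guarded optional import in a benchmark could look roughly like the sketch below. The class name `ArrowSetitemSketch` and the helper `optional_import` are hypothetical and are not the code that was committed; the sketch relies only on the asv convention that raising NotImplementedError in `setup` marks the benchmark as skipped.

```python
import importlib


def optional_import(name):
    """Return the named module, or None if it cannot be imported."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


class ArrowSetitemSketch:
    """Hypothetical asv-style benchmark that needs pyarrow."""

    def setup(self):
        pa = optional_import("pyarrow")
        if pa is None:
            # asv treats NotImplementedError raised in setup as "skip"
            raise NotImplementedError("pyarrow is not installed")
        self.data = pa.chunked_array([["a", "b"], ["c"]])

    def time_num_chunks(self):
        # trivial placeholder workload for the sketch
        self.data.num_chunks
```

The import happens inside `setup` rather than at module level, so importing the benchmark module never fails when pyarrow is absent.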

lukemanley · 2022-03-17T15:09:12Z

Does this do anything to help with #45419?

No, it does not. If it is decided to make these immutable, then this isn't needed.

jreback · 2022-03-18T00:26:24Z

pandas/core/arrays/string_arrow.py

-            elif isna(value):
+        value_is_scalar = is_scalar(value)
+
+        # NA -> None


i would create a helper method like _validate_key() to encapsulate all of this (ok on this class for now, but we likely want to push this to the ArrowExtensionArray, or maybe we need an ArrowIndexingMixin or similar; that can be later, or here if convenient).

I refactored pretty extensively, this logic is now self contained
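
As a rough idea of what a self-contained validation helper might cover, the sketch below normalizes the value being assigned: NA-like scalars become None, strings pass through, and anything else is rejected. The names `is_na_scalar` and `normalize_setitem_value` and the error messages are assumptions made for illustration; they are not taken from the refactored pandas code.

```python
import math


def is_na_scalar(value):
    """Treat None and float NaN as missing (a simplification of pandas' isna)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def normalize_setitem_value(value):
    """Hypothetical helper: map NA to None and check that values are strings."""
    if isinstance(value, str):
        return value
    if is_na_scalar(value):
        return None  # NA -> None
    if isinstance(value, (list, tuple)):
        out = []
        for item in value:
            if is_na_scalar(item):
                out.append(None)
            elif isinstance(item, str):
                out.append(item)
            else:
                raise ValueError("list-like values must contain only str or NA")
        return out
    raise ValueError("scalar value must be str or NA")


print(normalize_setitem_value(["a", None, float("nan")]))
```

Keeping these checks in one place means the assignment logic itself only ever sees strings and None.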

jreback · 2022-03-18T00:27:53Z

pandas/core/arrays/string_arrow.py

+                if not value_is_scalar:
+                    value = value[np.argsort(key)]
+
+        # fast path


same thing here would do something along the lines of

if can_fast_path(key):
    return self.set_with_fast_path(....)
return self.set_via_chunk_iteration()
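
The dispatch suggested here can be sketched on a toy container. Everything below is hypothetical: `ToyChunkedStrings` stores chunks as mutable Python lists (real Arrow arrays are immutable), and treating a single integer key as the fast path is an assumption for the example, not the criterion used in the PR. The general path sorts the keys and reorders the values to match, in the spirit of the `np.argsort(key)` line in the diff above, then walks the chunks once.

```python
class ToyChunkedStrings:
    """Toy stand-in for a chunked string array (not a pandas class)."""

    def __init__(self, chunks):
        self.chunks = [list(c) for c in chunks]

    def __len__(self):
        return sum(len(c) for c in self.chunks)

    def to_list(self):
        return [v for c in self.chunks for v in c]

    def __setitem__(self, key, value):
        if self._can_fast_path(key):
            return self._set_with_fast_path(key, value)
        return self._set_via_chunk_iteration(key, value)

    @staticmethod
    def _can_fast_path(key):
        # Assumption for this sketch: a single integer touches one chunk only.
        return isinstance(key, int)

    def _set_with_fast_path(self, key, value):
        if key < 0:
            key += len(self)
        if not 0 <= key < len(self):
            raise IndexError("index out of bounds")
        for chunk in self.chunks:
            if key < len(chunk):
                chunk[key] = value
                return
            key -= len(chunk)

    def _set_via_chunk_iteration(self, key, value):
        # key: list of non-negative ints; value: scalar or list of equal length
        positions = list(key)
        if isinstance(value, str) or value is None:
            values = [value] * len(positions)
        else:
            values = list(value)
            if len(values) != len(positions):
                raise ValueError("length of key and value must match")
        # Sort the keys and reorder the values so they stay aligned.
        order = sorted(range(len(positions)), key=positions.__getitem__)
        positions = [positions[i] for i in order]
        values = [values[i] for i in order]
        if positions and (positions[0] < 0 or positions[-1] >= len(self)):
            raise IndexError("index out of bounds")
        new_chunks, offset, cursor = [], 0, 0
        for chunk in self.chunks:
            stop = offset + len(chunk)
            new_chunk = list(chunk)
            # Consume every sorted position that falls inside this chunk.
            while cursor < len(positions) and positions[cursor] < stop:
                new_chunk[positions[cursor] - offset] = values[cursor]
                cursor += 1
            new_chunks.append(new_chunk)
            offset = stop
        self.chunks = new_chunks


arr = ToyChunkedStrings([["a", "b"], ["c", "d"], ["e"]])
arr[3] = "x"                        # fast path
arr[[4, 0]] = ["last", "first"]     # general path with unsorted keys
print(arr.to_list(), len(arr.chunks))
```

Because each chunk is rebuilt at most once, the number of chunks stays the same no matter how many positions are assigned.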

jreback · 2022-03-18T16:18:15Z

thanks @lukemanley very nice!

would take a PR to push a lot of these indexing methods to a mixin / into ArrowExtensionArray, as @mroeschke is planning on using chunked arrays for backing a new numeric arrow extension array.

lukemanley · 2022-03-18T21:15:22Z

would take a PR to push a lot of these indexing methods to a mixin / into ArrowExtensionArray, as @mroeschke is planning on using chunked arrays for backing a new numeric arrow extension array.

@jreback - sounds good. Would you suggest a separate mixin for the generic indexing methods or would you just add those to ArrowExtensionArray?

mroeschke · 2022-03-18T22:21:20Z

IMO if this indexing code is generally applicable to a pyarrow chunked array with any dtype I think it would be best located on ArrowExtensionArray

jreback · 2022-03-18T22:21:54Z

cc @jbrockmendel @mroeschke for comments on that

prob ok to add directly to ArrowExtensionArray i think

lukemanley added 2 commits March 16, 2022 23:32

ArrowStringArray.__setitem__

e379a22

Merge remote-tracking branch 'upstream/main' into arrowstringarray-se…

e21c4ff

…titem

simonjayhawkins reviewed Mar 17, 2022

fixes

0e35f6a

lukemanley added the ExtensionArray (Extending pandas with custom dtypes or arrays.) and Arrow (pyarrow functionality) labels Mar 17, 2022

lukemanley added 2 commits March 17, 2022 12:43

whatsnew

f292054

fix test

773f375

jreback added this to the 1.5 milestone Mar 17, 2022

jreback requested changes Mar 18, 2022

lukemanley added 2 commits March 18, 2022 00:54

refactor

f44bcbb

fix docstring

76a25a9

jreback approved these changes Mar 18, 2022

jreback merged commit ec3eedd into pandas-dev:main Mar 18, 2022

lukemanley deleted the arrowstringarray-setitem branch March 20, 2022 23:18

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

REF/PERF: ArrowStringArray.__setitem__ (pandas-dev#46400)

fd47192
