PERF: setting values via df.loc / df.iloc with pyarrow-backed columns #50248


Merged: 8 commits into pandas-dev:main on Dec 17, 2022

Conversation


@lukemanley lukemanley commented Dec 14, 2022

Performance improvement in ArrowExtensionArray.__setitem__ when the key is a null slice. The rationale for adding a fast path here is that, internally, null slices are used in a variety of DataFrame setitem operations. Here is an example:

import pandas as pd
import numpy as np

arr = pd.array(np.arange(10**6), dtype="int64[pyarrow]")
df = pd.DataFrame({'a': arr, 'b': arr})

%timeit df.iloc[0, 0] = 0

# 483 ms ± 74.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    <- main
# 3.4 ms ± 590 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR

The example above does not pass a null slice directly; internally, however, a null slice is used in ExtensionBlock.set_inplace:

self.values[:] = values
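
For context, a "null slice" is simply the bare `[:]` indexer, i.e. `slice(None, None, None)`. A minimal sketch of the check, mirroring the logic of the pandas.core.common.is_null_slice helper (the `com.is_null_slice` call in the diff below):

```python
def is_null_slice(key) -> bool:
    # True only for the bare `[:]` indexer, i.e. slice(None, None, None).
    return (
        isinstance(key, slice)
        and key.start is None
        and key.stop is None
        and key.step is None
    )

print(is_null_slice(slice(None)))  # True  (the `[:]` case)
print(is_null_slice(slice(0, 5)))  # False (a bounded slice)
```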

ASV added:

       before           after         ratio
     [7ef6a71c]       [4e4b36d4]
                      <arrow-ea-setitem-null-slice>
-         802±2μs       57.4±0.5μs     0.07  array.ArrowStringArray.time_setitem_null_slice(False)
-     3.12±0.04ms          194±2μs     0.06  array.ArrowStringArray.time_setitem_null_slice(True)

@lukemanley lukemanley added Performance Memory or execution speed performance Arrow pyarrow functionality labels Dec 14, 2022
@lukemanley lukemanley changed the title PERF: df.loc / df.iloc with pyarrow-backed columns PERF: setting values via df.loc / df.iloc with pyarrow-backed columns Dec 14, 2022
value = self._maybe_convert_setitem_value(value)

# fast path (GH50248)
if com.is_null_slice(key):
    if is_scalar(value) and not pa_version_under6p0:
Member

Do we need to directly check not pa_version_under6p0? If a user has data with a pyarrow data type I think it's implicitly assumed that pyarrow is installed?

Member Author

That path hits pc.if_else, which I believe was added in 6.0. pc.if_else was the fastest way I found to create an array filled with a single value. There are slower options, of course.

Member

We recently bumped the min pyarrow version to 6.0 so I think this should be safe to remove

Member Author

done

@mroeschke (Member) left a comment

  1. Is this robust to the case when the value being set is not the same type as the original array e.g. ser = pd.Series([1], dtype="int64"); ser[:] = "foo" should not work
  2. Related to the above, was this the case before in setitem?
  3. Could you add tests for the null slice if/else branches you added?

@lukemanley (Member Author)

  1. Is this robust to the case when the value being set is not the same type as the original array e.g. ser = pd.Series([1], dtype="int64"); ser[:] = "foo" should not work
  2. Related to the above, was this the case before in setitem?

It is robust in the sense that it raises:

import pandas as pd
ser = pd.Series([1], dtype="int64[pyarrow]")
ser[:] = "foo"

# with this PR:
ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64

# Previously:
ArrowNotImplementedError: NumPy type not implemented: unrecognized type (19) in GetNumPyTypeName

However, different indexers raise different errors:

import pandas as pd
ser = pd.Series([1], dtype="int64[pyarrow]")
ser[0] = "foo"

# ArrowNotImplementedError: NumPy type not implemented: unrecognized type (19) in GetNumPyTypeName

I suspect that refactoring setitem to not go through numpy would help with consistency and might make it cleaner.

  1. Could you add tests for the null slice if/else branches you added?

added

@mroeschke (Member)

Okay, glad to see that Arrow is erroring in this case. Could you add a test for this invalid setting case?

Yeah, I think ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64 is generally the better error message, but the refactoring can be left for a separate PR.

@lukemanley (Member Author)

Okay, glad to see that Arrow is erroring in this case. Could you add a test for this invalid setting case?

Added

@mroeschke mroeschke added this to the 2.0 milestone Dec 17, 2022
@mroeschke mroeschke merged commit b02e41a into pandas-dev:main Dec 17, 2022
@mroeschke (Member)

Thanks @lukemanley

phofl pushed a commit to phofl/pandas that referenced this pull request Dec 17, 2022
…pandas-dev#50248)

* perf: ArrowExtensionArray.__setitem__(null_slice)

* gh refs

* fix test

* add test for setitem null slice paths

* add test

* remove version check

* fix text
@lukemanley lukemanley deleted the arrow-ea-setitem-null-slice branch December 20, 2022 00:46