PERF: setting values via df.loc / df.iloc with pyarrow-backed columns #50248


Merged: 8 commits into pandas-dev:main on Dec 17, 2022

Conversation


@lukemanley lukemanley commented Dec 14, 2022

Performance improvement in ArrowExtensionArray.__setitem__ when the key is a null slice. The rationale for adding a fast path here is that, internally, null slices are used in a variety of DataFrame setitem operations. Here is an example:

import pandas as pd
import numpy as np

arr = pd.array(np.arange(10**6), dtype="int64[pyarrow]")
df = pd.DataFrame({'a': arr, 'b': arr})

%timeit df.iloc[0, 0] = 0

# 483 ms ± 74.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    <- main
# 3.4 ms ± 590 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR

The example above does not pass a null slice directly; internally, however, a null slice is used in ExtensionBlock.set_inplace:

self.values[:] = values
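
For context, a "null slice" is simply the bare `[:]` indexer, i.e. `slice(None, None, None)`. A minimal sketch of the check, mirroring the logic of the pandas.core.common.is_null_slice helper (the `com.is_null_slice` call in the diff below):

```python
def is_null_slice(key) -> bool:
    # True only for the bare `[:]` indexer, i.e. slice(None, None, None).
    return (
        isinstance(key, slice)
        and key.start is None
        and key.stop is None
        and key.step is None
    )

print(is_null_slice(slice(None)))  # True  (the `[:]` case)
print(is_null_slice(slice(0, 5)))  # False (a bounded slice)
```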

ASV added:

       before           after         ratio
     [7ef6a71c]       [4e4b36d4]
                      <arrow-ea-setitem-null-slice>
-         802±2μs       57.4±0.5μs     0.07  array.ArrowStringArray.time_setitem_null_slice(False)
-     3.12±0.04ms          194±2μs     0.06  array.ArrowStringArray.time_setitem_null_slice(True)

@lukemanley lukemanley added Performance Memory or execution speed performance Arrow pyarrow functionality labels Dec 14, 2022
@lukemanley lukemanley changed the title PERF: df.loc / df.iloc with pyarrow-backed columns PERF: setting values via df.loc / df.iloc with pyarrow-backed columns Dec 14, 2022
value = self._maybe_convert_setitem_value(value)

# fast path (GH50248)
if com.is_null_slice(key):
    if is_scalar(value) and not pa_version_under6p0:
Member

Do we need to directly check not pa_version_under6p0? If a user has data with a pyarrow data type I think it's implicitly assumed that pyarrow is installed?

Member Author

That path hits pc.if_else, which I believe was added in 6.0. pc.if_else was the fastest way I found to create an array filled with a single value. There are slower options, of course.

Member

We recently bumped the min pyarrow version to 6.0 so I think this should be safe to remove

Member Author

done

@mroeschke (Member) left a comment

  1. Is this robust to the case when the value being set is not the same type as the original array e.g. ser = pd.Series([1], dtype="int64"); ser[:] = "foo" should not work
  2. Related to the above, was this the case before in setitem?
  3. Could you add tests for the null slice if/else branches you added?

@lukemanley (Member Author)

  1. Is this robust to the case when the value being set is not the same type as the original array e.g. ser = pd.Series([1], dtype="int64"); ser[:] = "foo" should not work
  2. Related to the above, was this the case before in setitem?

It is robust in the sense that it raises:

import pandas as pd
ser = pd.Series([1], dtype="int64[pyarrow]")
ser[:] = "foo"

# with this PR:
ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64

# Previously:
ArrowNotImplementedError: NumPy type not implemented: unrecognized type (19) in GetNumPyTypeName

However, different indexers raise different errors:

import pandas as pd
ser = pd.Series([1], dtype="int64[pyarrow]")
ser[0] = "foo"

# ArrowNotImplementedError: NumPy type not implemented: unrecognized type (19) in GetNumPyTypeName

I suspect that refactoring setitem to not go through numpy would help with consistency and might make it cleaner.

  1. Could you add tests for the null slice if/else branches you added?

added

@mroeschke (Member)

Okay, glad to see that Arrow is erroring in this case. Could you add a test for this invalid setting case?

Yeah, I think ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64 is generally the better error message, but the refactoring can be left for a separate PR.

@lukemanley (Member Author)

Okay, glad to see that Arrow is erroring in this case. Could you add a test for this invalid setting case?

Added

@mroeschke mroeschke added this to the 2.0 milestone Dec 17, 2022
@mroeschke mroeschke merged commit b02e41a into pandas-dev:main Dec 17, 2022
@mroeschke (Member)

Thanks @lukemanley

phofl pushed a commit to phofl/pandas that referenced this pull request Dec 17, 2022
…pandas-dev#50248)

* perf: ArrowExtensionArray.__setitem__(null_slice)

* gh refs

* fix test

* add test for setitem null slice paths

* add test

* remove version check

* fix text
@lukemanley lukemanley deleted the arrow-ea-setitem-null-slice branch December 20, 2022 00:46