PERF: setting values via df.loc / df.iloc with pyarrow-backed columns #50248
Conversation
pandas/core/arrays/arrow/array.py (outdated diff)
value = self._maybe_convert_setitem_value(value)

# fast path (GH50248)
if com.is_null_slice(key):
    if is_scalar(value) and not pa_version_under6p0:
Do we need to directly check not pa_version_under6p0? If a user has data with a pyarrow data type, I think it's implicitly assumed that pyarrow is installed?
That path hits pc.if_else, which I believe was added in 6.0. pc.if_else was the fastest way I found to create an array of all the same value. There are slower options of course.
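A minimal sketch of the trick being described, assuming pyarrow >= 6.0 (the variable names here are illustrative, not the PR's actual implementation): with a scalar True condition, pc.if_else broadcasts a scalar value against an existing array, producing an array of the same length and type filled with that value.

```python
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1, 2, 3, 4], type=pa.int64())

# A scalar True condition plus a scalar "left" value broadcast against the
# "right" array, so every element of the result is the new value while the
# length and type of the original array are preserved.
filled = pc.if_else(True, pa.scalar(7, type=arr.type), arr)
# filled is an int64 array of length 4 with every element equal to 7
```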
We recently bumped the minimum pyarrow version to 6.0, so I think this should be safe to remove.
done
- Is this robust to the case when the value being set is not the same type as the original array? e.g. ser = pd.Series([1], dtype="int64"); ser[:] = "foo" should not work (a sketch of such a test follows this list).
- Related to the above, was this case handled before in setitem?
- Could you add tests for the null slice if/else branches you added?
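A minimal sketch of what such a test could look like, assuming a pyarrow-backed Series (this is illustrative, not the test added in the PR; the exact exception type is not specified in the thread, so a tuple of plausible types is used):

```python
import pandas as pd
import pyarrow as pa
import pytest


def test_setitem_null_slice_invalid_scalar_raises():
    # Hypothetical test sketch: setting a string into an int64[pyarrow] Series
    # via a null slice should raise rather than silently coerce.
    ser = pd.Series([1, 2, 3], dtype="int64[pyarrow]")
    with pytest.raises((TypeError, ValueError, pa.ArrowInvalid)):
        ser[:] = "foo"
```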
It is robust in the sense that it raises. However, different indexers are raising different errors. I suspect that refactoring setitem to not go through numpy would help with consistency and might make it cleaner.
Added
Okay, glad to see that Arrow is erroring in this case. Could you add a test for this invalid setting case? Yeah, I think generally
Added
Thanks @lukemanley
PERF: setting values via df.loc / df.iloc with pyarrow-backed columns (pandas-dev#50248)

* perf: ArrowExtensionArray.__setitem__(null_slice)
* gh refs
* fix test
* add test for setitem null slice paths
* add test
* remove version check
* fix text
doc/source/whatsnew/v2.0.0.rst:
Performance improvement in ArrowExtensionArray.__setitem__ when key is a null slice.

The rationale for adding a fast path here is that, internally, null slices get used in a variety of DataFrame setitem operations. Here is an example (see the sketch below); the example itself does not use a null slice, however, internally a null slice is used in ExtensionBlock.set_inplace:

pandas/core/internals/blocks.py, line 1646 (at 0189674)
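A minimal sketch of the kind of example described (illustrative only; the column name, size, and indexer are assumptions, not the original example from the PR description):

```python
import numpy as np
import pandas as pd

# A DataFrame with a pyarrow-backed column.
df = pd.DataFrame({"a": pd.Series(np.arange(1_000_000), dtype="int64[pyarrow]")})

# Setting values through df.loc does not pass a null slice at the API level,
# but (per the PR description) the updated column is written back through
# ExtensionBlock.set_inplace, which assigns with a null slice of the form
# values[:] = new_values, hitting ArrowExtensionArray.__setitem__.
df.loc[df["a"] % 2 == 0, "a"] = 0
```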
ASV added:
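A hypothetical ASV-style sketch of what such a benchmark could look like (the actual benchmark added in the PR is not reproduced here; class/method names and sizes are assumptions):

```python
import numpy as np
import pandas as pd


class ArrowSetitemNullSlice:
    # Hypothetical asv benchmark; not the one added in the PR.
    def setup(self):
        self.ser = pd.Series(np.arange(1_000_000), dtype="int64[pyarrow]")

    def time_setitem_null_slice_scalar(self):
        # key == slice(None) exercises the ArrowExtensionArray.__setitem__ fast path
        self.ser[:] = 1
```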