Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
PERF: operate on arrays instead of Series in DataFrame/DataFrame ops #33561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: operate on arrays instead of Series in DataFrame/DataFrame ops #33561
Changes from 7 commits
1102f0d
cd0218c
44ec4f4
9ff61c4
c4b7823
b8de50e
103db6c
d16ced8
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesnt need to be in this PR, but since you're attention is already focused on this line: this can be incorrect if
b
is e.g. Categorical, sinceb.iloc[i]
will have the scalar behavior instead of the Categorical (usually raising) behavior. (This is also a place where NaT causes issues, and the idea of only-one-NA causes me nightmares)The main idea I've had here is to use
b.iloc[[i]]
instead ofb.iloc[i]
, haven't gotten around to implementing it. Does this change make it easier to address the issue?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when using something like
b.iloc[[i]]
, you would need to handle broadcasting here as well (since the array ops expect either same length array or scalar). So since that is a pre-existing issue, I would leave that for a dedicated PR.Is there an issue for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think there's an issue for this. If I had found one, i would have added the Numeric label
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this have a perf impact? IIRC a while back I tried removing it because I thought it shouldn't, but found that it did
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, the
get_array_op
should take care of this?For example, the
arrays_ops.py::na_arithmetic_op
also callsexpressions.evaluate
to perform the actual op.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was what i thought too, and I could be remembering incorrectly. But this is worth double-checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think you can just return
arrays
here, since that will be passed toleft._construct_result
(though the docstring does say DataFrame, i guess thats wrong ATM)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to explicitly use
_from_arrays
, as that is part of the reason that makes this PR a perf improvement (the_from_arrays
constructor specifically handles the case of a list of arrays corresponding to columns). Also, the default DataFrame constructor handles a list of arrays differently.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good