-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: implement arithmetic, comparison, logical ops on arrays #28431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/ops/array_ops.py
Outdated
@@ -132,3 +152,171 @@ def na_arithmetic_op(left, right, op, str_rep, eval_kwargs): | |||
result = masked_arith_op(left, right, op) | |||
|
|||
return missing.dispatch_fill_zeros(op, left, right, result) | |||
|
|||
|
|||
def arithmetic_op(left, right, op, str_rep, eval_kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a docstring would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the flow a bit here?
When do you end up here? What has already been done?
Eg if you have a left and right where left.__op__(right)
would return NotImplemeted, but right.__op__(left)
works, do you end up here and in what form?
Maybe a few examples of left and right types and how they are extracted before ending up here would help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the flow a bit here?
When do you end up here? What has already been done?
This will all go in a docstring before long (hopefully in a separate pass, as this is a blocker for arithmetic-perf which I'm eager to get done). The idea:
Everything Series
-specific is done outside of this func (same with the comparison and logical funcs): potentially returning NotImplemented
, finding res_name, checking/ensuring Series
alignment (comparison is different from the other two), calling _constructor
and __finalize__
.
We should never get here (for any of these funcs) with a DataFrame, Series, or Index for left
. We should never get here with a DataFrame for right. We can almost rule out Series or Index for right, the unfortunate exception being L176-180.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will all go in a docstring before long (hopefully in a separate pass, as this is a blocker for arithmetic-perf which I'm eager to get done)
Why not in this PR? Writing and reviewing a docstring should not take weeks, and it makes reviewing it here easier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing a simple docstring would be easy, but since we are dealing with very central methods with a number of corner cases, doing a thorough docstring is not. This PR is intended as a just-refactor, with a number of follow-ups when this blocker is done.
pandas/core/ops/array_ops.py
Outdated
), | ||
) | ||
|
||
lvalues = extract_array(left, extract_numpy=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we're completely done with deferring via NotImplemented by the time we get here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. So for e.g. PandasArray we will be able to use these directly only after #28037 is in.
Added requested comments and types. Holding off on docstrings since doing those right will be non-trivial. |
reverted types because mypy was complaining about indexes.base |
@jreback gentle ping. This is the last blocker before I can implement block-wise ops for the scalar case (and one of the two Series cases) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nice! Added a bunch of questions / comments
pandas/core/ops/array_ops.py
Outdated
@@ -132,3 +152,171 @@ def na_arithmetic_op(left, right, op, str_rep, eval_kwargs): | |||
result = masked_arith_op(left, right, op) | |||
|
|||
return missing.dispatch_fill_zeros(op, left, right, result) | |||
|
|||
|
|||
def arithmetic_op(left, right, op, str_rep, eval_kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the flow a bit here?
When do you end up here? What has already been done?
Eg if you have a left and right where left.__op__(right)
would return NotImplemeted, but right.__op__(left)
works, do you end up here and in what form?
Maybe a few examples of left and right types and how they are extracted before ending up here would help.
# cannot make the same assumption about `right`. This is because we need | ||
# to define `keep_null_freq` before calling extract_array on it. | ||
lvalues = left | ||
rvalues = extract_array(right, extract_numpy=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but DatetimelikeArrays keep the freq
? So why is it not possible to extract up to the array level for those?
And for a scalar Timestamp, does this freq actually matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And for a scalar Timestamp, does this freq actually matter?
The behavior is deprecated, but ts + 5
currently behaves like ts + 5*ts.freq
, so yes, freq
does matter.
but DatetimelikeArrays keep the freq ? So why is it not possible to extract up to the array level for those?
Suppose we have self = Series(range(5))
, so lvalues = np.array(range(5))
. Consider two cases for other
that we are going to add to self
:
other = DatetimeIndex(["2016-01-01"] * 5, tz="UTC")
other = Series(DatetimeIndex(["2016-01-01"] * 5, tz="UTC"))
In the first case, other
has a freq
attribute that is None, so the correct thing to do is raise NullFrequencyError (btw this behavior is deprecated as with Timestamp). In the second, there is no freq attribute, so the correct thing to do is raise TypeError. If we did the unwrapping before getting to arithmetic_op
, we would incorrectly raise NullFrequencyError in case 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move both extract_array's in here? would make groking this simpler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move both extract_array's in here? would make groking this simpler
I think the idea is to move both out longer term (once the freq business is fixed). If that is the case, I think it is fine for now to already have extract_array of left higher up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the first case, other has a freq attribute that is None, so the correct thing to do is raise NullFrequencyError (btw this behavior is deprecated as with Timestamp). In the second, there is no freq attribute, so the correct thing to do is raise TypeError. If we did the unwrapping before getting to arithmetic_op, we would incorrectly raise NullFrequencyError in case 2.
OK, I see. But once the deprecation is removed, this will always be a type error? And then this does not need a special case here? (just for understanding, not expecting it be done in this PR ;))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But once the deprecation is removed, this will always be a type error? And then this does not need a special case here? (just for understanding, not expecting it be done in this PR ;))
Correct x 2.
lvalues = left | ||
rvalues = right | ||
|
||
rvalues = lib.item_from_zerodim(rvalues) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't done yet higher up? (arithmetic_op is not doing this?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arithmetic_op doesnt, but probably should. Things like this are the motivation for #28037.
rvalues = maybe_upcast_for_op(rvalues, lvalues.shape) | ||
|
||
if should_extension_dispatch(left, rvalues) or isinstance( | ||
rvalues, (ABCTimedeltaArray, ABCDatetimeArray, Timestamp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the check for rvalues being ABCTimedeltaArray or ABCDatetimeArray not incluced in should_extension_dispatch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
historically should_extension_dispatch was more focused on dispatching to EA implementations, while the TDA/DTA/Timestamp check is specific to handling the freq
attribute. Combining them wouldn't be unreasonable.
pandas/core/ops/array_ops.py
Outdated
def logical_op(left, right, op): | ||
from pandas.core.ops import should_extension_dispatch, dispatch_to_extension_op | ||
|
||
def na_op(x, y): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a non-nested function like na_arithmetic_op
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea. the logical op is kind of a mess internal-consistency-wise, so I'm planning to do this in a dedicated PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do that here, since you are already doing the move from the other file in this PR? (it's just a copy paste I assume?)
Then a next PR editing it will be easier to review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
updated with types and bare-bones docstrings |
|
||
|
||
def comparison_op( | ||
left: Union[np.ndarray, ABCExtensionArray], right: Any, op |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't reviewed method body in detail but you might want to use ArrayLike
from pandas._typing
instead of the Union
here. The former is a TypeVar
which maintains type generically through the function
For example, if left
is an ndarray using the TypeVar
would mean that the function also returns an ndarray
. By contrast a Union
would allow left
to be a ndarray
but the function to return either an ndarray
or a ExtensionArray
NB: ndarray
and ABC*
classes both resolve to Any
so this helps documentation, but doesn't actually type check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, if left is an ndarray using the TypeVar would mean that the function also returns an ndarray
We do not have that guarantee for these functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is mostly just a cut/paste yes?
# cannot make the same assumption about `right`. This is because we need | ||
# to define `keep_null_freq` before calling extract_array on it. | ||
lvalues = left | ||
rvalues = extract_array(right, extract_numpy=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move both extract_array's in here? would make groking this simpler
""" | ||
from pandas.core.ops import should_extension_dispatch, dispatch_to_extension_op | ||
|
||
# NB: We assume extract_array has already been called on left and right |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above about extract_array
Correct. types and basic docstrings have been added. |
I agree that having extract_array in exactly one place is weird. We'll be able to get rid of it after the integer-addition deprecation is enforced (and keep_null_freq becomes unnecessary). The tradeoff is that doing all 6 extract_array calls inside the functions would break the typing "guarantees", which I'm starting to like. |
# cannot make the same assumption about `right`. This is because we need | ||
# to define `keep_null_freq` before calling extract_array on it. | ||
lvalues = left | ||
rvalues = extract_array(right, extract_numpy=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move both extract_array's in here? would make groking this simpler
I think the idea is to move both out longer term (once the freq business is fixed). If that is the case, I think it is fine for now to already have extract_array of left higher up.
# cannot make the same assumption about `right`. This is because we need | ||
# to define `keep_null_freq` before calling extract_array on it. | ||
lvalues = left | ||
rvalues = extract_array(right, extract_numpy=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the first case, other has a freq attribute that is None, so the correct thing to do is raise NullFrequencyError (btw this behavior is deprecated as with Timestamp). In the second, there is no freq attribute, so the correct thing to do is raise TypeError. If we did the unwrapping before getting to arithmetic_op, we would incorrectly raise NullFrequencyError in case 2.
OK, I see. But once the deprecation is removed, this will always be a type error? And then this does not need a special case here? (just for understanding, not expecting it be done in this PR ;))
---------- | ||
left : np.ndarray or ExtensionArray | ||
right : object | ||
Cannot be a DataFrame, Series, or Index. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to be more precise and say that it is a ndarray, ExtensionArray or scalar?
Or are there still other types of objects that can be passed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any wacky thing that you can pass to Series.__add__
can end up here with the exception of those three classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(maybe __eq__
would be a better example)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But eg a list, is that already coerced to an array by the time it is here?
Depending on what work has been done before this, the "any wacky thing you can pass" can still be considered as "a scalar"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some coercion (including list) is done a few lines down on L245-248.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes to everything up through "applied consistently". For the "or scalar" I'm reticent to make a definitive statement because we can brainstorm scalar-like things that wouldn't satisfy lib.is_scalar
(possible also some special cases for tuple)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scalar-like things that wouldn't satisfy lib.is_scalar
That's only a problem for object / generic EAs (ones we don't know) right? I mean for all other types you could have that guarantee at this location in the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. A user can pass something like ser + type(ser)
or ser + pd
and we an get here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But for all built-in types except object dtype, we know that such objects are invalid, so that could (consistently) be checked before getting here? (similarly like you want to move the length check out I thought)
(again, about long term plan, not this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we could in principle. I don't see it as a priority.
pandas/core/ops/array_ops.py
Outdated
def logical_op(left, right, op): | ||
from pandas.core.ops import should_extension_dispatch, dispatch_to_extension_op | ||
|
||
def na_op(x, y): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do that here, since you are already doing the move from the other file in this PR? (it's just a copy paste I assume?)
Then a next PR editing it will be easier to review.
I agree with @jbrockmendel that it is nice to already know what types are passed to those functions. So I think what is done in the PR (do extract_array outside, except for the one case where it is needed for the deprecated functionality) is fine. It's only temporary (we should remove that deprecation for 1.0), and gives the least code flux afterwards (only that one case can be moved out afterwards. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Let's give @jreback a day or two to see if he has time to look and then merge otherwise.
rebased+green |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just some minor comments, ping on green.
from pandas.core.ops.docstrings import ( | ||
_arith_doc_FRAME, | ||
_flex_comp_doc_FRAME, | ||
_make_flex_doc, | ||
_op_descriptions, | ||
) | ||
from pandas.core.ops.invalid import invalid_comparison | ||
from pandas.core.ops.invalid import invalid_comparison # noqa:F401 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this still needed? (also on comp_method_OBJECT_ARRAY)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. They are imported elsewhere. This will get cleaned up as we get more stuff out of __init__
|
||
|
||
def na_logical_op(x, y, op): | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you can type / doc-string at some point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. This function is headed for a pretty major re-write in follow-ups.
if should_extension_dispatch(lvalues, rvalues): | ||
res_values = dispatch_to_extension_op(op, lvalues, rvalues) | ||
|
||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be direct else/if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not without duplicating L378-383 i think
thanks @jbrockmendel couple of followup requests noted above |
Following #28413, #28396, and #28395, these are all pretty straightforward refactors.
arithmetic_op
,comparison_op
, andlogical_op
are going to become the ops that we callBlock
-wise. Hopefully alsoPandasArray
will use them directly.Other follow-ups I have in mind:
should_extension_dispatch
anddispatch_to_extension_op
so we dont need runtime importseval_kwargs
entirely, they're not really necessary at this pointnumexpr
for comparison/logical opsextract_array
Series
methods handle__finalize__
and alignmentlogical_op
'sna_op