Skip to content

REF: implement arithmetic, comparison, logical ops on arrays #28431

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Sep 23, 2019

Conversation

jbrockmendel
Copy link
Member

Following #28413, #28396, and #28395, these are all pretty straightforward refactors.

arithmetic_op, comparison_op, and logical_op are going to become the ops that we call Block-wise. Hopefully also PandasArray will use them directly.

Other follow-ups I have in mind:

  • Move should_extension_dispatch and dispatch_to_extension_op so we dont need runtime imports
  • get rid of eval_kwargs entirely, they're not really necessary at this point
  • see if we can use numexpr for comparison/logical ops
  • docstrings
  • address inconsistencies in when we call extract_array
  • address inconsistencies in how Series methods handle __finalize__ and alignment
  • simplify logical_op's na_op

@@ -132,3 +152,171 @@ def na_arithmetic_op(left, right, op, str_rep, eval_kwargs):
result = masked_arith_op(left, right, op)

return missing.dispatch_fill_zeros(op, left, right, result)


def arithmetic_op(left, right, op, str_rep, eval_kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a docstring would be nice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the flow a bit here?
When do you end up here? What has already been done?
Eg if you have a left and right where left.__op__(right) would return NotImplemeted, but right.__op__(left) works, do you end up here and in what form?

Maybe a few examples of left and right types and how they are extracted before ending up here would help.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the flow a bit here?
When do you end up here? What has already been done?

This will all go in a docstring before long (hopefully in a separate pass, as this is a blocker for arithmetic-perf which I'm eager to get done). The idea:

Everything Series-specific is done outside of this func (same with the comparison and logical funcs): potentially returning NotImplemented, finding res_name, checking/ensuring Series alignment (comparison is different from the other two), calling _constructor and __finalize__.

We should never get here (for any of these funcs) with a DataFrame, Series, or Index for left. We should never get here with a DataFrame for right. We can almost rule out Series or Index for right, the unfortunate exception being L176-180.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will all go in a docstring before long (hopefully in a separate pass, as this is a blocker for arithmetic-perf which I'm eager to get done)

Why not in this PR? Writing and reviewing a docstring should not take weeks, and it makes reviewing it here easier

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing a simple docstring would be easy, but since we are dealing with very central methods with a number of corner cases, doing a thorough docstring is not. This PR is intended as a just-refactor, with a number of follow-ups when this blocker is done.

),
)

lvalues = extract_array(left, extract_numpy=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we're completely done with deferring via NotImplemented by the time we get here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. So for e.g. PandasArray we will be able to use these directly only after #28037 is in.

@jbrockmendel
Copy link
Member Author

Added requested comments and types. Holding off on docstrings since doing those right will be non-trivial.

@jbrockmendel
Copy link
Member Author

reverted types because mypy was complaining about indexes.base

@gfyoung gfyoung added Clean Internals Related to non-user accessible pandas implementation labels Sep 15, 2019
@jbrockmendel
Copy link
Member Author

@jreback gentle ping. This is the last blocker before I can implement block-wise ops for the scalar case (and one of the two Series cases)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice! Added a bunch of questions / comments

@@ -132,3 +152,171 @@ def na_arithmetic_op(left, right, op, str_rep, eval_kwargs):
result = masked_arith_op(left, right, op)

return missing.dispatch_fill_zeros(op, left, right, result)


def arithmetic_op(left, right, op, str_rep, eval_kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the flow a bit here?
When do you end up here? What has already been done?
Eg if you have a left and right where left.__op__(right) would return NotImplemeted, but right.__op__(left) works, do you end up here and in what form?

Maybe a few examples of left and right types and how they are extracted before ending up here would help.

# cannot make the same assumption about `right`. This is because we need
# to define `keep_null_freq` before calling extract_array on it.
lvalues = left
rvalues = extract_array(right, extract_numpy=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but DatetimelikeArrays keep the freq ? So why is it not possible to extract up to the array level for those?

And for a scalar Timestamp, does this freq actually matter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And for a scalar Timestamp, does this freq actually matter?

The behavior is deprecated, but ts + 5 currently behaves like ts + 5*ts.freq, so yes, freq does matter.

but DatetimelikeArrays keep the freq ? So why is it not possible to extract up to the array level for those?

Suppose we have self = Series(range(5)), so lvalues = np.array(range(5)). Consider two cases for other that we are going to add to self:

  1. other = DatetimeIndex(["2016-01-01"] * 5, tz="UTC")
  2. other = Series(DatetimeIndex(["2016-01-01"] * 5, tz="UTC"))

In the first case, other has a freq attribute that is None, so the correct thing to do is raise NullFrequencyError (btw this behavior is deprecated as with Timestamp). In the second, there is no freq attribute, so the correct thing to do is raise TypeError. If we did the unwrapping before getting to arithmetic_op, we would incorrectly raise NullFrequencyError in case 2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move both extract_array's in here? would make groking this simpler

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move both extract_array's in here? would make groking this simpler

I think the idea is to move both out longer term (once the freq business is fixed). If that is the case, I think it is fine for now to already have extract_array of left higher up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the first case, other has a freq attribute that is None, so the correct thing to do is raise NullFrequencyError (btw this behavior is deprecated as with Timestamp). In the second, there is no freq attribute, so the correct thing to do is raise TypeError. If we did the unwrapping before getting to arithmetic_op, we would incorrectly raise NullFrequencyError in case 2.

OK, I see. But once the deprecation is removed, this will always be a type error? And then this does not need a special case here? (just for understanding, not expecting it be done in this PR ;))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But once the deprecation is removed, this will always be a type error? And then this does not need a special case here? (just for understanding, not expecting it be done in this PR ;))

Correct x 2.

lvalues = left
rvalues = right

rvalues = lib.item_from_zerodim(rvalues)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't done yet higher up? (arithmetic_op is not doing this?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arithmetic_op doesnt, but probably should. Things like this are the motivation for #28037.

rvalues = maybe_upcast_for_op(rvalues, lvalues.shape)

if should_extension_dispatch(left, rvalues) or isinstance(
rvalues, (ABCTimedeltaArray, ABCDatetimeArray, Timestamp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the check for rvalues being ABCTimedeltaArray or ABCDatetimeArray not incluced in should_extension_dispatch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

historically should_extension_dispatch was more focused on dispatching to EA implementations, while the TDA/DTA/Timestamp check is specific to handling the freq attribute. Combining them wouldn't be unreasonable.

def logical_op(left, right, op):
from pandas.core.ops import should_extension_dispatch, dispatch_to_extension_op

def na_op(x, y):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a non-nested function like na_arithmetic_op ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea. the logical op is kind of a mess internal-consistency-wise, so I'm planning to do this in a dedicated PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do that here, since you are already doing the move from the other file in this PR? (it's just a copy paste I assume?)
Then a next PR editing it will be easier to review.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@TomAugspurger TomAugspurger added this to the 1.0 milestone Sep 17, 2019
@jbrockmendel
Copy link
Member Author

updated with types and bare-bones docstrings



def comparison_op(
left: Union[np.ndarray, ABCExtensionArray], right: Any, op
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't reviewed method body in detail but you might want to use ArrayLike from pandas._typing instead of the Union here. The former is a TypeVar which maintains type generically through the function

For example, if left is an ndarray using the TypeVar would mean that the function also returns an ndarray. By contrast a Union would allow left to be a ndarray but the function to return either an ndarray or a ExtensionArray

NB: ndarray and ABC* classes both resolve to Any so this helps documentation, but doesn't actually type check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, if left is an ndarray using the TypeVar would mean that the function also returns an ndarray

We do not have that guarantee for these functions.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is mostly just a cut/paste yes?

# cannot make the same assumption about `right`. This is because we need
# to define `keep_null_freq` before calling extract_array on it.
lvalues = left
rvalues = extract_array(right, extract_numpy=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move both extract_array's in here? would make groking this simpler

"""
from pandas.core.ops import should_extension_dispatch, dispatch_to_extension_op

# NB: We assume extract_array has already been called on left and right
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above about extract_array

@jbrockmendel
Copy link
Member Author

this is mostly just a cut/paste yes?

Correct. types and basic docstrings have been added.

@jbrockmendel
Copy link
Member Author

can you move both extract_array's in here? would make groking this simpler

I agree that having extract_array in exactly one place is weird. We'll be able to get rid of it after the integer-addition deprecation is enforced (and keep_null_freq becomes unnecessary).

The tradeoff is that doing all 6 extract_array calls inside the functions would break the typing "guarantees", which I'm starting to like.

# cannot make the same assumption about `right`. This is because we need
# to define `keep_null_freq` before calling extract_array on it.
lvalues = left
rvalues = extract_array(right, extract_numpy=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move both extract_array's in here? would make groking this simpler

I think the idea is to move both out longer term (once the freq business is fixed). If that is the case, I think it is fine for now to already have extract_array of left higher up.

# cannot make the same assumption about `right`. This is because we need
# to define `keep_null_freq` before calling extract_array on it.
lvalues = left
rvalues = extract_array(right, extract_numpy=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the first case, other has a freq attribute that is None, so the correct thing to do is raise NullFrequencyError (btw this behavior is deprecated as with Timestamp). In the second, there is no freq attribute, so the correct thing to do is raise TypeError. If we did the unwrapping before getting to arithmetic_op, we would incorrectly raise NullFrequencyError in case 2.

OK, I see. But once the deprecation is removed, this will always be a type error? And then this does not need a special case here? (just for understanding, not expecting it be done in this PR ;))

----------
left : np.ndarray or ExtensionArray
right : object
Cannot be a DataFrame, Series, or Index.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to be more precise and say that it is a ndarray, ExtensionArray or scalar?
Or are there still other types of objects that can be passed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any wacky thing that you can pass to Series.__add__ can end up here with the exception of those three classes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(maybe __eq__ would be a better example)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But eg a list, is that already coerced to an array by the time it is here?
Depending on what work has been done before this, the "any wacky thing you can pass" can still be considered as "a scalar"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some coercion (including list) is done a few lines down on L245-248.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes to everything up through "applied consistently". For the "or scalar" I'm reticent to make a definitive statement because we can brainstorm scalar-like things that wouldn't satisfy lib.is_scalar (possible also some special cases for tuple)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scalar-like things that wouldn't satisfy lib.is_scalar

That's only a problem for object / generic EAs (ones we don't know) right? I mean for all other types you could have that guarantee at this location in the code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. A user can pass something like ser + type(ser) or ser + pd and we an get here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But for all built-in types except object dtype, we know that such objects are invalid, so that could (consistently) be checked before getting here? (similarly like you want to move the length check out I thought)

(again, about long term plan, not this PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could in principle. I don't see it as a priority.

def logical_op(left, right, op):
from pandas.core.ops import should_extension_dispatch, dispatch_to_extension_op

def na_op(x, y):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do that here, since you are already doing the move from the other file in this PR? (it's just a copy paste I assume?)
Then a next PR editing it will be easier to review.

@jorisvandenbossche
Copy link
Member

[@jreback] can you move both extract_array's in here? would make groking this simpler

I agree that having extract_array in exactly one place is weird. We'll be able to get rid of it after the integer-addition deprecation is enforced (and keep_null_freq becomes unnecessary).

The tradeoff is that doing all 6 extract_array calls inside the functions would break the typing "guarantees", which I'm starting to like.

I agree with @jbrockmendel that it is nice to already know what types are passed to those functions. So I think what is done in the PR (do extract_array outside, except for the one case where it is needed for the deprecated functionality) is fine. It's only temporary (we should remove that deprecation for 1.0), and gives the least code flux afterwards (only that one case can be moved out afterwards.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Let's give @jreback a day or two to see if he has time to look and then merge otherwise.

@jbrockmendel
Copy link
Member Author

rebased+green

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some minor comments, ping on green.

from pandas.core.ops.docstrings import (
_arith_doc_FRAME,
_flex_comp_doc_FRAME,
_make_flex_doc,
_op_descriptions,
)
from pandas.core.ops.invalid import invalid_comparison
from pandas.core.ops.invalid import invalid_comparison # noqa:F401
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still needed? (also on comp_method_OBJECT_ARRAY)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. They are imported elsewhere. This will get cleaned up as we get more stuff out of __init__



def na_logical_op(x, y, op):
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can type / doc-string at some point

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. This function is headed for a pretty major re-write in follow-ups.

if should_extension_dispatch(lvalues, rvalues):
res_values = dispatch_to_extension_op(op, lvalues, rvalues)

else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be direct else/if

Copy link
Member Author

@jbrockmendel jbrockmendel Sep 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not without duplicating L378-383 i think

@jreback jreback merged commit b106108 into pandas-dev:master Sep 23, 2019
@jreback
Copy link
Contributor

jreback commented Sep 23, 2019

thanks @jbrockmendel couple of followup requests noted above

@jbrockmendel jbrockmendel deleted the array_ops branch September 23, 2019 14:29
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Sep 24, 2019
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants