use numexpr for Series comparisons #32047
Conversation
Does this give better performance?
        "Invalid comparison between dtype=int64 and Timestamp",
        "'[<>]' not supported between instances of 'Timestamp' and 'int'",
    ]
)
can we test explicitly with one message and then the other instead? or even better can we make the error messages consistent?
yep, turns out we only need the new one, updated.
After running some timings, I found a surprising and unacceptable 2x slowdown in homogeneous DataFrame ops, so I reverted the edit in dispatch_to_series (which I thought, apparently incorrectly, was not doing anything useful). These timings are after that reversion (I'll push shortly). We pick up some good ground for Series ops, it's a wash for homogeneous DataFrame ops, and we lose a small (and confusing) amount of ground for heterogeneous DataFrame ops.
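A minimal sketch of how timings like these can be collected with `timeit`; the sizes and repeat counts here are arbitrary, and this compares a Series comparison against the equivalent raw-ndarray op rather than reproducing the exact benchmarks discussed above.

```python
import timeit

import numpy as np
import pandas as pd

# Arbitrary size for illustration; real benchmarks would sweep sizes.
s = pd.Series(np.random.randn(1_000_000))
arr = s.to_numpy()

# Time the Series comparison vs. the underlying ndarray comparison.
series_time = timeit.timeit(lambda: s > 0, number=20)
ndarray_time = timeit.timeit(lambda: arr > 0, number=20)

print(f"Series:  {series_time:.4f}s")
print(f"ndarray: {ndarray_time:.4f}s")
```

The gap between the two numbers approximates the per-op overhead of the Series dispatch path, which is what the numexpr change is trying to pay for.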
So the reason we originally wrote eval was to provide a quite significant speedup when using multiple chained ops / comparisons (this is slightly orthogonal to this issue). However, https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#accelerated-operations is very relevant: a) we should have ASVs for these, and b) we should be sure we are actually using the accelerated options (and likely update this link).
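A hedged sketch of the accelerated chained ops the link above describes: `DataFrame.eval` fuses a multi-step expression into one, which the numexpr engine (if installed) can evaluate without materializing intermediate arrays; without numexpr, pandas falls back to plain Python evaluation. Column names and sizes here are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 3), columns=["a", "b", "c"])

# One fused expression: with numexpr installed, the chained arithmetic
# and comparison are evaluated in a single pass.
fused = df.eval("(a + b) * c > 0.5")

# The naive version allocates an intermediate array per operation.
direct = (df["a"] + df["b"]) * df["c"] > 0.5

print(fused.equals(direct))
```

For small frames the fused path is often slower due to parsing overhead; the payoff shows up on large frames with several chained ops.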
yah, I think it was @TomAugspurger who recently discussed being more careful about what we pass to numexpr. I think in particular the tests can be improved to check not just "was numexpr called" but "was it called without falling back to python". This is a blocker for the branch that implements frame-with-frame ops blockwise, so I'd like to minimize side-tracking.
@@ -150,6 +152,8 @@ def na_arithmetic_op(left, right, op, str_rep: str):
    try:
        result = expressions.evaluate(op, str_rep, left, right)
    except TypeError:
        if is_cmp:
what hits this AND is is_cmp? can you add a comment?
so we have 7 cases that get here. 4 of these are something like ndarray[float] > datetime, and would pass if we removed this is_cmp check. Of the remaining three:
- two involve passing a DatetimeArray to masked_arith_op, which expects ndarray; this should probably be prevented.
- one involves comparisons of complex numbers, which numpy does differently than python (which is its own PITA). Falling through to the masked op would break test_bool_flex_frame_complex_dtype.
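An illustration of the complex-comparison divergence mentioned above: Python refuses to order complex numbers, while NumPy (at least historically) orders complex arrays lexicographically, real part first, then imaginary part.

```python
import numpy as np

# Python scalars: ordering complex numbers raises TypeError.
try:
    (1 + 2j) > (1 + 1j)
    python_raised = False
except TypeError:
    python_raised = True

# NumPy arrays: comparison succeeds, using lexicographic ordering
# (compare real parts, then imaginary parts as the tiebreaker).
numpy_result = np.array([1 + 2j]) > np.array([1 + 1j])

print(python_raised)
print(numpy_result)
```

This is why a masked element-wise fallback (which compares Python scalars) cannot reproduce the vectorized NumPy result for complex dtypes.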
@@ -250,7 +257,10 @@ def comparison_op(
    op_name = f"__{op.__name__}__"
    method = getattr(lvalues, op_name)
    with np.errstate(all="ignore"):
-       res_values = method(rvalues)
+       res_values = na_arithmetic_op(lvalues, rvalues, op, str_rep, is_cmp=True)
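A minimal sketch of the control flow these two hunks suggest. The `_sketch` names are hypothetical simplifications, not the actual pandas functions: the real `na_arithmetic_op` delegates to `expressions.evaluate` and `masked_arith_op`, which are stood in for here by a plain callable and an object-dtype loop.

```python
import numpy as np


def masked_arith_op_sketch(x, y, op):
    # Hypothetical stand-in for pandas' masked_arith_op: apply op
    # element-wise in object space.
    return np.array([op(a, b) for a, b in zip(x, y)], dtype=object)


def na_arithmetic_op_sketch(left, right, op, is_cmp=False):
    # Try the fast vectorized path first; on TypeError, comparisons
    # re-raise (the caller falls back to an object-dtype comparison),
    # while arithmetic retries with the masked element-wise op.
    try:
        return op(left, right)  # stand-in for expressions.evaluate
    except TypeError:
        if is_cmp:
            # masked_arith_op is not a safe fallback for comparisons
            raise
        return masked_arith_op_sketch(left, right, op)


def picky_add(a, b):
    # Raises for whole arrays to simulate a failed vectorized path.
    if isinstance(a, np.ndarray):
        raise TypeError("no vectorized path")
    return a + b


# Arithmetic: falls back to the element-wise masked op.
print(na_arithmetic_op_sketch(np.array([1, 2]), np.array([3, 4]), picky_add))
```

The point the reviewer raises is that the comparison-specific branch could live inside `na_arithmetic_op` (as here) rather than being handled by the caller in `comparison_op`.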
why is thi not handled in na_arithmetic_op? seems odd to handle here
sure
thanks.
subsumes #31984.