Skip to content

use numexpr for Series comparisons #32047

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Feb 26, 2020
Merged

Conversation

jbrockmendel
Copy link
Member

subsumes #31984.

@jorisvandenbossche
Copy link
Member

Does this give a better performance?

"Invalid comparison between dtype=int64 and Timestamp",
"'[<>]' not supported between instances of 'Timestamp' and 'int'",
]
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we test explicitly with one message and then the other instead? or even better can we make the error messages consistent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, turns out we only need the new one, updated.

@simonjayhawkins simonjayhawkins added Numeric Operations Arithmetic, Comparison, and Logical operations Performance Memory or execution speed performance labels Feb 17, 2020
@jbrockmendel
Copy link
Member Author

Does this give a better performance?

After running some timeings, I found a surprising and unacceptable 2x slowdown in homogeneous dataframe ops, so I reverted the edit in dispatch_to_series (which I thought, apparently incorrectly, was not doing anything useful).

These timings are after that reversion (I'll push shortly). We pick up some good ground for Series ops, its a wash for homogeneous DataFrame ops, and we lose a small (and confusing) amount of ground for heterogeneous DataFrame ops.

In [5]: ser = pd.Series(range(10**7))                                                                                                                           
In [6]: %timeit ser == ser                                                                                                                                      
6.86 ms ± 295 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- master
4.68 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- PR

In [7]: df = pd.DataFrame({n: range(10**6) for n in range(10)})                                                                                                 
In [8]: %timeit df == df                                                                                                                                        
6.53 ms ± 61.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master
6.47 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- PR

In [8]: df[1] = df[1].astype(float)                                                                                                                             
In [9]: %timeit df == df                                                                                                                                        
12 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master
12.8 ms ± 338 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- PR

@jreback
Copy link
Contributor

jreback commented Feb 19, 2020

so the reason we orignally wrote eval was to provide a quite significant speedup using multiple chained ops / comparisions, e.g.

https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#expression-evaluation-via-eval

(this is slightly orthogonal to this issue)

however,

https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#accelerated-operations are very relevant. a) we should have asv's for these, and be sure we are actually using the accelerated options (and likley update this link)

@jbrockmendel
Copy link
Member Author

yah, I think it was @TomAugspurger who recently discussed being more careful about what we pass to numexpr. I think in particular the tests can be improved to be not just "was numexpr called" but "was it called without falling back to python"

This is a blocker for the branch that implements frame-with-frame ops blockwise, so id like to minimize side-tracking.

@jreback jreback added this to the 1.1 milestone Feb 22, 2020
@@ -150,6 +152,8 @@ def na_arithmetic_op(left, right, op, str_rep: str):
try:
result = expressions.evaluate(op, str_rep, left, right)
except TypeError:
if is_cmp:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what hits this AND is a is_cmp? can you add a comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we have 7 cases that get here. 4 of these are something like ndarray[float] > datetime, and would pass if we removed this is_cmp check. Of the remaining three:

  • 2 involves passing a DatetimeArray to masked_arith_op, which expects ndarray. this should probably be prevented.
  • one involves comparions of complex numbers, which numpy does differently than python (which is its own PITA). falling through to the masked op would break test_bool_flex_frame_complex_dtype.

@@ -250,7 +257,10 @@ def comparison_op(
op_name = f"__{op.__name__}__"
method = getattr(lvalues, op_name)
with np.errstate(all="ignore"):
res_values = method(rvalues)
res_values = na_arithmetic_op(lvalues, rvalues, op, str_rep, is_cmp=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is thi not handled in na_arithmetic_op? seems odd to handle here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@jreback jreback merged commit e6bd49f into pandas-dev:master Feb 26, 2020
@jreback
Copy link
Contributor

jreback commented Feb 26, 2020

thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Numeric Operations Arithmetic, Comparison, and Logical operations Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants