use numexpr for Series comparisons #32047

jbrockmendel · 2020-02-17T03:31:33Z

subsumes #31984.

jorisvandenbossche · 2020-02-17T08:25:38Z

Does this give better performance?

simonjayhawkins · 2020-02-17T10:57:28Z

pandas/tests/arithmetic/test_numeric.py

+                "Invalid comparison between dtype=int64 and Timestamp",
+                "'[<>]' not supported between instances of 'Timestamp' and 'int'",
+            ]
+        )


Can we test explicitly with one message and then the other instead? Or, even better, can we make the error messages consistent?

jbrockmendel (Feb 17, 2020): Yep, turns out we only need the new one; updated.
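For readers following the thread above, here is a minimal sketch of why a list of alternative messages makes for a weaker test than a single expected message. It assumes the list in the diff is joined with "|" and handed to pytest.raises(match=...), which applies re.search to the exception text; the helper name `matches` and the two sample strings are made up for illustration.

```python
import re

# The two alternatives quoted in the diff; each one is itself a regex.
messages = [
    "Invalid comparison between dtype=int64 and Timestamp",
    "'[<>]' not supported between instances of 'Timestamp' and 'int'",
]
either = "|".join(messages)


def matches(pattern: str, text: str) -> bool:
    """Mimic pytest.raises(match=...), which uses re.search on str(exc)."""
    return re.search(pattern, text) is not None


# Hypothetical exception texts, one per message style.
first = "Invalid comparison between dtype=int64 and Timestamp"
second = "'<' not supported between instances of 'Timestamp' and 'int'"

# The joined pattern accepts both, so the test cannot tell which one was raised.
print(matches(either, first), matches(either, second))
# A single pattern pins the behaviour down.
print(matches(messages[0], first), matches(messages[0], second))
```

With a single pattern the test fails as soon as the message drifts, which is the consistency being asked for.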

jbrockmendel · 2020-02-17T15:24:27Z

Does this give better performance?

After running some timings, I found a surprising and unacceptable 2x slowdown in homogeneous DataFrame ops, so I reverted the edit in dispatch_to_series (which I thought, apparently incorrectly, was not doing anything useful).

These timings are after that reversion (I'll push shortly). We pick up some good ground for Series ops, it's a wash for homogeneous DataFrame ops, and we lose a small (and confusing) amount of ground for heterogeneous DataFrame ops.

In [5]: ser = pd.Series(range(10**7))
In [6]: %timeit ser == ser
6.86 ms ± 295 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- master
4.68 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- PR

In [7]: df = pd.DataFrame({n: range(10**6) for n in range(10)})
In [8]: %timeit df == df
6.53 ms ± 61.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master
6.47 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- PR

In [8]: df[1] = df[1].astype(float)
In [9]: %timeit df == df
12 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    # <-- master
12.8 ms ± 338 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)   # <-- PR
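For anyone wanting to repeat the three comparisons above outside IPython, a rough standalone sketch follows. The helper names (`best_ms`, `build_cases`) and the smaller default sizes are my own choices, not part of the PR; absolute numbers will depend on the machine and on whether numexpr is installed.

```python
import timeit

import numpy as np
import pandas as pd


def best_ms(func, repeat=3, number=5):
    """Return the best per-call wall time in milliseconds."""
    times = timeit.repeat(func, repeat=repeat, number=number)
    return min(times) / number * 1e3


def build_cases(n_rows=10**5, n_cols=10):
    """Build the three shapes timed above: Series, homogeneous and mixed frames."""
    ser = pd.Series(np.arange(n_rows * n_cols))
    homogeneous = pd.DataFrame({n: np.arange(n_rows) for n in range(n_cols)})
    heterogeneous = homogeneous.copy()
    # One float column among ints makes the frame heterogeneous.
    heterogeneous[1] = heterogeneous[1].astype(float)
    return {
        "Series": ser,
        "homogeneous DataFrame": homogeneous,
        "heterogeneous DataFrame": heterogeneous,
    }


if __name__ == "__main__":
    for name, obj in build_cases().items():
        print(f"{name}: {best_ms(lambda: obj == obj):.2f} ms")
```

Running it once on master and once on the branch gives the same kind of side-by-side comparison as the %timeit output.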

jreback · 2020-02-19T00:34:41Z

So the reason we originally wrote eval was to provide a quite significant speedup using multiple chained ops / comparisons, e.g.

https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#expression-evaluation-via-eval

(this is slightly orthogonal to this issue)

However, https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#accelerated-operations is very relevant: we should have asv benchmarks for these, and be sure we are actually using the accelerated options (and likely update this link).
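To make the two linked pages concrete, here is a small sketch of both mechanisms: a chained comparison evaluated in one go through DataFrame.eval, and the "compute.use_numexpr" option that toggles the accelerated element-wise path. The column names and thresholds are arbitrary, and no speedup is claimed for data this small.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 3)), columns=["a", "b", "c"])

# Plain pandas: each comparison and each `&` builds an intermediate result.
plain = (df["a"] > 0) & (df["b"] < 0.5) & (df["c"] > -1)

# eval passes the whole chained expression to the engine at once
# (numexpr when it is installed, otherwise the Python engine).
chained = df.eval("(a > 0) & (b < 0.5) & (c > -1)")
same_result = bool((np.asarray(plain) == np.asarray(chained)).all())

# The accelerated path for ordinary ops can be switched off temporarily;
# the result must not change, only (possibly) the speed.
with pd.option_context("compute.use_numexpr", False):
    without_numexpr = df > 0
option_is_transparent = without_numexpr.equals(df > 0)

print(same_result, option_is_transparent)
```

An asv benchmark along the lines suggested above could time the same operation with the option on and off.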

jbrockmendel · 2020-02-19T01:32:07Z

Yah, I think it was @TomAugspurger who recently discussed being more careful about what we pass to numexpr. I think in particular the tests can be improved to check not just "was numexpr called" but "was it called without falling back to python".

This is a blocker for the branch that implements frame-with-frame ops blockwise, so I'd like to minimize side-tracking.
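The difference between "was numexpr called" and "was it called without falling back to python" can be illustrated with a toy dispatcher. Everything below is hypothetical (`fast_compare` stands in for an accelerated backend, `Recorder` for whatever bookkeeping a test would use); it is not how pandas implements or tests this.

```python
import operator

import numpy as np


class Recorder:
    """Counts fast-path attempts separately from fast-path successes."""

    def __init__(self):
        self.attempted = 0
        self.succeeded = 0


def fast_compare(op, left, right):
    """Pretend accelerated path: numeric ndarrays only."""
    if left.dtype.kind not in "iuf" or right.dtype.kind not in "iuf":
        raise TypeError("dtype not supported by the fast path")
    return op(left, right)


def evaluate(op, left, right, recorder):
    """Try the fast path; silently fall back to plain numpy on TypeError."""
    recorder.attempted += 1
    try:
        result = fast_compare(op, left, right)
    except TypeError:
        return op(left, right)
    recorder.succeeded += 1
    return result


rec = Recorder()
strings = np.array(["a", "b"], dtype=object)
evaluate(operator.eq, strings, strings, rec)
# A test asserting only `attempted == 1` would pass here,
# even though the fast path did none of the work.
print(rec.attempted, rec.succeeded)
```

A stricter test would assert on the success counter for inputs the fast path is expected to handle.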

jreback · 2020-02-22T15:51:57Z

pandas/core/ops/array_ops.py

@@ -150,6 +152,8 @@ def na_arithmetic_op(left, right, op, str_rep: str):
    try:
        result = expressions.evaluate(op, str_rep, left, right)
    except TypeError:
+        if is_cmp:


What hits this AND is is_cmp? Can you add a comment?

jbrockmendel (Feb 24, 2020): So we have 7 cases that get here. Four of these are something like ndarray[float] > datetime, and would pass if we removed this is_cmp check. Of the remaining three:

Two involve passing a DatetimeArray to masked_arith_op, which expects an ndarray. This should probably be prevented.

One involves comparisons of complex numbers, which numpy does differently than python (which is its own PITA). Falling through to the masked op would break test_bool_flex_frame_complex_dtype.
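The control flow being discussed can be sketched roughly as below. The diff excerpt does not show the body of the `if is_cmp:` branch, so re-raising there is an assumption on my part, inferred from the remark that falling through to the masked op would break a test. `fast_evaluate` and `masked_fallback` are hypothetical stand-ins, not the pandas functions.

```python
import operator

import numpy as np


def fast_evaluate(op, left, right):
    """Hypothetical fast path that rejects object-dtype input."""
    if left.dtype == object:
        raise TypeError("fast path cannot handle object dtype")
    return op(left, right)


def masked_fallback(op, left, right):
    """Hypothetical slow path: apply op only where left is not None."""
    mask = np.array([x is not None for x in left], dtype=bool)
    result = np.full(left.shape, np.nan, dtype=object)
    result[mask] = [op(x, right) for x in left[mask]]
    return result


def na_op_sketch(op, left, right, is_cmp=False):
    try:
        return fast_evaluate(op, left, right)
    except TypeError:
        if is_cmp:
            # Assumed behaviour: comparisons do not use the masked fallback,
            # so the original error propagates to the caller.
            raise
        return masked_fallback(op, left, right)


values = np.array([1, None, 3], dtype=object)
print(na_op_sketch(operator.add, values, 1))
try:
    na_op_sketch(operator.gt, values, 1, is_cmp=True)
except TypeError as err:
    print("comparison re-raised:", err)
```

Under this reading, arithmetic keeps its forgiving fallback while comparisons surface the TypeError to whoever called them.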

jreback · 2020-02-22T15:52:27Z

pandas/core/ops/array_ops.py

@@ -250,7 +257,10 @@ def comparison_op(
        op_name = f"__{op.__name__}__"
        method = getattr(lvalues, op_name)
        with np.errstate(all="ignore"):
-            res_values = method(rvalues)
+            res_values = na_arithmetic_op(lvalues, rvalues, op, str_rep, is_cmp=True)


Why is this not handled in na_arithmetic_op? Seems odd to handle it here.

jreback · 2020-02-26T12:53:48Z

thanks.

use numexpr for Series comparisons

14b455e

jbrockmendel mentioned this pull request Feb 17, 2020

REF: dont use numexpr in places where it doesnt help. #31984

Closed

simonjayhawkins reviewed Feb 17, 2020

simonjayhawkins added the Numeric Operations (arithmetic, comparison, and logical operations) and Performance (memory or execution speed performance) labels Feb 17, 2020

jbrockmendel added 2 commits February 17, 2020 07:05

Merge branch 'master' of https://github.com/pandas-dev/pandas into ex…

b3072e5

…pr-series

revert

6273e7d

jbrockmendel added 3 commits February 17, 2020 07:26

update exception message

3a55dc1

Merge branch 'master' of https://github.com/pandas-dev/pandas into ex…

e37ba75

…pr-series

Merge branch 'master' of https://github.com/pandas-dev/pandas into ex…

94fddbb

…pr-series

jreback added this to the 1.1 milestone Feb 22, 2020

jreback reviewed Feb 22, 2020

jreback requested changes Feb 22, 2020

jbrockmendel added 3 commits February 23, 2020 12:54

Merge branch 'master' of https://github.com/pandas-dev/pandas into ex…

00e2069

…pr-series

handle it all inside na_arithmetic_op

23976f7

Merge branch 'master' of https://github.com/pandas-dev/pandas into ex…

c1d314d

…pr-series

jreback approved these changes Feb 26, 2020

jreback merged commit e6bd49f into pandas-dev:master Feb 26, 2020

jbrockmendel deleted the expr-series branch February 26, 2020 15:36

roberthdevries pushed a commit to roberthdevries/pandas that referenced this pull request Mar 2, 2020

use numexpr for Series comparisons (pandas-dev#32047)

0a54819

simonjayhawkins mentioned this pull request Aug 14, 2020

BUG: pandas 1.1.0 introduces string comparison bug for larger datasets #35700

Closed
