
Re-evaluate the minimum number of elements to use numexpr for elementwise ops #40500


Closed
jorisvandenbossche opened this issue Mar 18, 2021 · 10 comments · Fixed by #40609
Labels
Numeric Operations (Arithmetic, Comparison, and Logical operations), Performance (Memory or execution speed performance)
Milestone: 1.3

Comments

@jorisvandenbossche
Member

Currently we have a MIN_ELEMENTS set at 10,000:

# the minimum prod shape that we will use numexpr
_MIN_ELEMENTS = 10000

However, while running lots of performance comparisons recently, I have been noticing that numexpr still shows considerable overhead compared to numpy at that array size.

I did a few specific timings for a few ops comparing numpy and numexpr for a set of different array sizes:

[Figure: log-log plot of timing vs array size for numpy and numexpr, one panel per op (+, *, ==, <=)]

Code used to create the plot
import operator

import numpy as np
import pandas as pd
import numexpr as ne

import seaborn as sns

results = []

for s in [10**3, 10**4, 10**5, 10**6, 10**7, 10**8]:
    arr1 = np.random.randn(s)
    arr2 = np.random.randn(s)
    
    for op_str, op in [("+", operator.add), ("*", operator.mul), ("==", operator.eq), ("<=", operator.le)]:

        # IPython's %timeit -o returns a TimeitResult with .average and .stdev
        res_ne = %timeit -o ne.evaluate(f"a {op_str} b", local_dict={"a": arr1, "b": arr2}, casting="safe")
        res_np = %timeit -o op(arr1, arr2)

        results.append({"size": s, "op": op_str, "engine": "numexpr", "timing": res_ne.average, "timing_stdev": res_ne.stdev})
        results.append({"size": s, "op": op_str, "engine": "numpy", "timing": res_np.average, "timing_stdev": res_np.stdev})
        

df = pd.DataFrame(results)

fig = sns.relplot(data=df, x="size", y="timing", hue="engine", col="op", kind="line", col_wrap=2)
fig.set(xscale='log', yscale='log')
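
For reference, roughly the same comparison can be reproduced outside IPython with the standard-library timeit module. This is a minimal sketch for a single size and op, not the exact code used for the plot above:

import timeit

import numpy as np
import numexpr as ne

size = 10**5
a = np.random.randn(size)
b = np.random.randn(size)

# best-of-5 average time per call, in seconds
t_ne = min(timeit.repeat(lambda: ne.evaluate("a + b", local_dict={"a": a, "b": b}, casting="safe"),
                         repeat=5, number=100)) / 100
t_np = min(timeit.repeat(lambda: a + b, repeat=5, number=100)) / 100

print(f"numexpr: {t_ne:.2e} s  numpy: {t_np:.2e} s  ratio: {t_ne / t_np:.2f}")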

So in general, numexpr is not that much faster for the large arrays. More importantly, it still has significant overhead compared to numpy for sizes up to around 1e5-1e6 elements, while the current minimum number of elements is 1e4.

Further, this might depend on your specific hardware and library versions (this was run on my Linux laptop with 8 cores, using the latest versions of numpy and numexpr), so it is always hard to pick a default that suits everyone.

But based on the analysis above, I would propose raising the minimum from 1e4 to 1e5 (or maybe even 1e6).
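
For anyone who wants to experiment with different cut-offs locally before we settle on a value, here is a minimal sketch. It pokes at the private _MIN_ELEMENTS attribute of pandas.core.computation.expressions (so treat it as a debugging aid only; the attribute may move between versions) and also uses the public compute.use_numexpr option:

import numpy as np
import pandas as pd
import pandas.core.computation.expressions as expr

df = pd.DataFrame(np.random.randn(20000, 100))  # 2e6 elements in a single float64 block

print(expr._MIN_ELEMENTS)  # 10000 at the time of writing

# raise the (private) threshold above the block size so plain numpy is used
expr._MIN_ELEMENTS = 10**7
%timeit df <= 3.0

# or disable numexpr entirely through the public option
pd.set_option("compute.use_numexpr", False)
%timeit df <= 3.0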

@jorisvandenbossche jorisvandenbossche added the Performance and Numeric Operations labels Mar 18, 2021
@jorisvandenbossche
Member Author

An example from the arithmetic.IntFrameWithScalar.time_frame_op_with_scalar benchmark, which basically used the following code snippet:

import numpy as np
import pandas as pd

pd.options.mode.data_manager = "array"  # use the ArrayManager

dtype = np.float64
arr = np.random.randn(20000, 100)
df = pd.DataFrame(arr.astype(dtype))
scalar = 3.0

Using the current MIN_ELEMENTS of 1e4:

In [3]: %timeit df <= scalar
9.58 ms ± 945 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Changing the MIN_ELEMENTS to 1e5 (which means that in this case, numpy will be used):

In [3]: %timeit df <= scalar
2.92 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So here the overhead is very clear. This is especially true for the ArrayManager, which does the ops column-by-column and thus pays the numexpr overhead for each column again.

Using BlockManager, the above benchmark doesn't change, because it still uses numexpr (the whole block size is still above 1e5 elements).
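
To make the size arithmetic behind that explicit, a small sketch of the element counts each manager hands to the expression code (plain arithmetic, no pandas internals):

rows, cols = 20000, 100

# ArrayManager: each op sees one 1-D column at a time
per_column = rows        # 20_000 elements: above the current 1e4 cut-off, below 1e5/1e6
# BlockManager: the whole consolidated float64 block is passed in one go
per_block = rows * cols  # 2_000_000 elements: above any of the proposed cut-offs

print(per_column, per_block)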

@jbrockmendel
Member

If these results are representative, I'd question the value of using numexpr at all.

@jbrockmendel
Member

Following the same code used to create the plot in the OP:

df2 = df.set_index(["size", "op", "engine"])
df3 = df2['timing']
df4 = df3.unstack('engine')

In [32]: df4['numexpr'] / df4['numpy']
Out[32]: 
size       op
1000       *     11.550891
           +     11.219127
           <=    13.992761
           ==    14.566631
10000      *     12.232232
           +     12.482489
           <=    16.391351
           ==    16.528711
100000     *      1.901619
           +      2.050886
           <=     2.245838
           ==     2.364996
1000000    *      0.980222
           +      0.978871
           <=     0.379410
           ==     0.398617
10000000   *      0.937947
           +      0.931268
           <=     0.719618
           ==     0.698704
100000000  *      0.513396
           +      0.555649
           <=     0.759315
           ==     0.688430
dtype: float64

These make numexpr look better than it did in the OP, though that might just be the log scale.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Mar 18, 2021

Thanks for running it as well! Numbers from different environments are useful.

These make numexpr look better than it did in the OP, though that might just be the log scale.

Yeah, the log scale was mainly there to make the difference at the smaller sizes visible (without it that wouldn't show up). There is indeed still an advantage for numexpr at the larger sizes, although the differences I see locally are smaller.

But to conclude, I think your numbers support the same conclusion: 1e4 is too small a threshold, and it should be at least 1e5 or even 1e6.

@rhshadrach
Member

Results from my laptop (Core i7-10850H) are about the same:
size       op
1000       *     11.379373
           +     12.133413
           <=    14.257864
           ==    14.337322
10000      *      9.552697
           +      9.197450
           <=    12.201087
           ==    12.175155
100000     *      1.483704
           +      1.404112
           <=     1.854857
           ==     1.892871
1000000    *      0.822717
           +      0.822774
           <=     0.464316
           ==     0.484865
10000000   *      1.166277
           +      1.155721
           <=     0.663739
           ==     0.630829
100000000  *      0.767271
           +      0.781812
           <=     0.635647
           ==     0.644891
dtype: float64

@jreback
Contributor

jreback commented Mar 19, 2021

the original benchmarks for using numexpr were a number of years ago

it's certainly possible that numpy has improved in the interim

so +1 on raising the min elements

@jorisvandenbossche
Member Author

OK, based on the numbers above, 1e6 seems a safer minimum than 1e5. I updated the PR to reflect that: #40502

@rhshadrach
Member

Is this good to close @jorisvandenbossche?

@rhshadrach
Member

The issue in #40502 appears to be in test_expressions. Some of the DataFrames tested there used to be 300KiB but were changed to 30MiB. It looks like many copies are being made in setup_method, resulting in large memory usage.

@jorisvandenbossche
Member Author

Yeah, I checked that at the time of doing the PR and thought 30MB wouldn't be a big deal, but of course it gets created and copied multiple times (and it is also already created during test discovery and kept alive for the full test run), so I underestimated the impact.

Next attempt: #40609

@jreback jreback added this to the 1.3 milestone Mar 24, 2021