
PERF: Significant speed difference between arr.mean() and arr.values.mean() for common dtype columns #34773


Closed
2 of 3 tasks
ianozsvald opened this issue Jun 14, 2020 · 10 comments
Labels
Performance Memory or execution speed performance


@ianozsvald
Contributor

ianozsvald commented Jun 14, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


I'm seeing a significant variance in timings for common math operations (e.g. mean, std, max) on a large pandas Series vs the underlying NumPy array. A code example is shown below with 1 million elements and a roughly 10x speed difference. The screenshot below uses 10 million elements.

I've generated a testing module (https://github.com/ianozsvald/dtype_pandas_numpy_speed_test) which several people have tried on Intel & AMD hardware: ianozsvald/dtype_pandas_numpy_speed_test#1

This module confirms the general trend that all of these operations are faster on the underlying NumPy array (not surprising, as that avoids the dispatch machinery), but for float operations the speed hit when using pandas seems extreme:

[Screenshot: timing graphs (10 million elements) comparing pandas vs NumPy operations]

Code Sample, a copy-pastable example

A Python module exists in this repo, along with reports from several other users (with screenshots of their graphs); the same general behaviour is seen across different machines: https://github.com/ianozsvald/dtype_pandas_numpy_speed_test

# note this is copied from my README linked above.
# paste into IPython or a Notebook
import pandas as pd
import numpy as np
arr = pd.Series(np.ones(shape=1_000_000))
arr.values.dtype                                                                                                                                                         
Out[]: dtype('float64')

arr.values.mean() == arr.mean()                                                                                                                                           
Out[]: True

# call arr.mean() vs arr.values.mean(); note the circa 10x speed difference
# (roughly 4 ms vs 0.4 ms)
%timeit arr.mean()
4.59 ms ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit arr.values.mean()
485 µs ± 5.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# note that the arr.values dereference is very cheap (nanoseconds)
%timeit arr.values 
456 ns ± 0.828 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Problem description

Is this slowdown expected? It feels extreme, but perhaps my testing methodology is flawed. I expected float and integer math to run at approximately the same speed, but instead we see a significant slowdown for pandas float operations versus their NumPy counterparts.
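As later comments in this thread explain, the core difference is NaN handling. A minimal illustration (not from the original report): pandas skips NaN by default, while NumPy's plain `ndarray.mean` propagates it.

```python
# Illustration: Series.mean() uses skipna=True by default, so it must
# handle NaNs; ndarray.mean() does no such check and propagates NaN.
import numpy as np
import pandas as pd

data = np.array([1.0, 2.0, np.nan, 4.0])
s = pd.Series(data)

print(s.mean())          # 2.333... -- NaN excluded (skipna=True by default)
print(data.mean())       # nan     -- NumPy propagates the NaN
print(np.nanmean(data))  # 2.333... -- NumPy's NaN-skipping equivalent
```

So the fair NumPy comparison for a float Series is `np.nanmean`, not `ndarray.mean`.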

I've added some extra graphs to the repository linked above.


Output of pd.show_versions()

In [2]: pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.6.7-050607-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.1.1.post20200529
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@ianozsvald ianozsvald added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 14, 2020
@ianozsvald ianozsvald changed the title PERF: Significant difference between arr.mean() and arr.values.mean() for common dtype columns PERF: Significant speed difference between arr.mean() and arr.values.mean() for common dtype columns Jun 14, 2020
@jreback
Contributor

jreback commented Jun 14, 2020

on your methodology, be sure to time both with and without bottleneck

In [18]: import pandas as pd 
    ...: import numpy as np 
    ...: s = pd.Series(np.ones(shape=1_000_000))                                                                                                                                                                    

In [19]: pd.options.compute.use_bottleneck=False                                                                                                                                                                    

In [20]: %timeit s.mean()                                                                                                                                                                                           
2.83 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [21]: pd.options.compute.use_bottleneck=True                                                                                                                                                                     

In [22]: %timeit s.mean()                                                                                                                                                                                           
1.21 ms ± 9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [23]: %timeit s.to_numpy().mean()                                                                                                                                                                                
365 µs ± 5.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [24]: %prun s.mean()                                                                                                                                                                                             
         99 function calls in 0.002 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.001    0.001 {built-in method bottleneck.reduce.nanmean}
        1    0.000    0.000    0.002    0.002 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 nanops.py:155(_has_infs)
        4    0.000    0.000    0.000    0.000 _ufunc_config.py:39(seterr)
        1    0.000    0.000    0.002    0.002 series.py:4148(_reduce)
        1    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.000    0.000    0.001    0.001 nanops.py:61(_f)
        1    0.000    0.000    0.001    0.001 nanops.py:97(f)

i think it should be clear that pandas' mean is doing a lot more work than NumPy's by

  • checking & dispatching on the appropriate dtype (e.g. we also take means of datetimes, for example)
  • checking for infinity (the slowdown here)

I suppose we don't care about inf checking in this case. I think this was here historically because, depending on some options, we may treat infs as NaNs and exclude them.

happy to take a PR here to remove that checking.
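For illustration, the "inf check" being discussed amounts to a full O(n) scan of the array even when no inf is present (a naive sketch using `np.isinf`; pandas actually uses a Cython helper, `pd._libs.lib.has_infs_f8`, timed in the next comment):

```python
# Naive infinity check: one full pass over the array. This cost is
# paid on every reduction even for perfectly clean data, which is
# why it shows up in the %prun profile of s.mean() above.
import numpy as np

def has_infs(arr: np.ndarray) -> bool:
    # O(n) scan; allocates a temporary boolean array as well
    return bool(np.isinf(arr).any())

print(has_infs(np.ones(1_000_000)))       # False
print(has_infs(np.array([1.0, np.inf])))  # True
```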

@jreback jreback added Performance Memory or execution speed performance and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 14, 2020
@jreback
Contributor

jreback commented Jun 14, 2020

naive checking below; we are currently doing the equivalent of In [7]

In [3]: %timeit s.values[s.values==np.inf]                                                                                                                                                                          
466 µs ± 7.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: pd._libs.lib.has_infs_f8(s.values)                                                                                                                                                                          
Out[5]: False

In [7]: %timeit pd._libs.lib.has_infs_f8(s.values)                                                                                                                                                                  
1.53 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@jreback jreback added this to the 1.1 milestone Jun 14, 2020
@jorisvandenbossche
Member

@ianozsvald when comparing to NumPy for floats, you should actually compare with np.nanmean instead of mean (as we skip NaNs by default):

In [6]: pd.options.compute.use_bottleneck=False  

In [7]: %timeit arr.mean()   
2.81 ms ± 63.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit arr.values.mean() 
376 µs ± 56.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: %timeit np.nanmean(arr.values)   
2.88 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [10]: pd.options.compute.use_bottleneck=True 

In [11]: %timeit arr.mean()  
1.14 ms ± 3.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

You can see above that, compared to np.nanmean, NumPy and pandas run at more or less the same speed (at least on my laptop), and when using bottleneck, pandas is faster than NumPy.
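A rough sketch of what a nanmean-style function does internally (illustrative names and the standard mask-and-count approach; not pandas' actual nanops code) shows where the extra work comes from:

```python
# Sketch of a "nanfunc" mean: an extra O(n) isnan pass plus a masked
# substitution, neither of which plain ndarray.mean has to do.
import numpy as np

def nanmean_sketch(arr: np.ndarray) -> float:
    mask = np.isnan(arr)                    # extra pass to find missing values
    total = np.where(mask, 0.0, arr).sum()  # sum with NaNs zeroed out
    count = arr.size - mask.sum()           # count of valid elements
    return total / count

a = np.array([1.0, np.nan, 3.0])
print(nanmean_sketch(a))  # 2.0
print(np.nanmean(a))      # 2.0
```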

@ianozsvald
Contributor Author

Many thanks for the comprehensive replies; I'll digest these and get back to you. I hadn't realised that nanmean was used behind the scenes in pandas, nor that bottleneck is now used in more places. Cheers!

@jorisvandenbossche
Member

We're not actually using np.nanmean but our own implementation, which is (I suppose) doing something very similar to NumPy's.

So the main reason pandas is slower than NumPy here is that we skip missing values by default, which NumPy doesn't do.

BTW, a "nullable float" dtype is coming (#34307), similar to the nullable integer dtype, where pd.NA is used instead of NaN as the missing value indicator (using a mask under the hood), and it is actually faster than the "nanfunc" approach:

In [1]: arr = pd.Series(np.ones(shape=1_000_000))                                                                                                                                                                  

In [2]: arr2 = arr.astype("Float64")  

In [3]: %timeit arr.sum()  
1.93 ms ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit arr2.sum() 
978 µs ± 117 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

(showing "sum" instead of "mean", because for mean we don't yet have the faster "masked" implementation, #34754)
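To illustrate why the masked approach can be cheaper (a hypothetical sketch with illustrative names, not pandas internals): the validity mask is already stored alongside the values, so the reduction indexes with it directly instead of recomputing isnan on every call.

```python
# Sketch of a masked ("nullable") reduction: the mask marking missing
# slots already exists, so no isnan scan is needed at reduction time.
import numpy as np

def masked_sum(values: np.ndarray, mask: np.ndarray) -> float:
    # mask is True where the value is missing (like pd.NA's backing mask);
    # whatever sits behind a masked slot is simply never touched
    return float(values[~mask].sum())

values = np.array([1.0, 99.0, 3.0])      # 99.0 sits behind a masked slot
mask = np.array([False, True, False])
print(masked_sum(values, mask))          # 4.0
```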

@TomAugspurger
Contributor

Moving off the 1.1 milestone.

Is there anything concrete to do here?

@TomAugspurger TomAugspurger removed this from the 1.1 milestone Jun 17, 2020
@ianozsvald
Contributor Author

Hi @TomAugspurger. I'm not sure there's anything to be done here: dropping to NumPy and calling mean is fastest if you know your data has no NaNs, and installing bottleneck offers a good improvement, but I don't think pandas is at fault. I've learned a couple of things, notably that pandas' mean != NumPy's mean for the same dtype.

@jreback
Contributor

jreback commented Jun 17, 2020

i’ll retract my claim that checking for inf matters on the pandas side (it doesn’t matter much)

though we should remove that extra code that we have in cython i think

@jorisvandenbossche
Member

Yeah, I don't think there is anything actionable right now (the inf checking is only done on the result, I think).
The speed difference is simply due to the fact that we do more work, i.e. handling missing values (although some things in nanops.py could perhaps be optimized), and the future nullable float dtypes partly close this performance gap.

@mroeschke
Member

Thanks for the issue, but as discussed, this performance difference is expected and there is no action to be taken as of now. Closing.
