PERF: Significant speed difference between `arr.mean()` and `arr.values.mean()` for common dtype columns #34773
Comments
on your methodology, be sure to time with and w/o bottleneck
i think it should be clear that pandas mean is doing a lot more work than numpy.
I suppose we don't care about inf checking in this case. I think this was here historically because we may (depending on some options) treat these as NaNs and exclude them. Happy to take a PR here to remove that checking.

… naive checking (we are doing [7]) …
@ianozsvald when comparing to numpy for floats, you should actually compare with `np.nanmean`, since pandas skips missing values by default. You can see in the above that compared to `np.nanmean` …
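For context, a minimal sketch of the kind of comparison being suggested (an IPython session; the array size and contents are assumptions, and timings will vary by machine):

```python
import numpy as np
import pandas as pd

arr = np.random.random(1_000_000)  # float64, no NaNs
ser = pd.Series(arr)

# pandas reductions skip missing values by default, so the
# like-for-like NumPy comparison is np.nanmean, not np.mean.
%timeit ser.mean()        # NaN-aware (pandas default)
%timeit np.mean(arr)      # no NaN handling
%timeit np.nanmean(arr)   # NaN-aware NumPy equivalent
```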
Many thanks for the comprehensive replies, I'll digest these and get back to you. I hadn't realised that …
We're not actually using `np.nanmean` itself … So the main reason pandas is slower compared to numpy is that we skip missing values by default, which numpy doesn't do. BTW, a "nullable float" dtype is coming (#34307), similar to the nullable integer dtype, where `pd.NA` is used instead of NaN as the missing value indicator (using a mask under the hood), and that is actually faster than the "nanfunc" approach:
(showing "sum" instead of "mean", because for mean we don't yet have the faster "masked" implementation, #34754)
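As a rough sketch of the masked-vs-nanfunc comparison using the nullable integer dtype that already exists (the array size is an assumption; the nullable float dtype of #34307 was not yet released at this point):

```python
import numpy as np
import pandas as pd

values = np.arange(1_000_000)

ser_nan = pd.Series(values, dtype="float64")   # NaN-based missing values
ser_masked = pd.Series(values, dtype="Int64")  # pd.NA + mask under the hood

%timeit ser_nan.sum()     # goes through the "nanfunc" path
%timeit ser_masked.sum()  # masked implementation
```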
Moving off the 1.1 milestone. Is there anything concrete to do here?
Hi @TomAugspurger. I'm not sure there's anything to be done here - dropping to NumPy and calling `.values.mean()` …
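A sketch of that workaround, assuming NaN-skipping is not needed:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.random.random(1_000_000))

# Drop to the NumPy layer before reducing; this bypasses the
# pandas NaN-handling and dispatch machinery entirely.
ser.to_numpy().mean()  # equivalent to ser.values.mean()
```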
I'll retract my claim that checking for inf matters on the pandas side (it doesn't matter much), though we should remove that extra code that we have in Cython, I think.
Yeah, I don't think there is anything actionable right now (the inf checking is only done on the result, I think).
Thanks for the issue, but as mentioned it appears this difference in performance is expected and there is no action to be taken as of now. Closing.
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
I'm seeing a significant variance in timings for common math operations (e.g. `mean`, `std`, `max`) on a large pandas `Series` vs the underlying NumPy `array`. A code example is shown below with 1 million elements and a 10x speed difference. The screenshot below uses 10 million elements.

I've generated a testing module (https://github.com/ianozsvald/dtype_pandas_numpy_speed_test) which several people have tried on Intel & AMD hardware: ianozsvald/dtype_pandas_numpy_speed_test#1
This module confirms the general trend that all of these operations are faster on the underlying NumPy array (not surprising, as it avoids the dispatch machinery), but for float operations the speed hit using pandas seems to be extreme:
Code Sample, a copy-pastable example
A Python module exists in this repo along with reports from several other users with screenshots of their graphs; the same general behaviour is seen across different machines: https://github.com/ianozsvald/dtype_pandas_numpy_speed_test
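As a minimal sketch of the kind of timing code the linked module runs (the exact sizes and the use of `%timeit` are assumptions):

```python
import numpy as np
import pandas as pd

arr = np.random.random(1_000_000)  # float64
ser = pd.Series(arr)

%timeit ser.mean()          # pandas Series method
%timeit ser.values.mean()   # underlying NumPy array; ~10x faster per this report
```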
Problem description
Is this slow-down expected? The slowdown feels extreme, but perhaps my testing methodology is flawed? I expect float & integer math to operate at approximately the same speed, but instead we see a significant slow-down for pandas float operations vs their NumPy counterparts.
I've added some extra graphs (e.g. `std`, and `mean` to contrast against the picture shown above in this report).

Expected Output
Output of `pd.show_versions()`
In [2]: pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.6.7-050607-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.1.1.post20200529
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None