PERF: DataFrame.mean() is slow with datetime columns #31075
Comments
Can confirm this reproduces on master, thanks for the report!
Does performance improve if you pass numeric_only=False to DataFrame.mean?
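For context, a small sketch (frame shape and column names are my own, not from the report) of how the numeric_only argument changes which columns participate in the reduction:

```python
import numpy as np
import pandas as pd

# Illustrative frame: two float columns plus a datetime64 column.
df = pd.DataFrame({
    "a": np.random.rand(1_000),
    "b": np.random.rand(1_000),
    "ts": pd.date_range("2020-01-01", periods=1_000),
})

# With numeric_only=True the datetime column is skipped entirely, so the
# reduction never has to build a mixed-dtype (object) array.
means = df.mean(numeric_only=True)
print(list(means.index))  # ['a', 'b']
```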
This is because we first try to apply the operation on the "values" of the full dataframe (Lines 7930 to 7944 in 208bb41), and it's creating those values (which creates an object array) that takes a lot of time (and then it also tries to reduce this object array).
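The object-array detour described above can be seen directly (a minimal sketch; column names are my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": np.arange(3, dtype="float64"),
    "ts": pd.date_range("2020-01-01", periods=3),
})

# A homogeneous float frame exposes a fast float64 ndarray ...
print(df[["x"]].values.dtype)  # float64
# ... but float64 and datetime64 have no common numeric dtype, so .values
# for the full frame falls back to a Python-object array, which is
# expensive both to build and to reduce over.
print(df.values.dtype)  # object
```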
#29941 may improve this |
Seems to be improved on master; the performance issue seems to be fixed.
Edit: it seems to be improved for the given minimal example with datetime objects, but it's still broken if the dataframe contains strings. Also, I am not sure the result is correct. Is this really only a performance degradation, or is the result wrong too? With the minimal example defined above:
Behavior on master is the same.
Oh, I think I understand now what's going on. Things do not seem to be completely fixed on master; I have edited my previous post. But this might be a different issue. Pandas tries to sum all elements and then tries to convert the result to numbers. For strings, that sum concatenates them, and the failure is only noticed afterwards.
The example above returns:
Another example which causes a slowdown because the string needs to be concatenated before Pandas notices that this is not a number:
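The commenter's exact snippet did not survive the page extraction; a sketch of the underlying effect, using NumPy directly (the array contents are my own illustration):

```python
import numpy as np

# Reducing an object array of strings "succeeds": the reduction
# concatenates every element before anything notices the result is not a
# number, so the cost grows with the total string length.
arr = np.array(["ab"] * 4, dtype=object)
print(arr.sum())  # abababab
```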
All examples above tested on current master a43c42c.
Note that there is also a Stack Overflow Q&A about this.
One more thing: the minimal example seems to be fast on master, but why is the datetime column excluded from the result?
Yeah, there are a bunch of issues about this; look for issues tagged "nuisance columns". This shouldn't be that hard to fix, it just needs someone to make it a priority. Contributions welcome.
It should, for now. That behavior is deprecated, though, so the dt64 columns will be included once that deprecation is enforced.
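Per-column reductions on datetime64 data already work, which is the value DataFrame.mean() will report for such columns once the deprecation is enforced (a minimal sketch; the dates are my own):

```python
import pandas as pd

ts = pd.Series(pd.date_range("2020-01-01", periods=3))
# Series.mean on datetime64 data is supported and returns the midpoint
# Timestamp; under the current (deprecated) frame-level behavior, the
# column is silently dropped instead.
print(ts.mean())  # 2020-01-02 00:00:00
```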
Since the original issue seems fixed, could use an asv benchmark. |
Relevant benchmark would go in asv_bench.benchmarks.stat_ops, possibly part of or adjacent to FrameOps. Based on a quick pass I don't see any mixed-dtype reduction ASVs.
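A sketch of what such a benchmark could look like; the class and method names here are assumptions modeled on FrameOps, not existing code in asv_bench/benchmarks/stat_ops.py:

```python
import numpy as np
import pandas as pd


class FrameMixedDtypesOps:
    # asv-style benchmark (hypothetical): each listed op is timed on a
    # frame that mixes numeric and datetime64 columns.
    params = [["mean", "std", "var"]]
    param_names = ["op"]

    def setup(self, op):
        n = 100_000
        self.df = pd.DataFrame({
            "f": np.random.randn(n),
            "i": np.random.randint(0, n, size=n),
            "ts": pd.date_range("2000-01-01", periods=n, freq="s"),
        })

    def time_op(self, op):
        getattr(self.df, op)(numeric_only=True)
```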
Code Sample, a copy-pastable example if possible
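The original copy-pastable sample did not survive the page extraction; below is a reconstruction of the kind of frame the report describes (column names and sizes are illustrative, not the reporter's):

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    "a": np.random.randn(n),
    "b": np.random.randn(n),
    "ts": pd.date_range("2020-01-01", periods=n, freq="min"),
})

# Reported slow path: reducing the whole mixed-dtype frame at once.
whole = df.mean(numeric_only=True)

# Fast path: reducing each numeric column individually.
per_column = pd.Series({c: df[c].mean() for c in ["a", "b"]})

# The answers agree; only the timing differs.
print(np.allclose(whole, per_column))  # True
```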
Problem description
When a DataFrame contains a datetime64 column, the time taken to run the .mean() method for the whole DataFrame is thousands of times longer than the time taken to run the .mean() method on each column individually.
Expected Output
Answer is correct; just too slow.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.1.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.13.3
pytz : 2017.2
dateutil : 2.7.3
pip : 19.2.3
setuptools : 44.0.0.post20200106
Cython : 0.25.2
pytest : 3.0.7
hypothesis : None
sphinx : 1.5.6
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : 3.7.3
html5lib : 0.999
pymysql : 0.9.3
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 5.3.0
pandas_datareader: None
bs4 : 4.6.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 3.7.3
matplotlib : 2.0.2
numexpr : 2.6.2
odfpy : None
openpyxl : 2.4.7
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 0.19.1
sqlalchemy : 1.3.8
tables : 3.3.0
xarray : None
xlrd : 1.0.0
xlwt : 1.2.0
xlsxwriter : 0.9.6
I asked about this on Stack Overflow (https://stackoverflow.com/questions/59759107/how-to-avoid-poor-performance-of-pandas-mean-with-datetime-columns); one respondent hazarded a guess that the issue may lie with these lines of code (but no rigorous debugging was done).
https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/arrays/datetimes.py#L601-L603