
PERF: DataFrame.mean() is slow with datetime columns #31075


Closed
richierocks opened this issue Jan 16, 2020 · 11 comments · Fixed by #55939
Labels
Benchmark: Performance (ASV) benchmarks
Reduction Operations: sum, mean, min, max, etc.

Comments

@richierocks

richierocks commented Jan 16, 2020

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
from datetime import datetime

# A DataFrame containing a datetime64 column
n = 1000000
df = pd.DataFrame({
    "x": np.random.normal(0.0, 1.0, n),
    "d": pd.date_range(datetime.today(), periods=n, freq="1H").tolist()
})

# Calculating the mean of individual columns is pretty much instantaneous
%timeit df["x"].mean()
## 1000 loops, best of 3: 1.35 ms per loop
%timeit df["d"].mean()
## 100 loops, best of 3: 2.91 ms per loop

# Calling the DataFrame's .mean() method takes a really long time
%timeit df.mean()
## 1 loop, best of 3: 9.23 s per loop

Problem description

When a DataFrame contains a datetime64 column, running the .mean() method on the whole DataFrame takes thousands of times longer than running .mean() on each column individually.

Expected Output

Answer is correct; just too slow.

Output of pd.show_versions()


INSTALLED VERSIONS

commit : None
python : 3.6.1.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.13.3
pytz : 2017.2
dateutil : 2.7.3
pip : 19.2.3
setuptools : 44.0.0.post20200106
Cython : 0.25.2
pytest : 3.0.7
hypothesis : None
sphinx : 1.5.6
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : 3.7.3
html5lib : 0.999
pymysql : 0.9.3
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 5.3.0
pandas_datareader: None
bs4 : 4.6.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 3.7.3
matplotlib : 2.0.2
numexpr : 2.6.2
odfpy : None
openpyxl : 2.4.7
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 0.19.1
sqlalchemy : 1.3.8
tables : 3.3.0
xarray : None
xlrd : 1.0.0
xlwt : 1.2.0
xlsxwriter : 0.9.6

I asked about this on Stack Overflow (https://stackoverflow.com/questions/59759107/how-to-avoid-poor-performance-of-pandas-mean-with-datetime-columns); one respondent hazarded a guess that the issue may lie with these lines of code (though no rigorous debugging was done).

https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/arrays/datetimes.py#L601-L603

@MarcoGorelli (Member)

Can confirm this reproduces on master, thanks for the report!

@jbrockmendel (Member)

Does performance improve if you pass numeric_only=False to DataFrame.mean?

@jorisvandenbossche (Member)

This is because we first try to apply the operation on the "values" of the full dataframe:

pandas/core/frame.py, lines 7930 to 7944 at commit 208bb41:

if numeric_only is None:
    values = self.values
    try:
        result = f(values)
        if filter_type == "bool" and is_object_dtype(values) and axis is None:
            # work around https://github.com/numpy/numpy/issues/10489
            # TODO: combine with hasattr(result, 'dtype') further down
            # hard since we don't have `values` down there.
            result = np.bool_(result)
    except TypeError:
        # e.g. in nanops trying to convert strs to float
        # try by-column first
        if filter_type is None and axis == 0:

and it is the creation of those values (which produces an object array) that takes most of the time (and then the attempted reduction of that object array adds more).
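
To see where the time goes, here is a minimal sketch (the timing line assumes an IPython session; numbers are illustrative, not from the original report):

import numpy as np
import pandas as pd
from datetime import datetime

n = 1000000
df = pd.DataFrame({
    "x": np.random.normal(0.0, 1.0, n),
    "d": pd.date_range(datetime.today(), periods=n, freq="1H").tolist()
})

# Mixed dtypes have no common native dtype, so .values materializes
# an object ndarray holding one boxed Python object per element
print(df.values.dtype)  # object

# Building that object array is the expensive step that df.mean() triggers
%timeit df.values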

@jorisvandenbossche jorisvandenbossche changed the title DataFrame.mean() is slow with datetime columns PERF: DataFrame.mean() is slow with datetime columns Jan 16, 2020
@jorisvandenbossche jorisvandenbossche added the Performance Memory or execution speed performance label Jan 16, 2020
@jbrockmendel (Member)

#29941 may improve this

@jbrockmendel jbrockmendel added Numeric Operations Arithmetic, Comparison, and Logical operations Reduction Operations sum, mean, min, max, etc. labels Sep 17, 2020
@lumbric

lumbric commented May 5, 2021

Seems to be improved on master

The performance issue seems to be improved on master (commit a43c42c), maybe due to #29941 as mentioned by @jbrockmendel:

In [7]: %timeit df.mean()
<magic-timeit>:1: FutureWarning: DataFrame.mean and DataFrame.median with numeric_only=None will include datetime64 and datetime64tz columns in a future version.
5.19 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: pd.__version__
Out[8]: '1.3.0.dev0+1544.ga43c42c32d'

Edit: it seems to be improved for the given minimal example with datetime objects, but it's still broken if the DataFrame contains strings. Also, I am not sure whether the column d should be missing in the result for the example with dates.

Result wrong too?

Is this really only a performance degradation? Why does df.mean() not give any result for column 'd' but df.d.mean() does?

With the minimal example defined above:

>>> df.mean()                                                                                                                       
x    0.002034
dtype: float64
>>> df.d.mean()                                                                                                                     
Timestamp('2078-05-20 01:15:34.385089536')

Behavior on master is the same.

Does numeric_only=True help?

Does performance improve if you pass numeric_only=False to DataFrame.mean?

Yes. This seems to be a valid workaround:

%timeit df.mean(numeric_only=True)
1.7 ms ± 88.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

What does numeric_only actually do?

I am not sure I understand this parameter. It seems to be a valid workaround for affected versions (e.g. 0.25.3), but it does not change the result. The docstring is also very confusing. Why is the default value None and not False? What does it mean to "attempt to use everything" but "then use only numeric data"? Does it try to convert strings to floats and then calculate the mean? If this is intended, it does not seem to work correctly.

numeric_only : bool, default None
    Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
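
For illustration, on the datetime example above the three settings behave roughly like this (a sketch against a 1.x dev build; exact output depends on the pandas version):

df.mean(numeric_only=True)   # fast: the non-numeric column "d" is dropped up front
df.mean(numeric_only=False)  # reduces every column; the result includes a Timestamp for "d"
df.mean()                    # numeric_only=None: tries the whole frame first, then falls
                             # back to numeric data only (with the FutureWarning seen above)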

@lumbric

lumbric commented May 5, 2021

Oh, I think I now understand what's going on. Things do not seem to be completely fixed on master; I have edited my previous post. But this might be a different issue.

Pandas tries to sum all elements and then convert the result to a number. For the strings "42", "42", "42" this is "42" + "42" + "42" == "424242", which is then converted to 424242.0 and divided by 3. Does this really make sense as the default df.mean() behavior on DataFrames? It basically means that calculating a mean on a dataset with a string column kicks off a very computation-intensive operation.

n = 3
df = pd.DataFrame({
    "x":  np.random.normal(0.0, 1.0, n),
    "s": ["42"] * n,
})
df.mean()

The example above returns:

x        -0.554223
s    141414.000000
dtype: float64

Another example which causes a slowdown, because the strings need to be concatenated before pandas notices that the result is not a number:

n = 100000
df = pd.DataFrame({
    "x":  np.random.normal(0.0, 1.0, n),
    "s": ["Some other longer string which causes troubles"] * n,
})
%timeit df.mean()
# 25.5 s ± 1.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# The column s is omitted in the result:
df.mean()
# x   -0.000547
# dtype: float64

All examples above tested on current master a43c42c.

@lumbric

lumbric commented May 5, 2021

Note that there is also a Stack Overflow Q&A about this.

@lumbric

lumbric commented May 5, 2021

One more thing: the minimal example seems to be fast on master, but why is the column d not in the result? To avoid breaking changes in the current version?

import numpy as np
import pandas as pd
from datetime import datetime

# A DataFrame containing a datetime64 column
n = 1000000
df = pd.DataFrame({
    "x": np.random.normal(0.0, 1.0, n),
    "d": pd.date_range(datetime.today(), periods=n, freq="1H").tolist()
})

df["d"].mean()
#  Timestamp('2078-05-20 02:33:45.791996416')

df.mean()
# <ipython-input-4-c61f0c8f89b5>:1: FutureWarning: DataFrame.mean and DataFrame.median with numeric_only=None will include datetime64 and datetime64tz columns in a future version.
# x    0.000526
# dtype: float64

pd.__version__
#  '1.3.0.dev0+1544.ga43c42c32d'

@jbrockmendel (Member)

but it's still broken if the dataframe contains strings

yah, there are a bunch of issues about this; look for issues tagged "nuisance columns". This shouldn't be that hard to fix, it just needs someone to make it a priority. Contributions welcome.

Also I am not sure if the column d should be missing in the result for the example with dates.

It should for now. That behavior is deprecated though, so the dt64 columns will be included once that deprecation is enforced.
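
Concretely, passing numeric_only=False already opts in to that future behavior (a sketch on a 1.x dev build; the values shown are illustrative):

df.mean(numeric_only=False)
# x                         0.000526
# d    2078-05-20 02:33:45.791996416
# dtype: object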

@mroeschke (Member)

Since the original issue seems fixed, this could use an asv benchmark.

@mroeschke mroeschke added Benchmark Performance (ASV) benchmarks and removed Numeric Operations Arithmetic, Comparison, and Logical operations Performance Memory or execution speed performance labels Jul 27, 2021
@jbrockmendel (Member)

A relevant benchmark would go in asv_bench.benchmarks.stat_ops, possibly as part of or adjacent to FrameOps. Based on a quick pass, I don't see any mixed-dtype reduction ASVs.
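
A minimal sketch of what such a benchmark could look like (the class and method names are hypothetical, not existing benchmarks in the pandas suite):

import numpy as np
import pandas as pd

class FrameMixedDtypeReductions:
    # ASV calls setup() before timing each time_* method
    def setup(self):
        n = 10**5
        self.df = pd.DataFrame({
            "x": np.random.normal(0.0, 1.0, n),
            "d": pd.date_range("2020-01-01", periods=n, freq="1H"),
        })

    def time_frame_mean(self):
        # the slow path from this issue: reduction over mixed dtypes
        self.df.mean(numeric_only=False)

    def time_frame_mean_numeric_only(self):
        # the fast path: the datetime column is dropped up front
        self.df.mean(numeric_only=True)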
