
PERF: DataFrame.mean() is slow with datetime columns #31075


Closed
richierocks opened this issue Jan 16, 2020 · 11 comments · Fixed by #55939
Labels
Benchmark: Performance (ASV) benchmarks
Reduction Operations: sum, mean, min, max, etc.

Comments

@richierocks

richierocks commented Jan 16, 2020

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
from datetime import datetime

# A DataFrame containing a datetime64 column
n = 1000000
df = pd.DataFrame({
    "x": np.random.normal(0.0, 1.0, n),
    "d": pd.date_range(datetime.today(), periods=n, freq="1H").tolist()
})

# Calculating the mean of individual columns is pretty much instantaneous
%timeit df["x"].mean()
## 1000 loops, best of 3: 1.35 ms per loop
%timeit df["d"].mean()
## 100 loops, best of 3: 2.91 ms per loop

# Calling the DataFrame's .mean() method takes a really long time
%timeit df.mean()
## 1 loop, best of 3: 9.23 s per loop

Problem description

When a DataFrame contains a datetime64 column, running the .mean() method on the whole DataFrame takes thousands of times longer than running .mean() on each column individually.

Expected Output

Answer is correct; just too slow.

Output of pd.show_versions()


INSTALLED VERSIONS

commit : None
python : 3.6.1.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.13.3
pytz : 2017.2
dateutil : 2.7.3
pip : 19.2.3
setuptools : 44.0.0.post20200106
Cython : 0.25.2
pytest : 3.0.7
hypothesis : None
sphinx : 1.5.6
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : 3.7.3
html5lib : 0.999
pymysql : 0.9.3
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 5.3.0
pandas_datareader: None
bs4 : 4.6.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 3.7.3
matplotlib : 2.0.2
numexpr : 2.6.2
odfpy : None
openpyxl : 2.4.7
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 0.19.1
sqlalchemy : 1.3.8
tables : 3.3.0
xarray : None
xlrd : 1.0.0
xlwt : 1.2.0
xlsxwriter : 0.9.6

I asked about this on Stack Overflow (https://stackoverflow.com/questions/59759107/how-to-avoid-poor-performance-of-pandas-mean-with-datetime-columns); one respondent hazarded a guess that the issue may lie with these lines of code (though no rigorous debugging was done).

https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/arrays/datetimes.py#L601-L603

@MarcoGorelli (Member)

Can confirm this reproduces on master, thanks for the report!

@jbrockmendel (Member)

Does performance improve if you pass numeric_only=False to DataFrame.mean?

@jorisvandenbossche (Member)

This is because we first try to apply the operation on the "values" of the full dataframe:

pandas/core/frame.py, lines 7930 to 7944 at commit 208bb41:

if numeric_only is None:
    values = self.values
    try:
        result = f(values)
        if filter_type == "bool" and is_object_dtype(values) and axis is None:
            # work around https://github.com/numpy/numpy/issues/10489
            # TODO: combine with hasattr(result, 'dtype') further down
            # hard since we don't have `values` down there.
            result = np.bool_(result)
    except TypeError:
        # e.g. in nanops trying to convert strs to float
        # try by-column first
        if filter_type is None and axis == 0:

and it is the creation of those values (which produces an object array) that takes most of the time (and then the attempted reduction of that object array adds more).
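
To see where the time goes, here is a minimal sketch (the timing line assumes an IPython session; numbers are illustrative, not from the original report):

import numpy as np
import pandas as pd
from datetime import datetime

n = 1000000
df = pd.DataFrame({
    "x": np.random.normal(0.0, 1.0, n),
    "d": pd.date_range(datetime.today(), periods=n, freq="1H").tolist()
})

# Mixed dtypes have no common native dtype, so .values materializes
# an object ndarray holding one boxed Python object per element
print(df.values.dtype)  # object

# Building that object array is the expensive step that df.mean() triggers
%timeit df.values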

@jorisvandenbossche jorisvandenbossche changed the title DataFrame.mean() is slow with datetime columns PERF: DataFrame.mean() is slow with datetime columns Jan 16, 2020
@jorisvandenbossche jorisvandenbossche added the Performance Memory or execution speed performance label Jan 16, 2020
@jbrockmendel (Member)

#29941 may improve this

@jbrockmendel jbrockmendel added Numeric Operations Arithmetic, Comparison, and Logical operations Reduction Operations sum, mean, min, max, etc. labels Sep 17, 2020
@lumbric

lumbric commented May 5, 2021

Seems to be improved on master

The performance issue seems to be improved on master (commit a43c42c), maybe due to #29941 as mentioned by @jbrockmendel:

In [7]: %timeit df.mean()
<magic-timeit>:1: FutureWarning: DataFrame.mean and DataFrame.median with numeric_only=None will include datetime64 and datetime64tz columns in a future version.
5.19 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: pd.__version__
Out[8]: '1.3.0.dev0+1544.ga43c42c32d'

Edit: it seems to be improved for the given minimal example with datetime objects, but it's still broken if the DataFrame contains strings. Also, I am not sure whether the column d should be missing in the result for the example with dates.

Result wrong too?

Is this really only a performance degradation? Why does df.mean() not give any result for column 'd' but df.d.mean() does?

With the minimal example defined above:

>>> df.mean()                                                                                                                       
x    0.002034
dtype: float64
>>> df.d.mean()                                                                                                                     
Timestamp('2078-05-20 01:15:34.385089536')

Behavior on master is the same.

Does numeric_only=True help?

Does performance improve if you pass numeric_only=False to DataFrame.mean?

Yes. This seems to be a valid workaround:

%timeit df.mean(numeric_only=True)
1.7 ms ± 88.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

What does numeric_only actually do?

I am not sure I understand this parameter. It seems to be a valid workaround for affected versions (e.g. 0.25.3), but it does not change the result. The docstring is also very confusing. Why is the default value None and not False? What does it mean to "attempt to use everything" but "then use only numeric data"? Does it try to convert strings to floats and then calculate the mean? If this is intended, it does not seem to work correctly.

numeric_only : bool, default None
    Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
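
For illustration, on the datetime example above the three settings behave roughly like this (a sketch against a 1.x dev build; exact output depends on the pandas version):

df.mean(numeric_only=True)   # fast: the non-numeric column "d" is dropped up front
df.mean(numeric_only=False)  # reduces every column; the result includes a Timestamp for "d"
df.mean()                    # numeric_only=None: tries the whole frame first, then falls
                             # back to numeric data only (with the FutureWarning seen above)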

@lumbric

lumbric commented May 5, 2021

Oh, I think I now understand what's going on. Things do not seem to be completely fixed on master; I have edited my previous post. But this might be a different issue.

Pandas tries to sum all elements and then convert the result to a number. For the strings "42", "42", "42" this is "42" + "42" + "42" == "424242", which is then converted to 424242.0 and divided by 3. Does this really make sense as the default df.mean() behavior on DataFrames? It basically means that calculating a mean on a dataset with a string column kicks off a very computation-intensive operation.

n = 3
df = pd.DataFrame({
    "x":  np.random.normal(0.0, 1.0, n),
    "s": ["42"] * n,
})
df.mean()

The example above returns:

x        -0.554223
s    141414.000000
dtype: float64

Another example which causes a slowdown, because the strings need to be concatenated before pandas notices that the result is not a number:

n = 100000
df = pd.DataFrame({
    "x":  np.random.normal(0.0, 1.0, n),
    "s": ["Some other longer string which causes troubles"] * n,
})
%timeit df.mean()
# 25.5 s ± 1.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# The column s is omitted in the result:
df.mean()
# x   -0.000547
# dtype: float64

All examples above tested on current master a43c42c.

@lumbric

lumbric commented May 5, 2021

Note that there is also a Stack Overflow Q&A about this.

@lumbric

lumbric commented May 5, 2021

One more thing: the minimal example seems to be fast on master, but why is the column d not in the result? To avoid breaking changes in the current version?

import numpy as np
import pandas as pd
from datetime import datetime

# A DataFrame containing a datetime64 column
n = 1000000
df = pd.DataFrame({
    "x": np.random.normal(0.0, 1.0, n),
    "d": pd.date_range(datetime.today(), periods=n, freq="1H").tolist()
})

df["d"].mean()
#  Timestamp('2078-05-20 02:33:45.791996416')

df.mean()
# <ipython-input-4-c61f0c8f89b5>:1: FutureWarning: DataFrame.mean and DataFrame.median with numeric_only=None will include datetime64 and datetime64tz columns in a future version.
# x    0.000526
# dtype: float64

pd.__version__
#  '1.3.0.dev0+1544.ga43c42c32d'

@jbrockmendel (Member)

but it's still broken if the dataframe contains strings

yah, there are a bunch of issues about this; look for issues tagged "nuisance columns". This shouldn't be that hard to fix, it just needs someone to make it a priority. Contributions welcome.

Also I am not sure if the column d should be missing in the result for the example with dates.

It should for now. That behavior is deprecated though, so the dt64 columns will be included once that deprecation is enforced.
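
Concretely, passing numeric_only=False already opts in to that future behavior (a sketch on a 1.x dev build; the values shown are illustrative):

df.mean(numeric_only=False)
# x                         0.000526
# d    2078-05-20 02:33:45.791996416
# dtype: object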

@mroeschke (Member)

Since the original issue seems fixed, this could use an asv benchmark.

@mroeschke mroeschke added Benchmark Performance (ASV) benchmarks and removed Numeric Operations Arithmetic, Comparison, and Logical operations Performance Memory or execution speed performance labels Jul 27, 2021
@jbrockmendel (Member)

A relevant benchmark would go in asv_bench.benchmarks.stat_ops, possibly as part of or adjacent to FrameOps. Based on a quick pass, I don't see any mixed-dtype reduction ASVs.
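
A minimal sketch of what such a benchmark could look like (the class and method names are hypothetical, not existing benchmarks in the pandas suite):

import numpy as np
import pandas as pd

class FrameMixedDtypeReductions:
    # ASV calls setup() before timing each time_* method
    def setup(self):
        n = 10**5
        self.df = pd.DataFrame({
            "x": np.random.normal(0.0, 1.0, n),
            "d": pd.date_range("2020-01-01", periods=n, freq="1H"),
        })

    def time_frame_mean(self):
        # the slow path from this issue: reduction over mixed dtypes
        self.df.mean(numeric_only=False)

    def time_frame_mean_numeric_only(self):
        # the fast path: the datetime column is dropped up front
        self.df.mean(numeric_only=True)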
