Aggregate mean over TimeDelta column #17382

AlbertDeFusco · 2017-08-30T18:07:05Z

Cannot compute mean of timedelta in gropuby

In [1]: import pandas as pd
   ...: import numpy as np
   ...: print(pd.__version__)
   ...: 
   ...: time = pd.to_timedelta(np.random.randint(100,1000, size=10), unit='s')
   ...: cat = np.random.choice(['A','B','C'], size=10)
   ...: 
   ...: df = pd.DataFrame(dict(time=time, cat=cat))
   ...: 
   ...: print(df['time'].mean())
   ...: 
   ...: df.groupby('cat')['time'].mean()
   ...: 
0.20.3
0 days 00:07:15.800000
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-1-b649ebd78333> in <module>()
     10 print(df['time'].mean())
     11 
---> 12 df.groupby('cat')['time'].mean()

/Users/adefusco/Applications/anaconda3/4.3/envs/module-pandas/lib/python3.6/site-packages/pandas/core/groupby.py in mean(self, *args, **kwargs)
   1035         nv.validate_groupby_func('mean', args, kwargs, ['numeric_only'])
   1036         try:
-> 1037             return self._cython_agg_general('mean', **kwargs)
   1038         except GroupByError:
   1039             raise

/Users/adefusco/Applications/anaconda3/4.3/envs/module-pandas/lib/python3.6/site-packages/pandas/core/groupby.py in _cython_agg_general(self, how, alt, numeric_only)
    834 
    835         if len(output) == 0:
--> 836             raise DataError('No numeric types to aggregate')
    837 
    838         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate

Problem description

The mean of TimeDelta can be computed, but it is not working in GroupBy.

Expected Output

This is the output I would expect.

In [3]: def average_time(x):
   ...:     s = x.dt.seconds
   ...:     return pd.Timedelta(s.mean(), unit='s')
   ...: 
   ...: df.groupby('cat')['time'].apply(average_time)
   ...: 
Out[3]: 
cat
A   00:09:27.666667
B   00:06:18.666667
C   00:06:19.750000
Name: time, dtype: timedelta64[ns]

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.2.1

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2017-08-30T18:31:44Z

I think this is just because we don't have a cython implementation of group_mean in

pandas/pandas/_libs/groupby_helper.pxi.in

Line 31 in ad7d6fc

{{for name, c_type, dest_type, dest_type2, dest_dtype in get_dispatch(dtypes)}}

if you're interested in taking a look

FYI, a slightly smaller workaround is df.groupby('cat')['time'].agg(lambda x: x.mean())

TomAugspurger · 2017-08-30T18:34:39Z

Oh, it might be a bit more involved than that, since it looks like we call ._get_numeric_data, which excludes timedeltas.

chris-b1 · 2017-08-31T09:29:12Z

The cython path will handle timedeltas as below

In [47]: df.groupby('cat')['time'].mean(numeric_only=False)
Out[47]: 
cat
A   00:09:46.666666
B   00:10:02.750000
Name: time, dtype: timedelta64[ns]

Arguably not a great API - we automatically exclude "non-numeric" columns on groupby ops. This makes sense with an entire DataFrame, but perhaps on a Series selection this should default to true?

Holt59 · 2018-02-02T10:58:14Z

@chris-b1 Just tried this on my dataframe, and it does not give me correct results, I think it's because it handles NaT incorrectly (it gives me negative Timedelta from a dataframe containing only positive Timedelta and NaT).

MattW216 · 2018-08-15T12:14:59Z

This same issue is occurring when using .resample(). I'm guessing that uses GroupBy in some fashion, but in case it does not, I thought I'd mention it here.

Also, @chris-b1 's workaround worked just fine for me.

qinyugwu · 2018-12-26T20:44:18Z

It seems to make sense to convert timestamp column to either days or seconds by using the inbuilt functions to get the mean.

chbrown · 2019-02-28T18:41:08Z

FWIW (another workaround), you can use ✅ ...agg({'duration': pd.Series.mean}) instead of ❌ ...agg({'duration': 'mean'}) or ❌ ...agg({'duration': np.mean}) (where duration is a timedelta64-typed column).

t3ndai · 2020-03-21T02:33:15Z

this is still an issue for me too, left my head spinning as I was able to get the mean | median values using describe etc. but then when pivoting and group_by started to get numeric values not found error. Though solutions above are good workarounds

mroeschke · 2021-06-12T04:29:26Z

This look to work on master now. Could use a test

In [36]: In [1]: import pandas as pd
    ...:    ...: import numpy as np
    ...:    ...: print(pd.__version__)
    ...:    ...:
    ...:    ...: time = pd.to_timedelta(np.random.randint(100,1000, size=10), unit='s')
    ...:    ...: cat = np.random.choice(['A','B','C'], size=10)
    ...:    ...:
    ...:    ...: df = pd.DataFrame(dict(time=time, cat=cat))
    ...:    ...:
    ...:    ...: print(df['time'].mean())
    ...:    ...:
    ...:    ...: df.groupby('cat')['time'].mean()
1.3.0.dev0+1889.gf949788596
0 days 00:10:26.100000
Out[36]:
cat
A          0 days 00:13:24
B   0 days 00:09:36.500000
C          0 days 00:07:53
Name: time, dtype: timedelta64[ns]

TomAugspurger added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Difficulty Intermediate Groupby labels Aug 30, 2017

gfyoung mentioned this issue Jun 22, 2018

Support mean() for Timestamps. #21583

Closed

jchia mentioned this issue Oct 31, 2018

Dropping nuisance columns in groupby is a nuisance #21664

Closed

simonjayhawkins mentioned this issue Jun 19, 2019

pivot_table raises DataError: No numeric types to aggregate with Timedelta #26938

Closed

jbrockmendel removed Effort Medium labels Oct 21, 2019

jbrockmendel added the Apply Apply, Aggregate, Transform, Map label Oct 31, 2019

jbrockmendel self-assigned this Oct 31, 2019

retsyo mentioned this issue Apr 13, 2020

BUG: DataError: No numeric types to aggregate during pd.pivot_table #33515

Closed

mroeschke added Enhancement Numeric Operations Arithmetic, Comparison, and Logical operations Timedelta Timedelta data type and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Apply Apply, Aggregate, Transform, Map labels May 13, 2020

jbrockmendel added the Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply label Sep 23, 2020

mroeschke removed Enhancement Groupby Nuisance Columns Identifying/Dropping nuisance columns in reductions, groupby.add, DataFrame.apply labels Jun 12, 2021

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Numeric Operations Arithmetic, Comparison, and Logical operations Timedelta Timedelta data type labels Jun 12, 2021

mroeschke mentioned this issue Oct 31, 2021

TST: Old issues #44245

Merged

9 tasks

jreback added this to the 1.4 milestone Oct 31, 2021

jreback added Groupby Timedelta Timedelta data type labels Oct 31, 2021

jreback closed this as completed in #44245 Nov 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Aggregate mean over TimeDelta column #17382

Aggregate mean over TimeDelta column #17382

AlbertDeFusco commented Aug 30, 2017 •

edited

Loading

INSTALLED VERSIONS

TomAugspurger commented Aug 30, 2017

TomAugspurger commented Aug 30, 2017

chris-b1 commented Aug 31, 2017

Holt59 commented Feb 2, 2018

MattW216 commented Aug 15, 2018

qinyugwu commented Dec 26, 2018

chbrown commented Feb 28, 2019

t3ndai commented Mar 21, 2020 •

edited

Loading

mroeschke commented Jun 12, 2021

Aggregate mean over TimeDelta column #17382

Aggregate mean over TimeDelta column #17382

Comments

AlbertDeFusco commented Aug 30, 2017 • edited Loading

Cannot compute mean of timedelta in gropuby

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

TomAugspurger commented Aug 30, 2017

TomAugspurger commented Aug 30, 2017

chris-b1 commented Aug 31, 2017

Holt59 commented Feb 2, 2018

MattW216 commented Aug 15, 2018

qinyugwu commented Dec 26, 2018

chbrown commented Feb 28, 2019

t3ndai commented Mar 21, 2020 • edited Loading

mroeschke commented Jun 12, 2021

AlbertDeFusco commented Aug 30, 2017 •

edited

Loading

Output of `pd.show_versions()`

t3ndai commented Mar 21, 2020 •

edited

Loading