Time-based .rolling() fails with .groupby() #13966

chrisaycock · 2016-08-11T17:26:11Z

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

I can easily compute the rolling mean by identifier:

In [20]: df.groupby('A').rolling(4).B.mean()
Out[20]:
A
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
   5      3.5
   6      4.5
   7      5.5
   8      6.5
   9      7.5
         ...
2  30    28.5
   31    29.5
3  32     NaN
   33     NaN
   34     NaN
   35    33.5
   36    34.5
   37    35.5
   38    36.5
   39    37.5
Name: B, dtype: float64

Now I want to add a timestamp column:

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

The timestamps are ordered within each identifier, but pandas complains:

In [25]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: B must be monotonic

Re-sorting leads to a different error:

In [26]: df.sort_values('B', inplace=True)

In [27]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: invalid on specified as B, must be a column (if DataFrame) or None

But we know that these column names are valid:

n [28]: df.rolling('4s', on='B').C.mean()
Out[28]:
0      0.000000
20    10.000000
1      7.000000
21    10.500000
2      8.800000
22    11.000000
3      9.857143
23    11.500000
4     10.857143
24    12.500000
        ...
35    24.714286
15    23.500000
36    25.714286
16    24.500000
37    26.714286
17    25.500000
38    27.714286
18    26.500000
19    25.857143
39    27.500000
Name: C, dtype: float64

It seems like a bug that time-based .rolling() does not work with .groupby().

The text was updated successfully, but these errors were encountered:

jorisvandenbossche · 2016-08-11T17:51:55Z

Agree that this should work!

hueykwik · 2016-11-26T18:48:23Z

Is there a good workaround?

related issue: #25 Note: there is a bug using groupby with rolling on specific column for now, so we are not using the `on` parameter in rolling. pandas-dev/pandas#13966

ShashankBharadwaj · 2016-12-20T14:30:50Z

Is the any way to do time aware rolling with group by for now before the new pandas release?

jreback · 2016-12-20T15:53:57Z

@ShashankBharadwaj this is not fixed, so new release or not is not going to matter.

zscholl · 2017-04-20T13:02:52Z

I stumbled on this yesterday as I was trying to solve the same problem.

I found a workaround, it's definitely not efficient, but it works.

import pandas as pd

import numpy as np

%load_ext watermark

%watermark -v -m -p pandas,numpy
CPython 3.5.1
IPython 4.2.0

pandas 0.19.2
numpy 1.11.0

compiler   : MSC v.1900 64 bit (AMD64)
system     : Windows
release    : 7
machine    : AMD64
processor  : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit

# load up the example dataframe
dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# sort on the datetime column
df.sort_values('B', inplace=True)

#group it
grouped = df.groupby('A')

# tmp array to hold frames
frames = []

for group in grouped.groups:
    frame = grouped.get_group(group)
    frame['avg'] = frame.rolling('4s', on='B').C.mean()
    frames.append(frame)

pd.concat(frames)
Out[18]: 
    A                   B   C   avg
0   1 2016-01-01 09:30:00   0   0.0
1   1 2016-01-01 09:30:01   1   0.5
2   1 2016-01-01 09:30:02   2   1.0
3   1 2016-01-01 09:30:03   3   1.5
4   1 2016-01-01 09:30:04   4   2.5
5   1 2016-01-01 09:30:05   5   3.5
6   1 2016-01-01 09:30:06   6   4.5
7   1 2016-01-01 09:30:07   7   5.5
8   1 2016-01-01 09:30:08   8   6.5
9   1 2016-01-01 09:30:09   9   7.5
10  1 2016-01-01 09:30:10  10   8.5
11  1 2016-01-01 09:30:11  11   9.5
12  1 2016-01-01 09:30:12  12  10.5
13  1 2016-01-01 09:30:13  13  11.5
14  1 2016-01-01 09:30:14  14  12.5
15  1 2016-01-01 09:30:15  15  13.5
16  1 2016-01-01 09:30:16  16  14.5
17  1 2016-01-01 09:30:17  17  15.5
18  1 2016-01-01 09:30:18  18  16.5
19  1 2016-01-01 09:30:19  19  17.5
20  2 2016-01-01 09:30:00  20  20.0
21  2 2016-01-01 09:30:01  21  20.5
22  2 2016-01-01 09:30:02  22  21.0
23  2 2016-01-01 09:30:03  23  21.5
24  2 2016-01-01 09:30:04  24  22.5
25  2 2016-01-01 09:30:05  25  23.5
26  2 2016-01-01 09:30:06  26  24.5
27  2 2016-01-01 09:30:07  27  25.5
28  2 2016-01-01 09:30:08  28  26.5
29  2 2016-01-01 09:30:09  29  27.5
30  2 2016-01-01 09:30:10  30  28.5
31  2 2016-01-01 09:30:11  31  29.5
32  3 2016-01-01 09:30:12  32  32.0
33  3 2016-01-01 09:30:13  33  32.5
34  3 2016-01-01 09:30:14  34  33.0
35  3 2016-01-01 09:30:15  35  33.5
36  3 2016-01-01 09:30:16  36  34.5
37  3 2016-01-01 09:30:17  37  35.5
38  3 2016-01-01 09:30:18  38  36.5
39  3 2016-01-01 09:30:19  39  37.5

Hope this helps anyone in the meantime before a bug fix is provided. I haven't contributed to pandas yet, but having used it so much, maybe it's about time :)

chrisaycock · 2017-04-20T13:21:52Z

@zscholl Thanks for your workaround. A bug fixes would be great if you're offering.

zscholl · 2017-04-20T13:29:03Z

I'll take a look at it in the coming weeks, @chrisaycock and see what I can do!

jreback · 2017-04-20T13:43:55Z

so this actually works on master now. fixed by #15694 (this will make sure that when sorting a multi-index it actually IS sorting it, previously it would not guarantee monotonic levels, only lexsortedness).

so this just needs tests

In [7]: pd.options.display.max_rows=12

In [8]: pd.__version__
Out[8]: '0.19.0+829.gb17e286'

In [9]: dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
   ...: df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
   ...:                    'B': np.concatenate((dates, dates)),
   ...:                    'C': np.arange(40)})
   ...: 

In [10]: df.groupby('A').rolling('4s', on='B').C.mean()
Out[10]: 
A  B                  
1  2016-01-01 09:30:00     0.0
   2016-01-01 09:30:01     0.5
   2016-01-01 09:30:02     1.0
   2016-01-01 09:30:03     1.5
   2016-01-01 09:30:04     2.5
   2016-01-01 09:30:05     3.5
                          ... 
3  2016-01-01 09:30:14    33.0
   2016-01-01 09:30:15    33.5
   2016-01-01 09:30:16    34.5
   2016-01-01 09:30:17    35.5
   2016-01-01 09:30:18    36.5
   2016-01-01 09:30:19    37.5
Name: C, Length: 40, dtype: float64

closes pandas-dev#13966 xref to pandas-dev#15130, closed by pandas-dev#15175

closes #13966 xref to #15130, closed by #15175

closes pandas-dev#13966 xref to pandas-dev#15130, closed by pandas-dev#15175

sronen71 · 2017-12-16T17:46:41Z

pandas 0.21.0
This works ok as above:
df.groupby('A').rolling('4s', on='B').C.mean()
But this doesn't:
df.groupby('A').rolling('4s', on='B',closed='left').C.mean()
Gives error:

Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 176, in getattr
return self[attr]
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 255, in getitem
return self._gotitem(key, ndim=2)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1322, in _gotitem
return super(RollingGroupby, self)._gotitem(key, ndim, subset=subset)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 678, in _gotitem
**kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 697, in init
super(GroupByMixin, self).init(obj, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 86, in init
self.validate()
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1124, in validate
raise ValueError("closed only implemented for datetimelike "
ValueError: closed only implemented for datetimelike and offset based windows

guoliy · 2018-04-27T08:24:40Z

The same problem with @sronen71. Has it fixed?

morganics · 2018-05-22T09:13:59Z

Additionally, when using Dask;

ddf['K'] = ddf[['A', 'B', 'C']].groupby(by='A').apply(lambda x: x.rolling('90d', on='B')['C'].sum(), meta=('K', 'float64'))

Returns the ValueError: B must be monotonic error. The groupby is done in Dask, but the rolling is in Pandas land.

lycanthropes · 2019-03-14T10:36:03Z

how about the progress?

TomAugspurger · 2019-03-14T10:46:52Z

Probably want to hope a new issue with a minimal example if there are problems.

…

On Thu, Mar 14, 2019 at 5:36 AM lycanthropes ***@***.***> wrote: how about the progress? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#13966 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIvyTz3o5rB64DCox66u4MRJ9J5qcks5vWiYdgaJpZM4JiWlY> .

jreback · 2019-03-14T11:03:27Z

this issue is fixed

lycanthropes · 2019-03-14T11:10:47Z

What I am looking for is to do monthly rolling. Now pandas still does not work with this issue. Such as:
df.groupby('stock_code')['ret_daily'].rolling(window='3M').std()

lycanthropes · 2019-03-14T11:13:10Z

I can't use window='90D' instead, because sometimes it means a look-ahead bias in strategy backtesting.

GeorgievF · 2021-02-08T20:12:52Z

This issue is not perfectly solved, as it does gives the same error of chrisaycock when a Multi-index is formed of the groupby object and the datetime one

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40),})
df.set_index(['A', 'B'], inplace=True)
df.groupby('A').rolling('4s', on='B').C.mean()

jreback · 2021-02-08T20:17:06Z

@GeorgievF pls file a new issue with a reproducible example and using latest pandas

GeorgievF · 2021-02-08T20:53:19Z

Done!
#39686

jorisvandenbossche added Bug Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Aug 11, 2016

chrisaycock mentioned this issue Aug 16, 2016

Rolling groupby should not maintain the by column in the resulting DataFrame #14013

Closed

semio added a commit to semio/ddf_utils that referenced this issue Dec 13, 2016

new procedure: windows

3b14540

related issue: #25 Note: there is a bug using groupby with rolling on specific column for now, so we are not using the `on` parameter in rolling. pandas-dev/pandas#13966

jreback added this to the Next Major Release milestone Dec 20, 2016

jreback added Difficulty Intermediate labels Dec 20, 2016

jreback added a commit to jreback/pandas that referenced this issue Apr 22, 2017

BUG: groupby-rolling with a timedelta

e28d07e

closes pandas-dev#13966 xref to pandas-dev#15130, closed by pandas-dev#15175

jreback modified the milestones: 0.20.0, Next Major Release Apr 22, 2017

jreback mentioned this issue Apr 22, 2017

BUG: groupby-rolling with a timedelta #16091

Merged

jreback added a commit to jreback/pandas that referenced this issue Apr 22, 2017

BUG: groupby-rolling with a timedelta

b30a50f

closes pandas-dev#13966 xref to pandas-dev#15130, closed by pandas-dev#15175

jreback closed this as completed in #16091 Apr 22, 2017

jreback added a commit that referenced this issue Apr 22, 2017

BUG: groupby-rolling with a timedelta (#16091)

f0bd908

closes #13966 xref to #15130, closed by #15175

pcluo pushed a commit to pcluo/pandas that referenced this issue May 22, 2017

BUG: groupby-rolling with a timedelta (pandas-dev#16091)

a66a612

closes pandas-dev#13966 xref to pandas-dev#15130, closed by pandas-dev#15175

GeorgievF mentioned this issue Feb 8, 2021

DOC: Remove mention that on with MultiIndex is supported with Rolling Groupby #39686

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Time-based .rolling() fails with .groupby() #13966

Time-based .rolling() fails with .groupby() #13966

chrisaycock commented Aug 11, 2016 •

edited

Loading

jorisvandenbossche commented Aug 11, 2016

hueykwik commented Nov 26, 2016

ShashankBharadwaj commented Dec 20, 2016

jreback commented Dec 20, 2016

zscholl commented Apr 20, 2017 •

edited

Loading

chrisaycock commented Apr 20, 2017

zscholl commented Apr 20, 2017

jreback commented Apr 20, 2017

sronen71 commented Dec 16, 2017

guoliy commented Apr 27, 2018

morganics commented May 22, 2018 •

edited

Loading

lycanthropes commented Mar 14, 2019

TomAugspurger commented Mar 14, 2019 via email

jreback commented Mar 14, 2019

lycanthropes commented Mar 14, 2019 •

edited

Loading

lycanthropes commented Mar 14, 2019

GeorgievF commented Feb 8, 2021

jreback commented Feb 8, 2021

GeorgievF commented Feb 8, 2021

Time-based .rolling() fails with .groupby() #13966

Time-based .rolling() fails with .groupby() #13966

Comments

chrisaycock commented Aug 11, 2016 • edited Loading

jorisvandenbossche commented Aug 11, 2016

hueykwik commented Nov 26, 2016

ShashankBharadwaj commented Dec 20, 2016

jreback commented Dec 20, 2016

zscholl commented Apr 20, 2017 • edited Loading

chrisaycock commented Apr 20, 2017

zscholl commented Apr 20, 2017

jreback commented Apr 20, 2017

sronen71 commented Dec 16, 2017

guoliy commented Apr 27, 2018

morganics commented May 22, 2018 • edited Loading

lycanthropes commented Mar 14, 2019

TomAugspurger commented Mar 14, 2019 via email

jreback commented Mar 14, 2019

lycanthropes commented Mar 14, 2019 • edited Loading

lycanthropes commented Mar 14, 2019

GeorgievF commented Feb 8, 2021

jreback commented Feb 8, 2021

GeorgievF commented Feb 8, 2021

chrisaycock commented Aug 11, 2016 •

edited

Loading

zscholl commented Apr 20, 2017 •

edited

Loading

morganics commented May 22, 2018 •

edited

Loading

lycanthropes commented Mar 14, 2019 •

edited

Loading