Skip to content

Time-based .rolling() fails with .groupby() #13966

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
chrisaycock opened this issue Aug 11, 2016 · 19 comments · Fixed by #16091
Closed

Time-based .rolling() fails with .groupby() #13966

chrisaycock opened this issue Aug 11, 2016 · 19 comments · Fixed by #16091
Labels
Bug Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@chrisaycock
Copy link
Contributor

chrisaycock commented Aug 11, 2016

Starting with this example:

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

I can easily compute the rolling mean by identifier:

In [20]: df.groupby('A').rolling(4).B.mean()
Out[20]:
A
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
   5      3.5
   6      4.5
   7      5.5
   8      6.5
   9      7.5
         ...
2  30    28.5
   31    29.5
3  32     NaN
   33     NaN
   34     NaN
   35    33.5
   36    34.5
   37    35.5
   38    36.5
   39    37.5
Name: B, dtype: float64

Now I want to add a timestamp column:

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

The timestamps are ordered within each identifier, but pandas complains:

In [25]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: B must be monotonic

Re-sorting leads to a different error:

In [26]: df.sort_values('B', inplace=True)

In [27]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: invalid on specified as B, must be a column (if DataFrame) or None

But we know that these column names are valid:

n [28]: df.rolling('4s', on='B').C.mean()
Out[28]:
0      0.000000
20    10.000000
1      7.000000
21    10.500000
2      8.800000
22    11.000000
3      9.857143
23    11.500000
4     10.857143
24    12.500000
        ...
35    24.714286
15    23.500000
36    25.714286
16    24.500000
37    26.714286
17    25.500000
38    27.714286
18    26.500000
19    25.857143
39    27.500000
Name: C, dtype: float64

It seems like a bug that time-based .rolling() does not work with .groupby().

@jorisvandenbossche
Copy link
Member

Agree that this should work!

@hueykwik
Copy link

Is there a good workaround?

semio added a commit to semio/ddf_utils that referenced this issue Dec 13, 2016
related issue: #25

Note: there is a bug using groupby with rolling on specific column for now, so
we are not using the `on` parameter in rolling.
pandas-dev/pandas#13966
@ShashankBharadwaj
Copy link

Is the any way to do time aware rolling with group by for now before the new pandas release?

@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

@ShashankBharadwaj this is not fixed, so new release or not is not going to matter.

@jreback jreback added this to the Next Major Release milestone Dec 20, 2016
@zscholl
Copy link

zscholl commented Apr 20, 2017

I stumbled on this yesterday as I was trying to solve the same problem.

I found a workaround, it's definitely not efficient, but it works.

import pandas as pd

import numpy as np

%load_ext watermark

%watermark -v -m -p pandas,numpy
CPython 3.5.1
IPython 4.2.0

pandas 0.19.2
numpy 1.11.0

compiler   : MSC v.1900 64 bit (AMD64)
system     : Windows
release    : 7
machine    : AMD64
processor  : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit

# load up the example dataframe
dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# sort on the datetime column
df.sort_values('B', inplace=True)

#group it
grouped = df.groupby('A')

# tmp array to hold frames
frames = []

for group in grouped.groups:
    frame = grouped.get_group(group)
    frame['avg'] = frame.rolling('4s', on='B').C.mean()
    frames.append(frame)

pd.concat(frames)
Out[18]: 
    A                   B   C   avg
0   1 2016-01-01 09:30:00   0   0.0
1   1 2016-01-01 09:30:01   1   0.5
2   1 2016-01-01 09:30:02   2   1.0
3   1 2016-01-01 09:30:03   3   1.5
4   1 2016-01-01 09:30:04   4   2.5
5   1 2016-01-01 09:30:05   5   3.5
6   1 2016-01-01 09:30:06   6   4.5
7   1 2016-01-01 09:30:07   7   5.5
8   1 2016-01-01 09:30:08   8   6.5
9   1 2016-01-01 09:30:09   9   7.5
10  1 2016-01-01 09:30:10  10   8.5
11  1 2016-01-01 09:30:11  11   9.5
12  1 2016-01-01 09:30:12  12  10.5
13  1 2016-01-01 09:30:13  13  11.5
14  1 2016-01-01 09:30:14  14  12.5
15  1 2016-01-01 09:30:15  15  13.5
16  1 2016-01-01 09:30:16  16  14.5
17  1 2016-01-01 09:30:17  17  15.5
18  1 2016-01-01 09:30:18  18  16.5
19  1 2016-01-01 09:30:19  19  17.5
20  2 2016-01-01 09:30:00  20  20.0
21  2 2016-01-01 09:30:01  21  20.5
22  2 2016-01-01 09:30:02  22  21.0
23  2 2016-01-01 09:30:03  23  21.5
24  2 2016-01-01 09:30:04  24  22.5
25  2 2016-01-01 09:30:05  25  23.5
26  2 2016-01-01 09:30:06  26  24.5
27  2 2016-01-01 09:30:07  27  25.5
28  2 2016-01-01 09:30:08  28  26.5
29  2 2016-01-01 09:30:09  29  27.5
30  2 2016-01-01 09:30:10  30  28.5
31  2 2016-01-01 09:30:11  31  29.5
32  3 2016-01-01 09:30:12  32  32.0
33  3 2016-01-01 09:30:13  33  32.5
34  3 2016-01-01 09:30:14  34  33.0
35  3 2016-01-01 09:30:15  35  33.5
36  3 2016-01-01 09:30:16  36  34.5
37  3 2016-01-01 09:30:17  37  35.5
38  3 2016-01-01 09:30:18  38  36.5
39  3 2016-01-01 09:30:19  39  37.5

Hope this helps anyone in the meantime before a bug fix is provided. I haven't contributed to pandas yet, but having used it so much, maybe it's about time :)

@chrisaycock
Copy link
Contributor Author

@zscholl Thanks for your workaround. A bug fixes would be great if you're offering.

@zscholl
Copy link

zscholl commented Apr 20, 2017

I'll take a look at it in the coming weeks, @chrisaycock and see what I can do!

@jreback
Copy link
Contributor

jreback commented Apr 20, 2017

so this actually works on master now. fixed by #15694 (this will make sure that when sorting a multi-index it actually IS sorting it, previously it would not guarantee monotonic levels, only lexsortedness).

so this just needs tests

In [7]: pd.options.display.max_rows=12

In [8]: pd.__version__
Out[8]: '0.19.0+829.gb17e286'

In [9]: dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
   ...: df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
   ...:                    'B': np.concatenate((dates, dates)),
   ...:                    'C': np.arange(40)})
   ...: 

In [10]: df.groupby('A').rolling('4s', on='B').C.mean()
Out[10]: 
A  B                  
1  2016-01-01 09:30:00     0.0
   2016-01-01 09:30:01     0.5
   2016-01-01 09:30:02     1.0
   2016-01-01 09:30:03     1.5
   2016-01-01 09:30:04     2.5
   2016-01-01 09:30:05     3.5
                          ... 
3  2016-01-01 09:30:14    33.0
   2016-01-01 09:30:15    33.5
   2016-01-01 09:30:16    34.5
   2016-01-01 09:30:17    35.5
   2016-01-01 09:30:18    36.5
   2016-01-01 09:30:19    37.5
Name: C, Length: 40, dtype: float64

jreback added a commit to jreback/pandas that referenced this issue Apr 22, 2017
@jreback jreback modified the milestones: 0.20.0, Next Major Release Apr 22, 2017
jreback added a commit to jreback/pandas that referenced this issue Apr 22, 2017
jreback added a commit that referenced this issue Apr 22, 2017
pcluo pushed a commit to pcluo/pandas that referenced this issue May 22, 2017
@sronen71
Copy link

pandas 0.21.0
This works ok as above:
df.groupby('A').rolling('4s', on='B').C.mean()
But this doesn't:
df.groupby('A').rolling('4s', on='B',closed='left').C.mean()
Gives error:

Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 176, in getattr
return self[attr]
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 255, in getitem
return self._gotitem(key, ndim=2)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1322, in _gotitem
return super(RollingGroupby, self)._gotitem(key, ndim, subset=subset)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/base.py", line 678, in _gotitem
**kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 697, in init
super(GroupByMixin, self).init(obj, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 86, in init
self.validate()
File "/usr/local/lib/python2.7/dist-packages/pandas/core/window.py", line 1124, in validate
raise ValueError("closed only implemented for datetimelike "
ValueError: closed only implemented for datetimelike and offset based windows

@guoliy
Copy link

guoliy commented Apr 27, 2018

The same problem with @sronen71. Has it fixed?

@morganics
Copy link

morganics commented May 22, 2018

Additionally, when using Dask;

ddf['K'] = ddf[['A', 'B', 'C']].groupby(by='A').apply(lambda x: x.rolling('90d', on='B')['C'].sum(), meta=('K', 'float64'))

Returns the ValueError: B must be monotonic error. The groupby is done in Dask, but the rolling is in Pandas land.

@lycanthropes
Copy link

how about the progress?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 14, 2019 via email

@jreback
Copy link
Contributor

jreback commented Mar 14, 2019

this issue is fixed

@lycanthropes
Copy link

lycanthropes commented Mar 14, 2019

What I am looking for is to do monthly rolling. Now pandas still does not work with this issue. Such as:
df.groupby('stock_code')['ret_daily'].rolling(window='3M').std()

@lycanthropes
Copy link

I can't use window='90D' instead, because sometimes it means a look-ahead bias in strategy backtesting.

@GeorgievF
Copy link

This issue is not perfectly solved, as it does gives the same error of chrisaycock when a Multi-index is formed of the groupby object and the datetime one

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40),})
df.set_index(['A', 'B'], inplace=True)
df.groupby('A').rolling('4s', on='B').C.mean()

@jreback
Copy link
Contributor

jreback commented Feb 8, 2021

@GeorgievF pls file a new issue with a reproducible example and using latest pandas

@GeorgievF
Copy link

Done!
#39686

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.