Rolling sum (closed='left') with duplicate Timestamp indices. #20712

avivbd · 2018-04-16T13:08:14Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

n = 5
t1 = pd.date_range('1/1/2011', periods=n, freq='H')
v1 = range(n)

t2 = pd.date_range('1/1/2011', periods=n, freq='H')
v2 = range(n, 2*n)

index = np.sort(np.hstack([t1.values, t2.values]))
data = np.sort(np.hstack([v1, v2]))
s = pd.Series(data=data, index=index)
print(s)

2011-01-01 00:00:00 0
2011-01-01 00:00:00 1
2011-01-01 01:00:00 2
2011-01-01 01:00:00 3
2011-01-01 02:00:00 4
2011-01-01 02:00:00 5
2011-01-01 03:00:00 6
2011-01-01 03:00:00 7
2011-01-01 04:00:00 8
2011-01-01 04:00:00 9

Unexpected behavior: rolling treats duplicate times as consecutive

four_hours = pd.Timedelta('4h')
g = s.rolling(four_hours,closed='left').sum()
print(g)
2011-01-01 00:00:00     NaN
2011-01-01 00:00:00     0.0
2011-01-01 01:00:00     1.0
2011-01-01 01:00:00     3.0
2011-01-01 02:00:00     6.0
2011-01-01 02:00:00    10.0
2011-01-01 03:00:00    15.0
2011-01-01 03:00:00    21.0
2011-01-01 04:00:00    28.0
2011-01-01 04:00:00    36.0

Problem description

Rolling treats duplicate times as consecutive rather than equal.

Expected Output

gg = s.groupby(s.index).sum().rolling(four_hours, closed='left').sum()
ggg = pd.concat([gg, gg]).sort_index()
print(ggg)

2011-01-01 00:00:00     NaN
2011-01-01 00:00:00     NaN
2011-01-01 01:00:00     1.0
2011-01-01 01:00:00     1.0
2011-01-01 02:00:00     6.0
2011-01-01 02:00:00     6.0
2011-01-01 03:00:00    15.0
2011-01-01 03:00:00    15.0
2011-01-01 04:00:00    28.0
2011-01-01 04:00:00    28.0

Output of `pd.show_versions()`

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.1
pytest: 3.4.0
pip: 9.0.3
setuptools: 36.6.0
Cython: None
numpy: 1.13.3
scipy: 0.19.0
xarray: 0.9.5
IPython: 5.5.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.1.0
pandas_gbq: None
pandas_datareader: None
import sys; print('Python %s on %s' % (sys.version, sys.platform))

The text was updated successfully, but these errors were encountered:

WillAyd · 2018-04-16T22:57:25Z

I'm not sure I agree with your expected output - is there a generalizable use case you are trying to cover here?

avivbd · 2018-04-17T03:48:38Z

Let me be more explicit in my logic. I have events (indexed by time) and each event has a duration. For every event I'm interested in calculating the sum of the durations of all preceding events over a four hour window. I am also interested in excluding the duration of the present event from that running sum. Thus, I use the rolling method with closed='left'.

Some events occur simultaneously, but they are not identical as they have different durations. Yet, they share the same prior events, and thus should have the same total duration of all preceding events.

However, this is not the result that I get when I apply rolling. I get that simultaneously occurring events are treated as consecutive. If I switched their order I would get a different result, which is absurd. The total duration of all events preceding two simultaneous events should be the same, and the calculation should reflect that.

jiehuan · 2018-10-25T15:58:19Z

I have the same issue here. The 'closed' parameter is not really working when the value/date is duplicated.

Input:
index date amount
323081 2016-06-30 26,238.600
398994 2016-06-30 17,044.800
405077 2016-08-09 549.900
440495 2016-09-29 1,625.400
636274 2017-06-30 27,453.600
638552 2017-06-30 19,816.800

rolling(window='90d', on='date', closed='neither')

Output:
index amount
323081 nan
398994 26,238.600
405077 43,283.400
440495 549.900
636274 nan
638552 27,453.600

Expected output:
index amount
323081 nan
398994 nan
405077 43,283.400
440495 549.900
636274 nan
638552 nan

vtoliveira · 2019-03-08T21:25:33Z

I am having the same problem, have anyone discussed it already?

cvheerden · 2019-05-10T17:18:21Z

Same issue here. Multiple entries on the same day, and would like to calculate the sum of the last 30 days, EXCLUDING the entries on that day (having an open right parameter here). If I have two entries on 2019-05-10, I would then expect their 30 day rolling sums to be equal, but right now the second entry on 2019-05-10's 30d sum includes the previous entry on 2019-05-10.

ramondalmau · 2019-05-17T07:27:10Z

Same issue here. Would really appreciate to find a workaround/solution

TomAugspurger · 2019-05-17T14:05:31Z

This does look like a bug. Let us know if you're interested in investigating.

…

On Fri, May 17, 2019 at 2:27 AM ramondalmau ***@***.***> wrote: Same issue here. Would really appreciate to find a workaround/solution — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#20712?email_source=notifications&email_token=AAKAOIQLLR6VFKFV2UAPZUDPVZM5TA5CNFSM4E2Z63M2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVT7GNA#issuecomment-493351732>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAKAOIQJG7PGS6BFRDWXKETPVZM5TANCNFSM4E2Z63MQ> .

e05wz · 2019-06-08T14:55:13Z

Same issue as explained in this comment. I think the original behavior also has some use scenarios so it would be better to add a new keyword argument to support this expected behaviour.

mateusccoelho · 2021-02-26T01:32:50Z

I'm also facing this problem. In my case I want to compute the number of unique values in each window. The workaround proposed by OP doesn't work because I have to use the unique function in the groupby, which returns lists. I can create a lambda that counts the number of unique values in the lists, but it happens that rolling().apply() doesn't support object columns. So, currently there's no way of doing that without using two for loop O(n^2).

romanovzky · 2022-03-08T16:00:08Z

I would also like the feature whereas setting closed="left" can provide the aggregation all entries prior when we have collision dates.

mroeschke added Needs Discussion Requires discussion from core team before further action Window rolling, ewma, expanding labels Jan 13, 2019

ghost mentioned this issue Jun 26, 2019

rolling(offset) treats duplicated Timestamps inconsistently between closed=left/right; #27057

Closed

mroeschke added Bug and removed Needs Discussion Requires discussion from core team before further action labels May 8, 2020

mroeschke mentioned this issue Oct 5, 2020

'closed' parameter is not working for Rolling with duplicate Timestamp #23335

Closed

rosagold mentioned this issue Oct 9, 2021

BUG: rolling on centered datetimelike windows disrespect duplicate indices #43944

Closed

3 tasks

ihsansecer mentioned this issue Sep 1, 2023

BUG: Fix Rolling where duplicate datetimelike indexes are treated as consecutive rather than equal with closed='left' and closed='neither' #54917

Merged

5 tasks

mroeschke closed this as completed in #54917 Sep 5, 2023

bnavigator added a commit to bnavigator/pandas that referenced this issue Sep 7, 2023

Add whatsnew for v2.2.0 fixing pandas-devgh-20712

db63e06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rolling sum (closed='left') with duplicate Timestamp indices. #20712

Rolling sum (closed='left') with duplicate Timestamp indices. #20712

avivbd commented Apr 16, 2018 •

edited by TomAugspurger

Loading

INSTALLED VERSIONS

WillAyd commented Apr 16, 2018

avivbd commented Apr 17, 2018

jiehuan commented Oct 25, 2018

vtoliveira commented Mar 8, 2019

cvheerden commented May 10, 2019

ramondalmau commented May 17, 2019

TomAugspurger commented May 17, 2019 via email

e05wz commented Jun 8, 2019 •

edited

Loading

mateusccoelho commented Feb 26, 2021 •

edited

Loading

romanovzky commented Mar 8, 2022

Rolling sum (closed='left') with duplicate Timestamp indices. #20712

Rolling sum (closed='left') with duplicate Timestamp indices. #20712

Comments

avivbd commented Apr 16, 2018 • edited by TomAugspurger Loading

Code Sample, a copy-pastable example if possible

Unexpected behavior: rolling treats duplicate times as consecutive

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

WillAyd commented Apr 16, 2018

avivbd commented Apr 17, 2018

jiehuan commented Oct 25, 2018

vtoliveira commented Mar 8, 2019

cvheerden commented May 10, 2019

ramondalmau commented May 17, 2019

TomAugspurger commented May 17, 2019 via email

e05wz commented Jun 8, 2019 • edited Loading

mateusccoelho commented Feb 26, 2021 • edited Loading

romanovzky commented Mar 8, 2022

avivbd commented Apr 16, 2018 •

edited by TomAugspurger

Loading

Output of `pd.show_versions()`

e05wz commented Jun 8, 2019 •

edited

Loading

mateusccoelho commented Feb 26, 2021 •

edited

Loading