Skip to content

Rolling sum (closed='left') with duplicate Timestamp indices. #20712

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
avivbd opened this issue Apr 16, 2018 · 10 comments · Fixed by #54917
Closed

Rolling sum (closed='left') with duplicate Timestamp indices. #20712

avivbd opened this issue Apr 16, 2018 · 10 comments · Fixed by #54917
Labels
Bug Window rolling, ewma, expanding

Comments

@avivbd
Copy link

avivbd commented Apr 16, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

n = 5
t1 = pd.date_range('1/1/2011', periods=n, freq='H')
v1 = range(n)

t2 = pd.date_range('1/1/2011', periods=n, freq='H')
v2 = range(n, 2*n)

index = np.sort(np.hstack([t1.values, t2.values]))
data = np.sort(np.hstack([v1, v2]))
s = pd.Series(data=data, index=index)
print(s)

2011-01-01 00:00:00 0
2011-01-01 00:00:00 1
2011-01-01 01:00:00 2
2011-01-01 01:00:00 3
2011-01-01 02:00:00 4
2011-01-01 02:00:00 5
2011-01-01 03:00:00 6
2011-01-01 03:00:00 7
2011-01-01 04:00:00 8
2011-01-01 04:00:00 9

Unexpected behavior: rolling treats duplicate times as consecutive

four_hours = pd.Timedelta('4h')
g = s.rolling(four_hours,closed='left').sum()
print(g)
2011-01-01 00:00:00     NaN
2011-01-01 00:00:00     0.0
2011-01-01 01:00:00     1.0
2011-01-01 01:00:00     3.0
2011-01-01 02:00:00     6.0
2011-01-01 02:00:00    10.0
2011-01-01 03:00:00    15.0
2011-01-01 03:00:00    21.0
2011-01-01 04:00:00    28.0
2011-01-01 04:00:00    36.0

Problem description

Rolling treats duplicate times as consecutive rather than equal.

Expected Output

gg = s.groupby(s.index).sum().rolling(four_hours, closed='left').sum()
ggg = pd.concat([gg, gg]).sort_index()
print(ggg)
2011-01-01 00:00:00     NaN
2011-01-01 00:00:00     NaN
2011-01-01 01:00:00     1.0
2011-01-01 01:00:00     1.0
2011-01-01 02:00:00     6.0
2011-01-01 02:00:00     6.0
2011-01-01 03:00:00    15.0
2011-01-01 03:00:00    15.0
2011-01-01 04:00:00    28.0
2011-01-01 04:00:00    28.0

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.1
pytest: 3.4.0
pip: 9.0.3
setuptools: 36.6.0
Cython: None
numpy: 1.13.3
scipy: 0.19.0
xarray: 0.9.5
IPython: 5.5.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.1.0
pandas_gbq: None
pandas_datareader: None
import sys; print('Python %s on %s' % (sys.version, sys.platform))

@WillAyd
Copy link
Member

WillAyd commented Apr 16, 2018

I'm not sure I agree with your expected output - is there a generalizable use case you are trying to cover here?

@avivbd
Copy link
Author

avivbd commented Apr 17, 2018

Let me be more explicit in my logic. I have events (indexed by time) and each event has a duration. For every event I'm interested in calculating the sum of the durations of all preceding events over a four hour window. I am also interested in excluding the duration of the present event from that running sum. Thus, I use the rolling method with closed='left'.

Some events occur simultaneously, but they are not identical as they have different durations. Yet, they share the same prior events, and thus should have the same total duration of all preceding events.

However, this is not the result that I get when I apply rolling. I get that simultaneously occurring events are treated as consecutive. If I switched their order I would get a different result, which is absurd. The total duration of all events preceding two simultaneous events should be the same, and the calculation should reflect that.

@jiehuan
Copy link

jiehuan commented Oct 25, 2018

I have the same issue here. The 'closed' parameter is not really working when the value/date is duplicated.

Input:
index date amount
323081 2016-06-30 26,238.600
398994 2016-06-30 17,044.800
405077 2016-08-09 549.900
440495 2016-09-29 1,625.400
636274 2017-06-30 27,453.600
638552 2017-06-30 19,816.800

rolling(window='90d', on='date', closed='neither')

Output:
index amount
323081 nan
398994 26,238.600
405077 43,283.400
440495 549.900
636274 nan
638552 27,453.600

Expected output:
index amount
323081 nan
398994 nan
405077 43,283.400
440495 549.900
636274 nan
638552 nan

@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action Window rolling, ewma, expanding labels Jan 13, 2019
@vtoliveira
Copy link

I am having the same problem, have anyone discussed it already?

@cvheerden
Copy link

Same issue here. Multiple entries on the same day, and would like to calculate the sum of the last 30 days, EXCLUDING the entries on that day (having an open right parameter here). If I have two entries on 2019-05-10, I would then expect their 30 day rolling sums to be equal, but right now the second entry on 2019-05-10's 30d sum includes the previous entry on 2019-05-10.

@ramondalmau
Copy link

Same issue here. Would really appreciate to find a workaround/solution

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 17, 2019 via email

@e05wz
Copy link

e05wz commented Jun 8, 2019

Same issue as explained in this comment. I think the original behavior also has some use scenarios so it would be better to add a new keyword argument to support this expected behaviour.

@mateusccoelho
Copy link

mateusccoelho commented Feb 26, 2021

I'm also facing this problem. In my case I want to compute the number of unique values in each window. The workaround proposed by OP doesn't work because I have to use the unique function in the groupby, which returns lists. I can create a lambda that counts the number of unique values in the lists, but it happens that rolling().apply() doesn't support object columns. So, currently there's no way of doing that without using two for loop O(n^2).

@romanovzky
Copy link

I would also like the feature whereas setting closed="left" can provide the aggregation all entries prior when we have collision dates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Window rolling, ewma, expanding
Projects
None yet