Skip to content

ENH: allow exact matches in time-based .rolling() #13968

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 12 commits into from
Closed

ENH: allow exact matches in time-based .rolling() #13968

wants to merge 12 commits into from

Conversation

agraboso
Copy link
Contributor

TODO: update docs.

@agraboso agraboso changed the title EHN: allow exact matches in time-based .rolling() ENH: allow exact matches in time-based .rolling() Aug 11, 2016
# this will raise ValueError on non-fixed freqs
self.window = freq.nanos
if self.left_closed:
self.window += 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, the logic adds one extra nanosecond to the search space?

Copy link
Contributor Author

@agraboso agraboso Aug 11, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU, offset-based rolling windows rely on the conversion to nanoseconds in the line right above, so adding one not only works but is the optimal strategy — it includes the left endpoint and nothing else (since nanosecond is the smallest time resolution available).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test or two with values at finer resolutions than the offset to confirm this? for example

In [11]: df = pd.DataFrame({'B': [1, 1, 1]},
    ...:                   index = [pd.Timestamp('20130101 09:00:30'),
    ...:                            pd.Timestamp('20130101 09:01:00'),
    ...:                            pd.Timestamp('20130101 09:02:00')])

In [12]: df.rolling('1Min').sum()
Out[12]: 
                       B
2013-01-01 09:00:30  1.0
2013-01-01 09:01:00  2.0
2013-01-01 09:02:00  1.0

Copy link
Contributor Author

@agraboso agraboso Aug 11, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chrisaycock
Copy link
Contributor

chrisaycock commented Aug 11, 2016

The code looks pretty good. We'll need documentation to explain what left_closed is (which you list in your TODO).

@agraboso
Copy link
Contributor Author

The code looks pretty good.

Thanks!

We'll need documentation to explain what left_closed is.

Yes, that'd be the lion's share of this.

I'd like to get the opinion of at least a couple of people about the word left_closed. I chose it because allow_exact_matches seemed long and non-obvious to me. But I fear that people not reading the documentation (which would explain that it only applies to offset-based windows) might misinterpret it. Another option I considered was include_start_time.

@chrisaycock
Copy link
Contributor

Yes, some sort of include_... might also work. I originally requested allow_exact_matches since that's the parameter in the new pd.merge_asof(), but nobody else seemed to like that suggestion.

@agraboso
Copy link
Contributor Author

I originally requested allow_exact_matches since that's the parameter in the new pd.merge_asof(), but nobody else seemed to like that suggestion.

It might make sense there (I haven't looked at pd.merge_asof() yet), but I find allow_exact_matches very obscure in this context.

@codecov-io
Copy link

codecov-io commented Aug 11, 2016

Current coverage is 84.65% (diff: 100%)

Merging #13968 into master will increase coverage by <.01%

@@             master     #13968   diff @@
==========================================
  Files           144        144          
  Lines         51016      51023     +7   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43184      43191     +7   
  Misses         7832       7832          
  Partials          0          0          

Powered by Codecov. Last update 3ccb501...62ff604

@@ -1044,18 +1055,26 @@ def validate(self):
"for datetimelike and offset "
"based windows")

if self.left_closed is not None and not is_bool(self.left_closed):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add tests for invalid input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sinhrks
Copy link
Member

sinhrks commented Aug 11, 2016

Pls also add tests with PeriodIndex and TimedeltaIndex (the latter should raise an error)

@@ -5382,11 +5382,12 @@ def _add_series_or_dataframe_operations(cls):
from pandas.core import window as rwindow

@Appender(rwindow.rolling.__doc__)
def rolling(self, window, min_periods=None, freq=None, center=False,
win_type=None, on=None, axis=0):
def rolling(self, window, min_periods=None, left_closed=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this after on as its a new parameter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jreback
Copy link
Contributor

jreback commented Aug 11, 2016

we have the convention in pd.date_range (and like)

closed : string or None, default None
    Make the interval closed with respect to the given frequency to
    the 'left', 'right', or both sides (None)

This makes more sense to me. default would be right

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Difficulty Intermediate and removed Difficulty Intermediate labels Aug 11, 2016
@agraboso
Copy link
Contributor Author

we have the convention in pd.date_range

Are you speaking in favor of something different from left_closed?

@agraboso
Copy link
Contributor Author

@sinhrks

Pls also add tests with PeriodIndex and TimedeltaIndex (the latter should raise an error)

Not sure what you mean. Could you please be more precise?

@jreback
Copy link
Contributor

jreback commented Aug 11, 2016

@agraboso no, I think we should stick with the spelling: closed='left'|'right'|'both'.

@agraboso
Copy link
Contributor Author

@jreback: the problem is that closed=left doesn't make sense with the current implementation, while closed=right is the default and closed=both is what we are trying to achieve.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 11, 2016

@jreback the issue with following that spelling of closed='left'|'right'|'both' is that left does not really makes sense in the case of rolling I think?
But if it makes sense, then it is indeed maybe good to follow that

EDIT: well, so what @agraboso said :-)

@jreback
Copy link
Contributor

jreback commented Aug 11, 2016

only allow right and both then

I think changing to a non standard argument is even more confusing

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 11, 2016

I proposed closed=True|False in #13965 (comment) to keep the keyword name, but closed='right'|'both' (without left) is then maybe better to keep the passed values also consistent.

@agraboso
Copy link
Contributor Author

agraboso commented Aug 12, 2016

I'm open to whatever the consensus might be on the name. Github doesn't have any voting system AFAIK, so I'll add comments below for the proposed options and ask people to vote for them. We can, of course, have a discussion afterwards.

@agraboso
Copy link
Contributor Author

Option: allow_exact_matches=False|True, default False. Please like to vote for this option.

@agraboso
Copy link
Contributor Author

Option: left_closed=False|True, default False. Please like to vote for this option.

@@ -377,6 +381,12 @@ class Window(_Window):

axis : int or string, default 0

.. versionadded:: 0.19.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> 0.20.0

@@ -455,6 +465,17 @@ class Window(_Window):
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0

For time-based windows, it is possible to make the resulting window
contain its left edge by setting closed='both'.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you show closed='right', for comparison

@jreback
Copy link
Contributor

jreback commented Dec 20, 2016

lgtm (slight doc correction). pls add a whatsnew for 0.20.0

cc @chrisaycock

@jreback
Copy link
Contributor

jreback commented Dec 30, 2016

@agraboso pls add a whatsnew for 0.20.0 (other enhancements)

@jreback
Copy link
Contributor

jreback commented Feb 27, 2017

can you rebase / update

@chrisaycock
Copy link
Contributor

Looking forward to having this added to 0.20.0. It just needs a whatsnew entry plus a fix to the "0.19.0" versionadded entry.

@jreback
Copy link
Contributor

jreback commented Feb 27, 2017

yes just needs minor changes. if @agraboso doesn't fix this up in next couple of days I'll fix on merge.

@jreback jreback added this to the 0.20.0 milestone Feb 27, 2017
@jreback
Copy link
Contributor

jreback commented Feb 27, 2017

actually prob need some slight doc updates as well (doc-string, maybe in computation.rst)

@jreback
Copy link
Contributor

jreback commented Mar 14, 2017

@chrisaycock any other doc updates needed here? (e.g. maybe in merge.rst)?

@chrisaycock
Copy link
Contributor

It still needs the versionadded update, the whatsnew entry, and possibly something in computation.rst. I doubt there should be an entry in merge.rst.

@jreback
Copy link
Contributor

jreback commented Mar 14, 2017

@chrisaycock sorry, yes, meant computation.rst

@jreback
Copy link
Contributor

jreback commented Mar 14, 2017

@agraboso if you can update we can get this in.

@carlosdanielcsantos
Copy link
Contributor

I do think the closed='left' makes sense as well. I am developing a project where I am facing the problem of getting rolling time-window statistics for each row without including the current row in the calculations. For me to actual get to this result using the result of rolling() is becoming a nightmare so I was already thinking that changing some lines of code inside pandas would probably be a lot easier.

@jreback
Copy link
Contributor

jreback commented Mar 15, 2017

@carlosdanielcsantos so we have closed='right'|'both', are you advocating for adding closed='left' as well? (meaning NOT include the rhs endpoint).?

@carlosdanielcsantos
Copy link
Contributor

carlosdanielcsantos commented Mar 15, 2017

@jreback If by not including the rhs endpoint you mean not including the "right side" endpoint, yes. As in ti <= t < tf. For the sake of completeness, we could add the option to not include any endpoint at all, ti < t < tf, with something like closed='none'.

Here is a quick example of what it would look like

df = pd.DataFrame({'x': [1]*5},
....:                 index = [pd.Timestamp('20130101 09:00:01'),
....:                          pd.Timestamp('20130101 09:00:02'),
....:                          pd.Timestamp('20130101 09:00:03'),
....:                          pd.Timestamp('20130101 09:00:04'),
....:                          pd.Timestamp('20130101 09:00:06')])

df["right"] = df.rolling('2s', closed='right').x.sum()   # current behavior
df["both"] = df.rolling('2s', closed='both').x.sum()
df["left"] = df.rolling('2s', closed='left').x.sum()
df["open"] = df.rolling('2s', closed='none').x.sum()

print(df)
                     x  right  both  left  open
2013-01-01 09:00:01  1    1.0   1.0   NaN   NaN
2013-01-01 09:00:02  1    2.0   2.0   1.0   1.0
2013-01-01 09:00:03  1    2.0   3.0   2.0   1.0
2013-01-01 09:00:04  1    2.0   3.0   2.0   1.0
2013-01-01 09:00:06  1    1.0   2.0   1.0   NaN

@jreback
Copy link
Contributor

jreback commented Mar 15, 2017

@carlosdanielcsantos aside from none -> neither that might be reasonable.

@chrisaycock @agraboso thoughts?

@chrisaycock
Copy link
Contributor

I could definitely see closed as 'neither', 'left', 'right', or 'both'.

@jreback
Copy link
Contributor

jreback commented Mar 15, 2017

I think then we should directly push closed parameter to cython and change to handle these 4 different cases.

Its possible to do shifting on the window (as is done here), and actually shift the data (to make 'left' and 'neither' work), but I think this will be more transparent.

@carlosdanielcsantos
Copy link
Contributor

@jreback sounds good to me. I already started working on it and got the logic right on the code you pointed out. I hope @agraboso doesn't mind if I pick some lines from here to my branch.
Looking at it now, I don't find a reason why this feature should limit itself to time-based windows. Although you can get the 'both' case by increasing the window size, the 'left' and 'neither' cases would be a nice addition to situations where you want to ignore the right endpoint, which is currently not possible.

@jreback
Copy link
Contributor

jreback commented Mar 17, 2017

@carlosdanielcsantos ok by me!

sure, on non-time based are ok as well. you'll want to add a sub-section to whatsnew showing this (as well as a similar section in the docs themselves & doc-strings).

@agraboso
Copy link
Contributor Author

@carlosdanielcsantos I don't mind at all. Hack away!

@jreback jreback removed this from the 0.20.0 milestone Mar 23, 2017
@jreback
Copy link
Contributor

jreback commented Mar 23, 2017

@carlosdanielcsantos any luck with this?

@carlosdanielcsantos
Copy link
Contributor

carlosdanielcsantos commented Mar 23, 2017

@jreback already wrote tests and the code on window.pyx. Now trying to figure out the point to which I need to backpropagate the closed argument in window.py. Some thoughts on this?

@jreback
Copy link
Contributor

jreback commented Mar 23, 2017

which I need to backpropagate the closed argument in window.py. Some thoughts on this?

do you have a branch that you can post

@carlosdanielcsantos
Copy link
Contributor

@jreback just commited the last changes in which the feature is working for time-based windows (should be quite easy now for fixed windows). Take a look at the code here and let me know what should improved.

@jreback
Copy link
Contributor

jreback commented Mar 23, 2017

@carlosdanielcsantos sure, why don't you make a PR (easier to comment on). at a quick glance looks reasonable.

@jreback
Copy link
Contributor

jreback commented Mar 23, 2017

superseded by #15795

@jreback jreback closed this Mar 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Resample resample method Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exact matches in time-based .rolling()
8 participants