Skip to content

ENH: Allow users to definite their own window bound calculations in rolling #29878

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 38 commits into from
Dec 5, 2019

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke commented Nov 27, 2019

Currently rolling allows integer and offset as window arguments. This PR allows users to define a custom method for calculating window boundaries by passing a subclass of a new BaseIndexer

from pandas.api.indexers import BaseIndexer

class MyWindowIndexer(BaseIndexer):

    def get_window_bounds(self, ...):
          ....
          return (start, end)

indexer = MyWindowIndexer(arg1, arg2, ...)
result = df.rolling(indexer).mean()

Could help address more custom rolling window operations as described in #26959, #13969, #25510, #24295

  • documented new feature
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@mroeschke mroeschke added Enhancement Window rolling, ewma, expanding labels Nov 27, 2019
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea. Can you add test(s)?

**kwargs :
keyword argument that will be available when get_window_bounds is called
"""
self.__dict__.update(kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this doing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah can we avoid this pattern?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sorry, it's a bit cryptic.

The motivation is to allow a way for get_window_bounds to access an external variable that may not be passed in rolling or the aggregation function by making it accessible as a class attribute.

For example:

class MyIndexer(BaseIndexer):

    def get_window_bounds(self, ...):
        start = self.external_column + min_periods
        ....

indexer = MyIndexer(external_column=df['external_column'])

Could make this pattern in the __init__ more explicit and add validation.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if u can test with a bad BaseIndexer

eg where get_windows_bounds raises or is missing
we would likely give an odd error message

unit tests otherwise are not needed otherwise except as an explicit smoke test; we are already using this in all rolling tests

Computes the fixed bounds of a window.

Parameters
----------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think could share these docstrings

"""
Return an indexer class that will compute the window start and end bounds
"""
if isinstance(self.window, window_indexers.BaseIndexer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically we don’t actually care if this is a BaseIndexer

rather we care that it has a get_windows_bound with the correct signature

@jbrockmendel
Copy link
Member

so far the only thing in indexers.pyx for which being cython really matters is VariableWindowIndexer.build, which is a static method, so could just as easily be a module-level function. Do these classes need to be in the cython code?

@mroeschke mroeschke added this to the 1.0 milestone Nov 27, 2019
@mroeschke
Copy link
Member Author

Do these classes need to be in the cython code?

Good point @jbrockmendel. I can move these indexers to pandas/core/window/indexers.py and just import "VariableWindowIndexer.build" as a function from pandas/_libs/window/indexers.py

@pep8speaks
Copy link

pep8speaks commented Nov 27, 2019

Hello @mroeschke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-12-04 21:56:57 UTC

def calculate_variable_window_bounds(
int64_t num_values,
int64_t window_size,
object min_periods,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like min_periods is not used. is this a dummy to make the signature match something else? if so, pls comment/docstring

@@ -400,13 +421,15 @@ def _get_cython_func_type(self, func):
self._get_roll_func("{}_fixed".format(func)), win=self._get_window()
)

def _get_window_indexer(self):
def _get_window_indexer(self, index_as_array):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can index_as_array or the return type be annotated?

win_type: Optional[str] = None,
) -> Tuple[np.ndarray, np.ndarray]:

raise NotImplementedError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AbstractMethodError?

**kwargs :
keyword argument that will be available when get_window_bounds is called
"""
self.index = index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name index makes me expect a pd.Index. Is there a less ambiguous name?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. not using ExpandingIndexer right now, correct?

object center, # unused but here to match get_window_bounds signature
object closed,
const int64_t[:] index
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string

min_periods = calculate_min_periods(
window, self.min_periods, len(x), require_min_periods, floor
if not isinstance(window, BaseIndexer):
min_periods = calculate_min_periods(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do these outside of the function? (as they don't depend on anything inside) and makes logic simpler

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These both depend on x after the nans are appended from the center calculation.
x = np.concatenate((x, additional_nans))

start, end = window_indexer(
x, window, self.closed, index_as_array
).get_window_bounds()
if np.any(np.diff(start) < 0) or np.any(np.diff(end) < 0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just set is_monotonic_bounds=np.any(.......)

and pass that in line 482

Copy link
Member Author

@mroeschke mroeschke Dec 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need to add is_monotonic_bounds as an argument to all the "fixed" functions in Cython (didn't include them originally since the "fixed" function doesn't need this argument), but I suppose that makes this cleaner.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I cannot easily modify this yet without another refactor. func here isn't always a cython function that could accept is_monotonic_bounds; sometimes it's a python wrapped function like apply_func or _zqrt. I can look into this in another refactor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k, followup is ok

@mroeschke
Copy link
Member Author

not using ExpandingIndexer right now, correct?

@jreback. Correct, just exists for testing purposes.

@mroeschke mroeschke changed the title WIP: ENH: Allow users to definite their own window bound calculations in rolling ENH: Allow users to definite their own window bound calculations in rolling Dec 3, 2019
@@ -198,11 +199,10 @@ def roll_sum_variable(ndarray[float64_t] values, ndarray[int64_t] start,
s = start[i]
e = end[i]

if i == 0:
if i == 0 or not is_monotonic_bounds:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we do a simple calculcate in start/end to determine if is_monotonic is true here? it likely is quite performant.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. ping on green.

@TomAugspurger @jorisvandenbossche if any comments

self._validate_get_window_bounds_signature(self.window)

@staticmethod
def _validate_get_window_bounds_signature(window):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can type

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a comment on the docs. Happy to defer to others on the implementation.

@mroeschke
Copy link
Member Author

All green @jreback

Collecting followups:
#29878 (comment)

@jreback jreback merged commit 1f63971 into pandas-dev:master Dec 5, 2019
@jreback
Copy link
Contributor

jreback commented Dec 5, 2019

thanks @mroeschke very nice

@mroeschke mroeschke deleted the feature/baseindexer branch December 5, 2019 16:17
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
@gandhis1
Copy link
Contributor

Apologies for the very late PR comment, but shouldn't rolling() default to the VariableWindowIndexer when the passed-in offset is a variable-frequency offset, rather than throwing as it does now?

@mroeschke
Copy link
Member Author

Apologies for the very late PR comment, but shouldn't rolling() default to the VariableWindowIndexer when the passed-in offset is a variable-frequency offset, rather than throwing as it does now?

Could you show an example?

@gandhis1
Copy link
Contributor

Python 3.8.13 (default, Mar 28 2022, 11:38:47) 
Type 'copyright', 'credits' or 'license' for more information
IPython 8.3.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd
   ...: from pandas.tseries.holiday import USFederalHolidayCalendar
   ...: 
   ...: s = pd.Series(range(10), pd.date_range("7/1/2022", "7/10/2022"))
   ...: bday = pd.tseries.offsets.CustomBusinessDay(n=1, calendar=USFederalHolidayCalendar())
   ...: 

In [2]: s.rolling(bday).median()
ValueError: <CustomBusinessDay> is a non-fixed frequency

In [3]: s.rolling(pd.api.indexers.VariableOffsetWindowIndexer(index=s.index, offset=bday)).median()
Out[3]: 
2022-07-01    0.0
2022-07-02    1.0
2022-07-03    1.5
2022-07-04    2.0
2022-07-05    2.5
2022-07-06    5.0
2022-07-07    6.0
2022-07-08    7.0
2022-07-09    8.0
2022-07-10    8.5
Freq: D, dtype: float64

@mroeschke
Copy link
Member Author

VariableOffsetWindowIndexer is not as performant as the prior implementation for fixed offsets, so I don't think the default could be switched. VariableOffsetWindowIndexer was an enhancement to support non-fixed offsets

@gandhis1
Copy link
Contributor

gandhis1 commented Jul 15, 2022

I'm not suggesting changing the default. My suggestion is that if the "X is a non-fixed frequency" exception is being thrown, it has detected that the passed offset is non-fixed, and thus could intelligently fall back to the variable indexer automatically. I'd think a typical user who is passing in a non-fixed offset is doing so deliberately and isn't wanting an exception.

@mroeschke
Copy link
Member Author

I think a pull request to improve exception message to suggest using VariableOffsetWindowIndexer would be appreciated. Generally fallback & non-explicit behaviors are preferred to avoid surprises.

@gandhis1
Copy link
Contributor

The code located here just raises:

# this will raise ValueError on non-fixed freqs
try:
freq = to_offset(self.window)
except (TypeError, ValueError) as err:
raise ValueError(
f"passed window {self.window} is not "
"compatible with a datetimelike index"
) from err

But this could just be another clause right here that sets self.window to a VariableOffsetWindowIndexer that wraps the original self.window, then from then on it's pretty much the same thing as passing in a BaseIndexer, except now the onus isn't on the user to have to do this:

elif isinstance(self.window, BaseIndexer):
# Passed BaseIndexer subclass should handle all other rolling kwargs
pass

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants