API: Table-wise rolling / expanding / EWM function application #15095

Closed
TomAugspurger opened this issue Jan 10, 2017 · 11 comments · Fixed by #38417
Labels
API Design Window rolling, ewma, expanding
Milestone

Comments

@TomAugspurger
Contributor

TomAugspurger commented Jan 10, 2017

In #11603 (comment) (the main PR implementing the deferred API for rolling / expanding / ewm), we discussed how to specify table-wise applies. Groupby.apply(f) feeds the entire group (all columns) to f. For backwards-compatibility, .rolling(n).apply(f) needed to be column-wise.

#11603 (comment) mentions a possible API like what I added for .style

  • axis=0: apply to each column independently
  • axis=1: apply to each row independently
  • axis=None: apply the supplied function to the entire table

So it'd be df.rolling(n).apply(f, axis=None).
Do people like the axis=0 / 1 / None idiom? Is it obvious enough?

This was prompted by @josef-pkt's post on the mailing list about needing a rolling OLS.

An example:

In [2]: import numpy as np
   ...: import pandas as pd
   ...:
   ...: np.random.seed(0)
   ...: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=["A", "B"])
   ...: df
   ...:
Out[2]:
   A  B
0  5  0
1  3  3
2  7  9
3  3  5
4  2  4
5  7  6
6  8  8
7  1  6
8  7  7
9  8  1

For a concrete example, get the table-wise max (this is equivalent to df.rolling(4).max().max(1))

In [10]: df.rolling(4).apply(np.max, axis=None)
Out[10]:
0    NaN
1    NaN
2    NaN
3    9.0
4    9.0
5    9.0
6    8.0
7    8.0
8    8.0
9    8.0
dtype: float64
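
The equivalence claimed above can be checked against the existing API. This is a minimal sketch, assuming a plain loop over full windows stands in for the proposed `axis=None` behaviour (which does not exist in this form); the names `via_columns` and `manual` are illustrative only.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=["A", "B"])

window = 4

# Existing API: column-wise rolling max, then reduce across the columns.
via_columns = df.rolling(window).max().max(axis=1)

# Manual table-wise pass: hand each full 4x2 block of values to np.max.
values = df.to_numpy()
manual = pd.Series(np.nan, index=df.index)
for end in range(window, len(df) + 1):
    manual.iloc[end - 1] = np.max(values[end - window:end])

# Both approaches leave NaN where no full window is available yet.
print(manual.equals(via_columns))
```

For a reduction like max, the two-step column-wise form is enough; the table-wise form only becomes necessary when the function needs several columns at once.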

A real example is something like a rolling OLS:

import statsmodels.api as sm
f = lambda x: sm.OLS.from_formula('A ~ B', data=x).fit()  # wrong, but w/e

df.rolling(5).apply(f, axis=None)
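
Until something like this exists, the same result can be had by slicing the windows by hand. The following is a rough sketch, assuming `np.polyfit` as a stand-in for the statsmodels fit and returning only the slope of `A ~ B`; `ols_slope` is a hypothetical helper, not a pandas or statsmodels function.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=["A", "B"])


def ols_slope(window_df):
    """Slope of the least-squares line A ~ B for a single window."""
    x = window_df["B"].to_numpy(dtype=float)
    y = window_df["A"].to_numpy(dtype=float)
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope


window = 5
slopes = pd.Series(np.nan, index=df.index, name="slope")
for end in range(window, len(df) + 1):
    # Each call sees both columns of the window, which column-wise apply cannot offer.
    slopes.iloc[end - 1] = ols_slope(df.iloc[end - window:end])

print(slopes)
```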
@TomAugspurger TomAugspurger added this to the 0.20.0 milestone Jan 10, 2017
@jreback
Contributor

jreback commented Jan 10, 2017

can u put up a simple example with the various options exercised? (e.g. simulate the output)

@TomAugspurger
Contributor Author

Updated with an example.

I also changed the suggested API: Before I had

df.rolling(n, axis=None).apply(f)

But really it should be

df.rolling(n).apply(f, axis=None).

The .rolling(axis=.) parameter controls the direction for rolling. The .rolling(...).apply(f, axis=.) parameter controls the axis for function application.

@jreback
Contributor

jreback commented Jan 13, 2017

@TomAugspurger correct me if I am wrong, but what you really want is for .apply to be passed one of 2 cases.

  • a single column (now)
  • the entire table (option)

?
The other functions are only univariate so this doesn't matter.

but apply is pretty generic so we don't know what the user wants (but the original implementation was a single column)

@TomAugspurger
Contributor Author

TomAugspurger commented Jan 13, 2017

You're correct.

This should make things clear

In [9]: def f(x):
   ...:     print(x)
   ...:     return 0

In [8]: df = pd.DataFrame(np.arange(9).reshape(3, 3))

In [14]: df
Out[14]:
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8

Currently, and the default in the future, this prints out

In [10]: df.rolling(2).apply(f)
[ 0.  3.]
[ 3.  6.]
[ 1.  4.]
[ 4.  7.]
[ 2.  5.]
[ 5.  8.]

With the new implementation and axis=None, the printed output would be

In [10]: df.rolling(2).apply(f, axis=None)
[[0, 1, 2],  # first window; 2x3 array
 [3, 4, 5]]
[[3, 4, 5],  # second window; 2x3 array
 [6, 7, 8]]
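
To make the difference concrete without the new keyword, the two sets of windows can be enumerated with plain NumPy slicing. This is only a sketch of what `f` would receive in each mode, assuming full windows only; it is not how pandas builds its windows internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))
window = 2
values = df.to_numpy()

# Column-wise (current behaviour): one 1-D window per column per position.
column_windows = [
    values[end - window:end, col]
    for col in range(values.shape[1])
    for end in range(window, len(values) + 1)
]

# Table-wise (proposed): one 2-D window per position, spanning all columns.
table_windows = [
    values[end - window:end, :]
    for end in range(window, len(values) + 1)
]

print(len(column_windows), column_windows[0].shape)  # 6 windows, each 1-D of length 2
print(len(table_windows), table_windows[0].shape)    # 2 windows, each 2x3
```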

@jreback
Contributor

jreback commented Jan 14, 2017

@TomAugspurger I know you used axis=None this way in .style, but I personally find this a bit confusing.

I think it's better to follow our current model, IOW

receive a DataFrame df.rolling(...).apply(...)
receive a Series df.rolling(...).column.apply(...)

is very natural. This would be an API change, though even now I think we pass an ndarray.

another possibility is to have return_type = 'frame', 'series', 'ndarray' (with a default of None, so that we can make this change easier).

@dbivolaru

dbivolaru commented Mar 25, 2017

I ran into a similar issue with a rolling function that uses OLS internally and needs to return more than one column (eg. the confidence interval).

Would the test cases cover also df.groupby(level=...)['column'].rolling(...).apply(...) and is there a workaround for pre-0.20 versions that would prevent re-calculating the OLS twice ie. for each returned column?

Regarding the API, I think it should look like this:

def f(narray):
    res = sm.OLS(narray, ...).fit()
    m_min, m_max = res.conf_int(0.05)[0]
    return m_min, m_max

# Single column
df.groupby(level=...)['column'].rolling(...).apply(lambda x: f(x))

def g(exogen, endogen):
    res = sm.OLS(endogen, exogen).fit()  # sm.OLS takes (endog, exog) in that order
    m_min, m_max = res.conf_int(0.05)[0]
    return m_min, m_max

# Multiple columns
df.groupby(level=...).rolling(...).apply(lambda x: g(x['exogen'], x['endogen']))
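
As a possible workaround for the "fit once, return several columns" case, the windows can be sliced manually and the per-window results collected into a DataFrame. This is a sketch under assumptions: `rolling_table_apply` and `fit_line` are hypothetical helpers, `np.polyfit` stands in for the statsmodels fit, and slope and intercept stand in for the confidence-interval bounds. Groupby handling is left out.

```python
import numpy as np
import pandas as pd


def fit_line(window_df):
    """Fit endogen ~ exogen once and return several statistics together."""
    x = window_df["exogen"].to_numpy(dtype=float)
    y = window_df["endogen"].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return {"slope": slope, "intercept": intercept}


def rolling_table_apply(df, window, func):
    """Call func once per full window and collect the dict results as rows."""
    rows = {}
    for end in range(window, len(df) + 1):
        # Label each result with the index of the last row in its window.
        rows[df.index[end - 1]] = func(df.iloc[end - window:end])
    return pd.DataFrame.from_dict(rows, orient="index")


df = pd.DataFrame({"exogen": [0.0, 1.0, 2.0, 3.0, 4.0],
                   "endogen": [1.0, 3.0, 5.0, 7.0, 9.0]})
result = rolling_table_apply(df, window=3, func=fit_line)
print(result)
```

Rows before the first full window are omitted rather than filled with NaN; reindexing the result against `df.index` would restore them.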

@jreback
Contributor

jreback commented Mar 25, 2017

@dbivolaru

Would the test cases cover also df.groupby(level=...)['column'].rolling(...).apply(...) and is there a workaround for pre-0.20 versions that would prevent re-calculating the OLS twice ie. for each returned column?

This is just an idea. You are welcome to submit a patch for this.

@makmanalp
Contributor

I think it's better to follow our current model, IOW
receive a DataFrame df.rolling(...).apply(...)
receive a Series df.rolling(...).column.apply(...)
is very natural. This would be an API change, though even now I think we pass an ndarray.

I definitely agree with this - it fits well with everything else.

So is the idea here that because apply() currently works column-wise and not dataframe-wise on dataframe.rolling.apply(), we're kinda locked in now and don't want to break backwards compat, and we need a new API? Or are we just waiting for a patch and an opportune moment to release?

@TomAugspurger
Contributor Author

So is the idea here that because apply() currently works column-wise and not dataframe-wise on dataframe.rolling.apply(), we're kinda locked in now and don't want to break backwards compat, and we need a new API?

That's my opinion. We could maybe do this with a deprecation cycle with keywords.

@mroeschke
Member

2 thoughts here:

  1. I'm not sure if we should stuff this feature in the axis keyword; I think we should add a new parameter as I can see this being a possibility (from Tom's example). Maybe a how=None|'table' argument for None=1D, table=2D
# roll tablewise along rows
In [10]: df.rolling(2).apply(f, axis=0, how='table')
[[0, 1, 2],  # first window; 2x3 array
 [3, 4, 5]]
[[3, 4, 5],  # second window; 2x3 array
 [6, 7, 8]]

# roll tablewise along columns
In [10]: df.rolling(2).apply(f, axis=1, how='table')
[[0, 1],  # first window; 3x2 array
 [3, 4],
 [6, 7]]
[[1, 2],  # second window; 3x2 array
 [4, 5],
 [7, 8]]
  2. Implementation-wise, there might be some potential hurdles and complexities to consider:
  • Currently all windowing aggregations are calculated blockwise. This feature would probably need a dedicated code path that does the calculations over the rows/columns (easier if we eventually remove the block manager)
  • Currently, data types other than float or int are dropped. There's a consistency argument for aligning table-wise windowing with that behavior, but dropping data may make table-wise rolling less useful.

@mroeschke
Member

A proposal for the implementation would be:

  1. Add a new keyword method='table'|'column' in the rolling/ewm/expanding method to specify whether we are rolling over a column or the entire object
  2. Requires the engine='numba' keyword to be set in the aggregation function (otherwise, the existing Cython aggregation functions need an overhaul)
  3. Table-wise rolling requires a single float dtype
  4. (Mostly important for apply) the output of table-wise rolling will need to be 1 x number of columns for axis=0 and number of rows x 1 for axis=1

e.g.

df.rolling(2, method='table').apply(f, axis=1, engine='numba')
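
Points 3 and 4 of the proposal (single float dtype, one output value per column for axis=0) can be emulated in pure NumPy to see what contract a user function would have to satisfy. This is a sketch only: `table_rolling_apply` and `weighted_mean` are hypothetical names, and the loop below says nothing about how a numba-backed implementation would actually be written.

```python
import numpy as np
import pandas as pd


def table_rolling_apply(df, window, func):
    """Emulate the axis=0 contract: func gets a 2-D float window, returns one value per column."""
    if not all(dtype == np.float64 for dtype in df.dtypes):
        raise TypeError("this sketch expects float64 columns only")
    values = df.to_numpy()
    out = np.full(values.shape, np.nan)
    for end in range(window, len(values) + 1):
        row = np.asarray(func(values[end - window:end]), dtype=float)
        if row.shape != (values.shape[1],):
            raise ValueError("func must return one value per column")
        out[end - 1] = row
    return pd.DataFrame(out, index=df.index, columns=df.columns)


def weighted_mean(window):
    """Mean of every column, weighted by the last column of the window."""
    weights = window[:, -1]
    return (window * weights[:, None]).sum(axis=0) / weights.sum()


df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "w": [1.0, 1.0, 3.0]})
result = table_rolling_apply(df, window=2, func=weighted_mean)
print(result)
```

A weighted mean is a natural fit here because the weights live in a different column from the values, which a column-wise apply never sees.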

@jreback jreback added the Window rolling, ewma, expanding label Nov 25, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.3 Dec 18, 2020