
BUG: Fix weighted rolling variance implementation #55153


Closed

Conversation

@ihsansecer (Contributor) commented Sep 15, 2023


Outputs from the code snippets in the linked issues are shown below.

Issue #53273

Code

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.Series(np.random.normal(size=1200))
df_std_stock = (
    df.rolling(window=220, min_periods=100, win_type='exponential')
    .std(tau=-(25 / np.log(2)), center=0, sym=False)
)

df_std_custom = (
    (df ** 2).rolling(window=220, min_periods=100, win_type='exponential')
    .mean(tau=-(25 / np.log(2)), center=0, sym=False)
)
df_std_custom_mean = (
    df.rolling(window=220, min_periods=100, win_type='exponential')
    .mean(tau=-(25 / np.log(2)), center=0, sym=False)
)
df_std_custom = (df_std_custom - df_std_custom_mean ** 2).pow(0.5)

plt.plot(df_std_stock.index, df_std_stock.values, label="df_std_stock", linewidth=3)
plt.plot(df_std_custom.index, df_std_custom.values, label="df_std_custom", linestyle='--')
plt.legend()
plt.show()

Output

[Plot: df_std_stock (solid) and df_std_custom (dashed) over the series index]
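
For reference (my restatement; the identity is only implicit in the snippet above), df_std_custom relies on the weighted-variance identity

$$ \mathrm{Var}_w(X) = \mathbb{E}_w[X^2] - \left(\mathbb{E}_w[X]\right)^2 $$

computed under the same exponential weighting, so after the fix the stock and custom curves should coincide.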

Issue #54333

Code

import numpy as np
import pandas as pd
from scipy import signal

def equal(win_len: int, **kwargs):
    assert win_len == 5
    return np.array([1.0, 1.0, 1.0, 1.0, 1.0])

def linear(win_len: int, **kwargs):
    assert win_len == 5
    return np.array([0.2, 0.4, 0.6, 0.8, 1.0])

signal.windows.equal_weight = equal
signal.windows.linear_decay = linear

s = pd.Series([1.0, 0.0] * 10)

print(list(s))
print(list(s.rolling(window=5).var(ddof=0)))
print(list(s.rolling(window=5, win_type="equal_weight").var(ddof=0)))
print(list(s.rolling(window=5, win_type="linear_decay").var(ddof=0)))

Output

[1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
[nan, nan, nan, nan, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24]
[nan, nan, nan, nan, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24]
[nan, nan, nan, nan, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005]
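
As a quick sanity check of the 0.24 values (my own arithmetic, not part of the PR):

import numpy as np

# Once full, the window alternates between [1, 0, 1, 0, 1] and
# [0, 1, 0, 1, 0]; both give the same weighted population variance.
v = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
for w in (np.ones(5), np.array([0.2, 0.4, 0.6, 0.8, 1.0])):
    mean = np.average(v, weights=w)              # 0.6 for both weightings
    var = np.sum(w * (v - mean) ** 2) / w.sum()  # ddof=0 normalization
    print(mean, var)                             # -> 0.6 0.24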

@ihsansecer requested a review from WillAyd as a code owner September 15, 2023 21:59
@mroeschke (Member)

Can you show if there is any performance difference before and after this change?

output = np.empty(n, dtype=np.float64)
t = np.zeros(n, dtype=np.float64)
Member (review comment on the diff above):

This adds a lot of memory overhead - is there a way to avoid creating all of these n-length arrays?

@ihsansecer (Contributor, Author) replied:

Ah yeah, it should be possible I think.

@ihsansecer (Contributor, Author) replied:

This is done

@ihsansecer (Contributor, Author) commented Sep 18, 2023

Can you show if there is any performance difference before and after this change?

@mroeschke there is a difference, but I'm not sure it matters since the previous implementation wasn't correct.

Change   Before [ebca6df]   After [a695dc4]   Ratio    Benchmark (Parameter)
+        2.18±0.02ms        764±4ms           351.17   rolling.WeightedVar.time_method(1000, 'boxcar')
+        2.25±0.02ms        768±4ms           341.83   rolling.WeightedVar.time_method(1000, 'triang')
+        2.24±0.03ms        73.6±0.3ms        32.83    rolling.WeightedVar.time_method(100, 'boxcar')
+        2.36±0.3ms         74.6±0.1ms        31.57    rolling.WeightedVar.time_method(100, 'triang')
+        2.96±0.9ms         5.09±0.06ms       1.72     rolling.WeightedVar.time_method(10, 'boxcar')
+        3.07±0.9ms         5.07±0.04ms       1.65     rolling.WeightedVar.time_method(10, 'triang')

import numpy as np
import pandas as pd


class WeightedVar:
    params = (
        [10, 100, 1000],
        ["triang", "boxcar"],
    )
    param_names = ["window", "win_type"]

    def setup(self, window, win_type):
        N = 10**5
        arr = np.random.random(N)
        obj = pd.Series(arr)
        self.window = obj.rolling(window, win_type=win_type)

    def time_method(self, window, win_type):
        self.window.var()

@jreback (Contributor) commented Sep 19, 2023

the new implementation is O^2
this is not good at all

@ihsansecer (Contributor, Author)

the new implementation is O^2 this is not good at all

@jreback I don't think there is a better solution. Let me explain what was wrong with the previous implementation and what this one does, using an example.

Let's say we have:

values = [0, 1, 2, 3]
weights = [10, 20]

We are supposed to update the sum of weights, the mean, and t each time a value-weight pair is added to (or removed from) the window, as per the paper. After each pair is added (or removed), we can calculate the weighted variance.
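
For reference, the per-pair updates from the paper (West, 1979, quoted later in this thread) look roughly like this; a minimal plain-Python sketch, where add, remove, and calculate mirror the pseudocode below, and calculate uses a simplified population (ddof=0) normalization rather than pandas' exact ddof handling:

def add(state, val, w):
    # Incorporate one value-weight pair into (sum_w, mean, t), where
    # t is the weighted sum of squared deviations from the mean.
    sum_w, mean, t = state
    sum_w += w
    delta = val - mean
    mean += (w / sum_w) * delta    # updated weighted mean
    t += w * delta * (val - mean)  # mixes old and new mean
    return sum_w, mean, t

def remove(state, val, w):
    # Exact inverse of add: remove(add(s, v, w), v, w) == s.
    sum_w, mean, t = state
    sum_w -= w
    delta = val - mean
    mean -= (w / sum_w) * delta
    t -= w * delta * (val - mean)
    return sum_w, mean, t

def calculate(state):
    sum_w, mean, t = state
    return t / sum_w  # population (ddof=0) weighted variance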

The previous implementation follows the steps below:

val, w = 0, 10
add(val, w)
output[0] = calculate()

val, w = 1, 20
add(val, w)
output[1] = calculate()

val, w = 2, 10
add(val, w)
val_pre, w_pre = 0, 10
remove(val_pre, w_pre)
output[2] = calculate()

val, w = 3, 20
add(val, w)
val_pre, w_pre = 1, 20
remove(val_pre, w_pre)
output[3] = calculate()

A single pass would be possible if adding the head pair (and removing the tail pair) were enough, but it is not: when the window slides, the values it keeps end up paired with different weights. In the example, value 1 is paired with weight 20 in the second window but with weight 10 in the third. So we first need to remove all of the previously added pairs (or start from scratch) and then add the pairs we actually want, as illustrated below.
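
Concretely, these are the value-weight pairs each output needs (my own illustration, consistent with the walkthrough that follows):

values = [0, 1, 2, 3]
weights = [10, 20]

for i in range(len(values)):
    window = values[max(0, i - len(weights) + 1): i + 1]
    pairs = list(zip(window, weights[len(weights) - len(window):]))
    print(f"output[{i}]: {pairs}")

# output[0]: [(0, 20)]
# output[1]: [(0, 10), (1, 20)]
# output[2]: [(1, 10), (2, 20)]
# output[3]: [(2, 10), (3, 20)]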

What the new implementation does is:

val, w = 0, 20
add(val, w)
output[0] = calculate()

reset_t_sumw_mean()
val, w = 0, 10
add(val, w)

val, w = 1, 20
add(val, w)
output[1] = calculate()

reset_t_sumw_mean()
val, w = 1, 10
add(val, w)

val, w = 2, 20
add(val, w)
output[2] = calculate()

reset_t_sumw_mean()
val, w = 2, 10
add(val, w)

val, w = 3, 20
add(val, w)
output[3] = calculate()
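
Generalizing that walkthrough, the fixed computation is equivalent to this plain-Python sketch (my own restatement for illustration, not the PR's Cython; weighted_rolling_var is a hypothetical name and the ddof scaling is my assumption about the normalization):

import numpy as np

def weighted_rolling_var(values, weights, min_periods=1, ddof=0):
    # O(w*n): every output recomputes its window from scratch,
    # tail-aligning the weights with the window's values.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n, w = len(values), len(weights)
    out = np.full(n, np.nan)
    for i in range(n):
        win_vals = values[max(0, i - w + 1): i + 1]
        win_wts = weights[w - len(win_vals):]  # tail alignment
        mask = ~np.isnan(win_vals)
        nobs = int(mask.sum())
        if nobs < min_periods or nobs <= ddof:
            continue
        v, wt = win_vals[mask], win_wts[mask]
        mean = np.average(v, weights=wt)
        t = np.sum(wt * (v - mean) ** 2)
        # ddof scaling here is an assumption about the convention used
        out[i] = (t / wt.sum()) * nobs / (nobs - ddof)
    return out

With the inputs from #54333, weighted_rolling_var([1.0, 0.0] * 10, [0.2, 0.4, 0.6, 0.8, 1.0], min_periods=5) reproduces the 0.24 values shown earlier (up to floating-point noise).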

@ihsansecer (Contributor, Author)

@jreback @WillAyd @mroeschke any other comments/concerns?

@mroeschke (Member)

Is there a way to fix this while keeping the current algorithm? The slowdown is concerning for more common cases where this was already correct

@ihsansecer (Contributor, Author)

Is there a way to fix this while keeping the current algorithm?

I don't think there is. I tried to explain what was wrong with it (and why it was faster) in the previous comment. It should have been incorrect almost all the time, because it was calculating variance for the wrong pairs of weights and values.

@WillAyd (Member) commented Sep 28, 2023

From the paper linked:

It is frequently convenient to calculate these quantities by a method that, unlike direct implementation of the above definitions, requires only one pass through the set of data pairs (X_i, W_i). Such a method is also useful where it is necessary to update (rather than recalculate ab initio) the mean and variance after the data set has been extended to include another data pair.

But isn't this algorithm requiring multiple passes instead of simply updating records as the window expands? The scaling and performance at 1000 rows makes this pretty impractical to use for usual datasets I think

@ihsansecer (Contributor, Author) commented Oct 1, 2023

The scaling and performance at 1000 rows makes this pretty impractical to use for usual datasets I think

@WillAyd It is not the performance at 1000 rows. The number of rows is held constant at 100,000; what varies is the length of the weights array (the window).

But isn't this algorithm requiring multiple passes instead of simply updating records as the window expands?

It still calculates each window's variance in one pass (over that window's value and weight pairs). The problem is that nothing here is really expanding: we have to recalculate the variance for every window because the value-weight pairs differ from window to window. I have an excalidraw board below showing how this works using the same example. Hope this makes sense.

[Excalidraw board: the values/weights example above, with each window's value-weight pairs rebuilt from scratch]

By the way, both the weighted rolling mean and sum already have the same complexity, O(wn), where w is the number of weights and n is the number of rows; see the nested loop below:

# Nested loop from the weighted rolling sum/mean implementation
# (indentation restored): the outer loop runs over the w weights,
# the inner over the n input values, giving O(wn) overall.
for win_i in range(win_n):
    val_win = weights[win_i]
    if val_win != val_win:  # skip NaN weights
        continue
    for in_i in range(in_n - (win_n - win_i) + 1):
        ...

github-actions bot commented Nov 1, 2023

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

github-actions bot added the Stale label Nov 1, 2023

@mroeschke (Member)

Thanks for the PR here, but it appears to have gone stale. I think the discussion on this PR is better suited for the original issue(s), so let's move the discussion there before proceeding with this.
