
BUG: Fix weighted rolling variance implementation #55153


Closed

Conversation

@ihsansecer (Contributor) commented Sep 15, 2023


Outputs from the code snippets in the linked issues are shown below.

Issue #53273

Code

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.Series(np.random.normal(size=1200))
df_std_stock = (
    df.rolling(window=220, min_periods=100, win_type='exponential')
    .std(tau=-(25 / np.log(2)), center=0, sym=False)
)

df_std_custom = (
    (df ** 2).rolling(window=220, min_periods=100, win_type='exponential')
    .mean(tau=-(25 / np.log(2)), center=0, sym=False)
)
df_std_custom_mean = (
    df.rolling(window=220, min_periods=100, win_type='exponential')
    .mean(tau=-(25 / np.log(2)), center=0, sym=False)
)
df_std_custom = (df_std_custom - df_std_custom_mean ** 2).pow(0.5)

plt.plot(df_std_stock.index, df_std_stock.values, label="df_std_stock", linewidth=3)
plt.plot(df_std_custom.index, df_std_custom.values, label="df_std_custom", linestyle='--')
plt.legend()
plt.show()

Output

[Plot: df_std_stock (solid) and df_std_custom (dashed) over the series index]
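
For reference (my restatement; the identity is only implicit in the snippet above), df_std_custom relies on the weighted-variance identity

$$ \mathrm{Var}_w(X) = \mathbb{E}_w[X^2] - \left(\mathbb{E}_w[X]\right)^2 $$

computed under the same exponential weighting, so after the fix the stock and custom curves should coincide.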

Issue #54333

Code

import numpy as np
import pandas as pd
from scipy import signal

def equal(win_len: int, **kwargs):
    assert win_len == 5
    return np.array([1.0, 1.0, 1.0, 1.0, 1.0])

def linear(win_len: int, **kwargs):
    assert win_len == 5
    return np.array([0.2, 0.4, 0.6, 0.8, 1.0])

signal.windows.equal_weight = equal
signal.windows.linear_decay = linear

s = pd.Series([1.0, 0.0] * 10)

print(list(s))
print(list(s.rolling(window=5).var(ddof=0)))
print(list(s.rolling(window=5, win_type="equal_weight").var(ddof=0)))
print(list(s.rolling(window=5, win_type="linear_decay").var(ddof=0)))

Output

[1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
[nan, nan, nan, nan, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24]
[nan, nan, nan, nan, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24]
[nan, nan, nan, nan, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005, 0.24000000000000005]
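
As a quick sanity check of the 0.24 values (my own arithmetic, not part of the PR):

import numpy as np

# Once full, the window alternates between [1, 0, 1, 0, 1] and
# [0, 1, 0, 1, 0]; both give the same weighted population variance.
v = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
for w in (np.ones(5), np.array([0.2, 0.4, 0.6, 0.8, 1.0])):
    mean = np.average(v, weights=w)              # 0.6 for both weightings
    var = np.sum(w * (v - mean) ** 2) / w.sum()  # ddof=0 normalization
    print(mean, var)                             # -> 0.6 0.24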

@ihsansecer requested a review from WillAyd as a code owner September 15, 2023 21:59
@mroeschke (Member)

Can you show if there is any performance difference before and after this change?

output = np.empty(n, dtype=np.float64)
t = np.zeros(n, dtype=np.float64)
Member (review comment on the diff above):

This adds a lot of memory overhead - is there a way to avoid creating all of these n-length arrays?

@ihsansecer (Contributor, Author) replied:

Ah yeah, it should be possible I think.

@ihsansecer (Contributor, Author) replied:

This is done

@ihsansecer (Contributor, Author) commented Sep 18, 2023

Can you show if there is any performance difference before and after this change?

@mroeschke there is a difference, but I'm not sure it matters since the previous implementation wasn't correct.

Change   Before [ebca6df]   After [a695dc4]   Ratio    Benchmark (Parameter)
+        2.18±0.02ms        764±4ms           351.17   rolling.WeightedVar.time_method(1000, 'boxcar')
+        2.25±0.02ms        768±4ms           341.83   rolling.WeightedVar.time_method(1000, 'triang')
+        2.24±0.03ms        73.6±0.3ms        32.83    rolling.WeightedVar.time_method(100, 'boxcar')
+        2.36±0.3ms         74.6±0.1ms        31.57    rolling.WeightedVar.time_method(100, 'triang')
+        2.96±0.9ms         5.09±0.06ms       1.72     rolling.WeightedVar.time_method(10, 'boxcar')
+        3.07±0.9ms         5.07±0.04ms       1.65     rolling.WeightedVar.time_method(10, 'triang')

import numpy as np
import pandas as pd


class WeightedVar:
    params = (
        [10, 100, 1000],
        ["triang", "boxcar"],
    )
    param_names = ["window", "win_type"]

    def setup(self, window, win_type):
        N = 10**5
        arr = np.random.random(N)
        obj = pd.Series(arr)
        self.window = obj.rolling(window, win_type=win_type)

    def time_method(self, window, win_type):
        self.window.var()

@jreback (Contributor) commented Sep 19, 2023

the new implementation is O^2
this is not good at all

@ihsansecer (Contributor, Author)

the new implementation is O^2 this is not good at all

@jreback I don't think there is a better solution. Let me explain what was wrong with the previous implementation and what this one does, using an example.

Let's say we have:

values = [0, 1, 2, 3]
weights = [10, 20]

We are supposed to update the sum of weights, the mean, and t each time a value-weight pair is added to (or removed from) the window, as per the paper. After each pair is added (or removed), we can calculate the weighted variance.
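
For reference, the per-pair updates from the paper (West, 1979, quoted later in this thread) look roughly like this; a minimal plain-Python sketch, where add, remove, and calculate mirror the pseudocode below, and calculate uses a simplified population (ddof=0) normalization rather than pandas' exact ddof handling:

def add(state, val, w):
    # Incorporate one value-weight pair into (sum_w, mean, t), where
    # t is the weighted sum of squared deviations from the mean.
    sum_w, mean, t = state
    sum_w += w
    delta = val - mean
    mean += (w / sum_w) * delta    # updated weighted mean
    t += w * delta * (val - mean)  # mixes old and new mean
    return sum_w, mean, t

def remove(state, val, w):
    # Exact inverse of add: remove(add(s, v, w), v, w) == s.
    sum_w, mean, t = state
    sum_w -= w
    delta = val - mean
    mean -= (w / sum_w) * delta
    t -= w * delta * (val - mean)
    return sum_w, mean, t

def calculate(state):
    sum_w, mean, t = state
    return t / sum_w  # population (ddof=0) weighted variance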

The previous implementation follows the steps below:

val, w = 0, 10
add(val, w)
output[0] = calculate()

val, w = 1, 20
add(val, w)
output[1] = calculate()

val, w = 2, 10
add(val, w)
val_pre, w_pre = 0, 10
remove(val_pre, w_pre)
output[2] = calculate()

val, w = 3, 20
add(val, w)
val_pre, w_pre = 1, 20
remove(val_pre, w_pre)
output[3] = calculate()

A single pass would be possible if adding the head pair (and removing the tail pair) were enough, but it is not: when the window slides, the values it keeps end up paired with different weights. In the example, value 1 is paired with weight 20 in the second window but with weight 10 in the third. So we first need to remove all of the previously added pairs (or start from scratch) and then add the pairs we actually want, as illustrated below.
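
Concretely, these are the value-weight pairs each output needs (my own illustration, consistent with the walkthrough that follows):

values = [0, 1, 2, 3]
weights = [10, 20]

for i in range(len(values)):
    window = values[max(0, i - len(weights) + 1): i + 1]
    pairs = list(zip(window, weights[len(weights) - len(window):]))
    print(f"output[{i}]: {pairs}")

# output[0]: [(0, 20)]
# output[1]: [(0, 10), (1, 20)]
# output[2]: [(1, 10), (2, 20)]
# output[3]: [(2, 10), (3, 20)]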

What the new implementation does is:

val, w = 0, 20
add(val, w)
output[0] = calculate()

reset_t_sumw_mean()
val, w = 0, 10
add(val, w)

val, w = 1, 20
add(val, w)
output[1] = calculate()

reset_t_sumw_mean()
val, w = 1, 10
add(val, w)

val, w = 2, 20
add(val, w)
output[2] = calculate()

reset_t_sumw_mean()
val, w = 2, 10
add(val, w)

val, w = 3, 20
add(val, w)
output[3] = calculate()
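
Generalizing that walkthrough, the fixed computation is equivalent to this plain-Python sketch (my own restatement for illustration, not the PR's Cython; weighted_rolling_var is a hypothetical name and the ddof scaling is my assumption about the normalization):

import numpy as np

def weighted_rolling_var(values, weights, min_periods=1, ddof=0):
    # O(w*n): every output recomputes its window from scratch,
    # tail-aligning the weights with the window's values.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n, w = len(values), len(weights)
    out = np.full(n, np.nan)
    for i in range(n):
        win_vals = values[max(0, i - w + 1): i + 1]
        win_wts = weights[w - len(win_vals):]  # tail alignment
        mask = ~np.isnan(win_vals)
        nobs = int(mask.sum())
        if nobs < min_periods or nobs <= ddof:
            continue
        v, wt = win_vals[mask], win_wts[mask]
        mean = np.average(v, weights=wt)
        t = np.sum(wt * (v - mean) ** 2)
        # ddof scaling here is an assumption about the convention used
        out[i] = (t / wt.sum()) * nobs / (nobs - ddof)
    return out

With the inputs from #54333, weighted_rolling_var([1.0, 0.0] * 10, [0.2, 0.4, 0.6, 0.8, 1.0], min_periods=5) reproduces the 0.24 values shown earlier (up to floating-point noise).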

@ihsansecer (Contributor, Author)

@jreback @WillAyd @mroeschke any other comments/concerns?

@mroeschke (Member)

Is there a way to fix this while keeping the current algorithm? The slowdown is concerning for more common cases where this was already correct

@ihsansecer (Contributor, Author)

Is there a way to fix this while keeping the current algorithm?

I don't think there is. I tried to explain what was wrong with it (and why it was faster) in the previous comment. It should have been incorrect almost all the time, because it was calculating variance for the wrong pairs of weights and values.

@WillAyd (Member) commented Sep 28, 2023

From the paper linked:

It is frequently convenient to calculate these quantities by a method that, unlike direct implementation of the above definitions, requires only one pass through the set of data pairs (X_i, W_i). Such a method is also useful where it is necessary to update (rather than recalculate ab initio) the mean and variance after the data set has been extended to include another data pair.

But isn't this algorithm requiring multiple passes instead of simply updating records as the window expands? The scaling and performance at 1000 rows makes this pretty impractical to use for usual datasets I think

@ihsansecer (Contributor, Author) commented Oct 1, 2023

The scaling and performance at 1000 rows makes this pretty impractical to use for usual datasets I think

@WillAyd It is not the performance at 1000 rows. The number of rows is held constant at 100,000; what varies is the length of the weights array (the window).

But isn't this algorithm requiring multiple passes instead of simply updating records as the window expands?

It still calculates each window's variance in one pass (over that window's value and weight pairs). The problem is that nothing here is really expanding: we have to recalculate the variance for every window because the value-weight pairs differ from window to window. I have an excalidraw board below showing how this works using the same example. Hope this makes sense.

[Excalidraw board: the values/weights example above, with each window's value-weight pairs rebuilt from scratch]

By the way, both the weighted rolling mean and sum already have the same complexity, O(wn), where w is the number of weights and n is the number of rows; see the nested loop below:

# Nested loop from the weighted rolling sum/mean implementation
# (indentation restored): the outer loop runs over the w weights,
# the inner over the n input values, giving O(wn) overall.
for win_i in range(win_n):
    val_win = weights[win_i]
    if val_win != val_win:  # skip NaN weights
        continue
    for in_i in range(in_n - (win_n - win_i) + 1):
        ...

github-actions bot commented Nov 1, 2023

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

github-actions bot added the Stale label Nov 1, 2023

@mroeschke (Member)

Thanks for the PR here, but it appears to have gone stale. I think the discussion on this PR is better suited for the original issue(s), so let's move the discussion there before proceeding with this.
