BUG: Fix weighted rolling variance implementation #55153
Conversation
Can you show if there is any performance difference before and after this change?
pandas/_libs/window/aggregations.pyx (Outdated)

```python
output = np.empty(n, dtype=np.float64)
t = np.zeros(n, dtype=np.float64)
```
This adds a lot of memory overhead - is there a way to avoid creating all of these n-length arrays?
Ah yeah, it should be possible I think.
This is done
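For illustration, here is a minimal sketch of the memory pattern the review asks for: per-window statistics kept in scalars, so the only n-length allocation left is the output itself. This is a hypothetical plain-Python helper using a Welford update, not the actual Cython change:

```python
import numpy as np

def rolling_var_scalar_state(values, window, ddof=1):
    """Hypothetical sketch: no n-length scratch arrays such as
    t = np.zeros(n); the running statistics are plain scalars
    that get reset for every window."""
    n = len(values)
    output = np.full(n, np.nan)          # the one unavoidable O(n) array
    for i in range(window - 1, n):
        mean = t = 0.0                   # scalar state replaces the t array
        count = 0
        for val in values[i - window + 1 : i + 1]:
            count += 1
            delta = val - mean
            mean += delta / count        # Welford's running mean
            t += delta * (val - mean)    # Welford's running M2
        if count > ddof:
            output[i] = t / (count - ddof)
    return output
```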
@mroeschke there is a difference, but I'm not sure it matters since the previous implementation wasn't correct:
```python
import numpy as np
import pandas as pd

class WeightedVar:
    params = (
        [10, 100, 1000],
        ["triang", "boxcar"],
    )
    param_names = ["window", "win_type"]

    def setup(self, window, win_type):
        N = 10**5
        arr = np.random.random(N)
        obj = pd.Series(arr)
        self.window = obj.rolling(window, win_type=win_type)

    def time_method(self, window, win_type):
        self.window.var()
```
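Hypothetically, with this class added next to the existing window benchmarks (e.g. `asv_bench/benchmarks/rolling.py` in the pandas repo), something like `asv continuous -f 1.1 upstream/main HEAD -b WeightedVar` would produce the before/after comparison requested above.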
The new implementation is O(n^2).
@jreback I don't think there is a better solution. Let me explain what is wrong with the previous implementation and what this implementation does, using an example. Let's say we have:

```python
values = [0, 1, 2, 3]
weights = [10, 20]
```

As per the paper, we are supposed to update the sum of weights, the mean, and t each time a value/weight pair is added to (or removed from) the window. After each pair is added or removed, we can calculate the weighted variance. The previous implementation follows the steps below (`val` is the value being paired with weight `w`):

```
val, w = 0, 10
add(val, w)
output[0] = calculate()

val, w = 1, 20
add(val, w)
output[1] = calculate()

val, w = 2, 10
add(val, w)
val_pre, w_pre = 0, 10
remove(val_pre, w_pre)
output[2] = calculate()

val, w = 3, 20
add(val, w)
val_pre, w_pre = 1, 20
remove(val_pre, w_pre)
output[3] = calculate()
```

A single pass would be possible if adding the head pair (or removing the tail pair) were enough, but it is not. Since all of the pairings change from window to window, we first need to remove all the pairs that were previously added (or start from scratch) and only then add the pairs we want. What the new implementation does is:

```
val, w = 0, 20
add(val, w)
output[0] = calculate()

reset_t_sumw_mean()
val, w = 0, 10
add(val, w)
val, w = 1, 20
add(val, w)
output[1] = calculate()

reset_t_sumw_mean()
val, w = 1, 10
add(val, w)
val, w = 2, 20
add(val, w)
output[2] = calculate()

reset_t_sumw_mean()
val, w = 2, 10
add(val, w)
val, w = 3, 20
add(val, w)
output[3] = calculate()
```
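To make the walkthrough concrete, here is a minimal NumPy sketch of the recompute-per-window approach described above. The function name, the ddof handling, and the weight alignment are illustrative choices based on this discussion, not the PR's Cython code; the incremental updates of `sum_w`, `mean`, and `t` follow West's algorithm from the paper:

```python
import numpy as np

def weighted_rolling_var(values, weights, ddof=0):
    """Sketch of O(w * n) weighted rolling variance: every window
    restarts from a clean state (the reset_t_sumw_mean() step)."""
    n, w = len(values), len(weights)
    output = np.full(n, np.nan)
    for i in range(n):
        sum_w = mean = t = 0.0                 # reset_t_sumw_mean()
        win_len = min(i + 1, w)
        for j in range(win_len):
            val = values[i - win_len + 1 + j]  # oldest -> newest
            wt = weights[w - win_len + j]      # last weight pairs with newest value
            sum_w_new = sum_w + wt
            delta = val - mean
            mean += (wt / sum_w_new) * delta   # West's incremental mean
            t += wt * delta * (val - mean)     # West's incremental t
            sum_w = sum_w_new
        if sum_w > ddof:
            output[i] = t / (sum_w - ddof)
    return output
```

On the example above, `weighted_rolling_var([0, 1, 2, 3], [10, 20])` pairs value 0 with weight 20 in the first window and (0, 10), (1, 20) in the second, matching the step-by-step trace.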
@jreback @WillAyd @mroeschke any other comments/concerns?
Is there a way to fix this while keeping the current algorithm? The slowdown is concerning for more common cases where this was already correct.
I don't think there is. I tried to explain what was wrong with it (and why it was faster) in the previous comment. I think it was incorrect almost all the time because it calculated variance for the wrong pairs of weights and values.
From the paper linked:
But isn't this algorithm requiring multiple passes instead of simply updating records as the window expands? The scaling and performance at 1000 rows make this pretty impractical for usual datasets, I think.
@WillAyd It is not the performance at 1000 rows. The number of rows is constant at 100,000; the varying parameter is the length of the weights array.

It still calculates variance in one pass (over the same value and weight pairs). The problem is that nothing here is really expanding. We have to recalculate the variance for each window because the pairs are different in each window. I have an Excalidraw board below showing how this works using the same example. Hope this makes sense. By the way, both weighted rolling mean and sum have the same complexity, O(w * n), where w is the number of weights and n is the number of rows:

pandas/pandas/_libs/window/aggregations.pyx, lines 1495 to 1500 in 6f0cd8d
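For comparison, here is an illustrative Python sketch (not the linked Cython) of why weighted rolling sum has the same O(w * n) shape: each of the n outputs does O(w) work against the full weight vector.

```python
import numpy as np

def weighted_rolling_sum(values, weights):
    """Sketch: n - w + 1 outputs, each an O(w) dot product."""
    n, w = len(values), len(weights)
    output = np.full(n, np.nan)
    for i in range(w - 1, n):                       # one pass over the rows...
        window = values[i - w + 1 : i + 1]
        output[i] = float(np.dot(window, weights))  # ...with O(w) work per row
    return output
```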
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Thanks for the PR here, but it appears to have gone stale. I think the discussion on this PR is better suited for the original issue(s), so let's move the discussion there before proceeding with this.
Outputs from the code snippets in the given issues are below.
Issue #53273
Code
Output
Issue #54333
Code
Output