ewm NaN handling is wrong : simple example #31178
I agree that pandas is incorrect when computing ewm(adjust=False).mean(). However, I disagree with your example: pandas is actually correct in this case. What you are missing is normalising by the weight sum, which is 0.75, and 2.75 / 0.75 = 3.6666666666666665. However, if you add an additional entry to your array x, you get the problem:
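To spell that arithmetic out, here is a plain-Python check, independent of pandas, using the weights quoted from the documentation for [1, NaN, 5]:

```python
alpha = 0.5
x0, x2 = 1.0, 5.0  # the non-NaN entries of [1, NaN, 5]

# Documented adjust=False, ignore_na=False weights: (1-alpha)**2 for x0
# and alpha for x2; the NaN's slot keeps its position but gets no weight.
w0, w2 = (1 - alpha) ** 2, alpha
numerator = w0 * x0 + w2 * x2   # 2.75, the value computed by hand above
weight_sum = w0 + w2            # 0.75, not 1.0
print(numerator / weight_sum)   # 3.6666666666666665, matching pandas
```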
I ran into this problem today when trying to match my own implementation of exponentially weighted computations. As far as I can tell the issue is line 1821 in aggregations.pyx, which hardcodes the weight sum to 1.0 when adjust is False; but when ignore_na is also False and you skip some weights, this is no longer the case, as can be seen in the above example with weight sum 0.875.
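The 0.875 weight sum can be reproduced with a four-entry series. The sketch below uses assumed values (in particular the final entry 2.0, since the extended example is not preserved in this thread); the weight sum itself depends only on which positions are NaN, not on the values:

```python
from math import isnan

alpha = 0.5
# x[3] = 2.0 is an assumed value for illustration.
x = [1.0, float("nan"), 5.0, 2.0]
n = len(x)

# Absolute-position weights for adjust=False: the first slot gets
# (1-alpha)**(n-1), each later slot i gets alpha*(1-alpha)**(n-1-i);
# NaN slots keep their position but contribute no weight.
w = [(1 - alpha) ** (n - 1)] + [alpha * (1 - alpha) ** (n - 1 - i)
                                for i in range(1, n)]
pairs = [(wi, xi) for wi, xi in zip(w, x) if not isnan(xi)]
weight_sum = sum(wi for wi, _ in pairs)
direct = sum(wi * xi for wi, xi in pairs) / weight_sum

# Sketch of the aggregations.pyx recursion: the old weight is reset to
# 1.0 after every non-NaN observation, which is where the discrepancy
# comes from.
old_wt, avg = 1.0, x[0]
for cur in x[1:]:
    old_wt *= (1 - alpha)
    if not isnan(cur):
        avg = (old_wt * avg + alpha * cur) / (old_wt + alpha)
        old_wt = 1.0  # the hardcoded reset
print(weight_sum)  # 0.875
print(direct)      # 2.714285714... (normalised weighted average)
print(avg)         # 2.833333333... (what the recursion produces)
```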
The change was added in pull request #7896, and line 105 in test_moments_ewm (which was added as a test in the same pull request) clearly shows the weird weight this results in for the last entry. Maybe @seth-p can remember why this change was considered a bugfix? At least today, given the description in the documentation, I would consider this a surprising weighting scheme.
@adament, I think Pandas is "correct", or at least does what I think it should do. First let me say that I never liked the whole OK, that out of the way, recall that, absent
Now what should the behavior be when there are
So let's apply these rules to
This is precisely what I get from Pandas. If you think the behavior for
@seth-p Thank you for taking the time to respond and present the rationale for the current behaviour. I followed the user guide where it mentions that
and later in the same section
From this I would claim that I will try to find time next week to submit a pull request. But it is not clear to me whether the best course of action is to attempt to update the documentation or the computation.
@adament, yeah, that weight formula is for the non- |
Hello, Thanks a lot to @adament and @seth-p for taking the time to look at this issue.
This can be reformulated as:
by iterating the original definition and noticing that in the case of non-NA series ONLY we have The question is what one should do when there is a NaN value in the series like in the example
Once again we all agree so far. Now when
The big question is which value to use for If we use the previous output then we should consider that
Another possibility is to use the definition of the ewm mean as a weighted average of x :
Usually when all x values are non-NaN we have the weights sum is equal to 1 as said earlier. However here since
where we have assumed that we replace If we leave
So to sum up, from my original post we see that currently pandas returns 3.6666666666666665, so it uses the definition of the ewm as a weighted average (with zero weights on NaN values) even in the case of
@nhuth500, the values 3, 2.75, and 3.66666667 are all obtainable as follows:
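The code that accompanied this comment is not preserved in the thread, but the three values can be reconstructed in plain Python. Under an assumed mapping, they correspond to ignore_na=True, filling the NaN with 0, and the default ignore_na=False respectively:

```python
from math import isnan

alpha = 0.5
x = [1.0, float("nan"), 5.0]

# 3.0: ignore_na=True -- the NaN is skipped entirely and the recursion
# y <- (1-alpha)*y + alpha*x runs over the remaining values.
y = x[0]
for cur in x[1:]:
    if not isnan(cur):
        y = (1 - alpha) * y + alpha * cur

# 2.75: replace the NaN with 0 first, then run the plain recursion.
y0 = x[0]
for cur in x[1:]:
    y0 = (1 - alpha) * y0 + alpha * (0.0 if isnan(cur) else cur)

# 3.666...: ignore_na=False -- absolute-position weights, renormalised
# by the weight actually observed (0.75 here, not 1.0).
w = [(1 - alpha) ** 2, alpha * (1 - alpha), alpha]  # weights by position
num = sum(wi * xi for wi, xi in zip(w, x) if not isnan(xi))
den = sum(wi for wi, xi in zip(w, x) if not isnan(xi))

print(y)          # 3.0
print(y0)         # 2.75
print(num / den)  # 3.6666666666666665
```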
Note that unless the user explicitly wants to replace
Overall it still seems to me that the current implementation is the most logically consistent one for
@seth-p if we consider that we have to use the weighted average formula instead of the recursive formula in the case of a series with nan values, then I agree with @adament that in the example
By the way if you set y[3] = (w[0] * x[3] + w[1] * x[2] + w[2] * x[1] + w[3] * x[0]) / (w[0] + w[1] + w[2] + w[3])
@nhuth500, you write, "if we consider that we have to use the weighted average formula instead of the recursive formula in the case of a series with nan values...". But we don't. The whole conceptual distinction between Put another way, feel free to suggest another set of rules for the By the way, the term I agree that the documentation does not explicitly discuss the case
OK, here's how I would describe the When The above is as concise as I could describe things. For expository purposes, it may be better to proceed as follows. When |
Yes, I agree the source of the confusion is me insisting on interpreting Thank you very much for your patience @seth-p.
@seth-p the doc you suggest sounds good, as it's very clear on what's done in the current implementation of |
Small comment: even though reasonable, the current behaviour when
@nhuth500, The whole point of @adament, I'm afraid I cannot contribute a PR. Feel free to contribute one, adapting what I wrote here as you see fit.
@seth-p: I agree that's life, and ignore_na somehow handles that, but in a rather inflexible way. We could introduce an integer parameter that resets the output to x only after a given number (set by this new parameter) of consecutive NaNs accumulate.
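A minimal sketch of such a parameter, assuming adjust=False. The name max_consecutive_nans and the reset-to-x behaviour are hypothetical, and this sketch simply skips NaNs between resets rather than reproducing pandas's ignore_na weighting:

```python
from math import isnan

def ewm_with_reset(xs, alpha, max_consecutive_nans):
    """adjust=False-style EWM that restarts after a long NaN run.

    max_consecutive_nans is a hypothetical parameter: once more than
    this many consecutive NaNs accumulate, the next observation
    restarts the average at x itself instead of updating the stale one.
    """
    out, avg, run = [], None, 0
    for cur in xs:
        if isnan(cur):
            run += 1
            out.append(avg)  # carry the last output (None if leading NaNs)
        else:
            if avg is None or run > max_consecutive_nans:
                avg = cur    # reset: output restarts at x
            else:
                avg = (1 - alpha) * avg + alpha * cur
            run = 0
            out.append(avg)
    return out

nan = float("nan")
print(ewm_with_reset([1.0, nan, nan, 5.0], 0.5, 1))  # [1.0, 1.0, 1.0, 5.0]
print(ewm_with_reset([1.0, nan, nan, 5.0], 0.5, 2))  # [1.0, 1.0, 1.0, 3.0]
```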
@nhuth500, that strikes me as oddly discontinuous, but I suppose it's an option. |
@seth-p my point is that with |
@nhuth500, note that if you're going to implement that functionality for the |
@seth-p yes I will make a new issue to add this parameter to ewm. Thanks again. |
Hello,
It seems that the handling of NaN values returns bad values in this simple case (I'm using pandas 0.25.3):
```python
import numpy as np
import pandas as pd

x = np.array([1, np.NaN, 5])
alpha = 0.5
x_ewm = pd.Series(x).ewm(alpha=alpha, adjust=False).mean()
x_ewm_simple = alpha * x[2] + ((1 - alpha) ** 2) * x[0]
print(x_ewm[2])
print(x_ewm_simple)
```

which prints:

```
3.6666666666666665
2.75
```
According to pandas documentation https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ewm.html
When ignore_na is False (default), weights are based on absolute positions. For example, the weights of x and y used in calculating the final weighted average of [x, None, y] are (1-alpha)**2 and 1 (if adjust is True), and (1-alpha)**2 and alpha (if adjust is False).
Can you explain how 3.6666666666666665 is computed? Thank you.