ewm NaN handling is wrong : simple example #31178

Open

nhuth500 opened this issue Jan 21, 2020 · 19 comments
Labels
Docs Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Window rolling, ewma, expanding

Comments

@nhuth500

Hello,

It seems that the handling of NaN values returns incorrect results in this simple case (I'm using pandas 0.25.3):

import numpy as np
import pandas as pd
x = np.array([1, np.NaN, 5])
alpha = 0.5
x_ewm = pd.Series(x).ewm(alpha=alpha, adjust=False).mean()
x_ewm_simple = alpha * x[2] + ((1 - alpha) **2) * x[0]
print(x_ewm[2])
print(x_ewm_simple)

3.6666666666666665
2.75

According to the pandas documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ewm.html):

When ignore_na is False (default), weights are based on absolute positions. For example, the weights of x and y used in calculating the final weighted average of [x, None, y] are (1-alpha)**2 and 1 (if adjust is True), and (1-alpha)**2 and alpha (if adjust is False).

Can you explain how 3.6666666666666665 is computed? Thank you.

@jbrockmendel jbrockmendel added the Window rolling, ewma, expanding label Feb 25, 2020
@adament

adament commented May 1, 2020

I agree that pandas is incorrect when computing ewm(adjust=False).mean(). However, I disagree with your example: pandas is actually correct in this case. What you are missing is normalising by the weight sum, which is 0.75, and 2.75 / 0.75 = 3.6666666666666665. However, if you add an additional entry to your array x, the problem appears:

import numpy as np
import pandas as pd
alpha = 0.5
x = np.array([1, float('nan'), 5, 3])
print(pd.Series(x).ewm(alpha=alpha, adjust=False).mean().iloc[-1]) # 3.333333

num = x[0] * (1 - alpha)**3 + x[2] * alpha * (1 - alpha) + x[3] * alpha # 2.875
denom = (1 - alpha)**3 + alpha * (1 - alpha) + alpha # 0.875
print(num / denom) # 3.2857142857142856

I ran into this problem today when trying to match my own implementation of exponentially weighted computations. As far as I can tell, the issue is line 1821 in aggregations.pyx, which hardcodes the weight sum to 1.0 when adjust is False; but when ignore_na is also False and some weights are skipped, this is no longer the case, as can be seen in the above example with weight sum 0.875.
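
For reference, here is a minimal sketch of the renormalised weighted-average interpretation I would expect from the user-guide formula (ewm_weighted is a hypothetical helper, not pandas code; NaN weights are zeroed and the remaining weights renormalised):

import numpy as np

def ewm_weighted(x, alpha):
    out = np.full(len(x), np.nan)
    for t in range(len(x)):
        # w[i] applies to x[t - i]: alpha*(1-alpha)**i for i < t, (1-alpha)**t for i == t
        w = np.array([alpha * (1 - alpha) ** i for i in range(t)] + [(1 - alpha) ** t])
        vals = x[t::-1]                      # x[t], x[t-1], ..., x[0]
        mask = ~np.isnan(vals)               # drop NaN terms from numerator and denominator
        out[t] = np.dot(w[mask], vals[mask]) / w[mask].sum()
    return out

print(ewm_weighted(np.array([1.0, float('nan'), 5.0, 3.0]), 0.5))
# [1.         1.         3.66666667 3.28571429]  <- last value differs from pandas' 3.33333333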

@adament

adament commented May 1, 2020

The change was added in pull request #7896, and line 105 in test_moments_ewm (added as a test in the same pull request) clearly shows the odd weight this produces for the last entry. Maybe @seth-p can remember why this change was considered a bugfix? At least today, given the description in the documentation, I would consider this a surprising weighting scheme.

@seth-p
Contributor

seth-p commented May 3, 2020

@adament, I think Pandas is "correct", or at least does what I think it should do. First, let me say that I never liked the whole adjust=False business, and think it probably should never be used -- I just kept it because it was there. So when I introduced the concept of ignore_na=False (i.e. absolute position matters; previously the behavior was always equivalent to the current ignore_na=True), I had to pick how it would work with adjust=False. (The desired behavior of ignore_na=False with adjust=True is pretty obvious.) Note that the documentation actually does not mention the case where adjust=False and ignore_na=False.

OK, that out of the way, recall that, absent NaNs, the behavior with adjust=False is as follows:

y[0] = x[0]
y[t] = (1-alpha)*y[t-1] + alpha*x[t]

Now what should the behavior be when there are NaNs?
Here's what I came up with:

  • If x[t] is NaN, then y[t] = y[t-1]. This is probably uncontroversial (though I suppose one could prefer to set y[t] = NaN).
  • If x[t] is not NaN, then y[t] = ((1-alpha)^(t-u)*y[u] + alpha*x[t]) / ((1-alpha)^(t-u) + alpha), where u < t is the index of the most recent previous non-NaN x[u] value.

So let's apply these rules to x = [1., NaN, 5., 3.]:

y[0] = x[0] = 1.
y[1] = y[0] = 1.
y[2] = ((1-0.5)^(2-0)*y[0] + 0.5*x[2]) / ((1-0.5)^(2-0) + 0.5) = (0.25*1. + 0.5*5.) / (0.25 + 0.5) = 3.66666...
y[3] = ((1-0.5)^(3-2)*y[2] + 0.5*x[3]) / ((1-0.5)^(3-2) + 0.5) = (0.5*3.66666... + 0.5*3.) / (0.5 + 0.5) = 3.33333...

This is precisely what I get from Pandas.
Note that the idea here (as opposed to adjust=True) is that we just define y[t] in terms of x[t] and y[u], where u < t is the most recent index with non-NaN x[u], not in terms of the full history of x[<=t]. Given such a constraint, I think the rules I came up with are about as reasonable as one can get.
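
To make the recursion concrete, here is a minimal sketch implementing the two rules above (ewm_recursive is a hypothetical helper, not pandas code; it assumes x[0] is not NaN):

import numpy as np
import pandas as pd

def ewm_recursive(x, alpha):
    y = np.full(len(x), np.nan)
    y[0] = x[0]                              # assumes x[0] is not NaN
    u = 0                                    # index of the most recent non-NaN x
    for t in range(1, len(x)):
        if np.isnan(x[t]):
            y[t] = y[t - 1]                  # rule 1: carry the previous output forward
        else:
            w = (1 - alpha) ** (t - u)       # rule 2: decayed weight on the last output
            y[t] = (w * y[t - 1] + alpha * x[t]) / (w + alpha)
            u = t
    return y

x = np.array([1.0, np.nan, 5.0, 3.0])
print(ewm_recursive(x, 0.5))                                    # [1. 1. 3.66666667 3.33333333]
print(pd.Series(x).ewm(alpha=0.5, adjust=False).mean().values)  # same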

If you think the behavior for ignore_na=False and adjust=False should be different, feel free to submit a PR. I honestly would never use this case, so don't really care. But I do think the current behavior is reasonable.

@adament

adament commented May 3, 2020

@seth-p Thank you for taking the time to respond and present the rationale for the current behaviour. I followed the user guide where it mentions that adjust=False is given by the weights

w_i = \begin{cases}
\alpha (1 - \alpha)^i, & \text{for } i < t \\
(1 - \alpha)^t, & \text{for } i = t
\end{cases}

and later in the same section ignore_na=False is described as:

When ignore_na=False (the default), weights are calculated based on absolute positions, so that intermediate null values affect the result.

From this I would claim that ignore_na=False is specified for any weighted computation, and given the above weights also for adjust=False. However I do agree that it is weird to use these settings in combination.

I will try to find time next week to submit a pull request. But it is not clear to me whether the best course of action is to attempt to update the documentation or the computation.

@seth-p
Contributor

seth-p commented May 3, 2020

@adament, yeah, that weight formula is for the non-NaN case. I agree that the documentation is at best silent and at worst misleading as to the adjust=False and ignore_na=False case.

@nhuth500
Author

nhuth500 commented May 3, 2020

Hello,

Thanks a lot to @adament and @seth-p for taking the time to look at this issue.
If we come back to the original definition of ewm when adjust=False from the detailed documentation (https://pandas.pydata.org/pandas-docs/stable/user_guide/computation.html#exponentially-weighted-windows), the ewm.mean(adjust=False) output is:

y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
y[0] = x[0]

This can be reformulated as:

y[t] = sum_{i=0...t} (x[t-i] * w[i]) / sum_{i=0...t} w[i]
w[i] = alpha * (1 - alpha)^i for i < t
w[i] = (1 - alpha)^i for i = t

by iterating the original definition and noticing that, for series without NaNs ONLY, we have sum_{i=0...t} w[i] = 1. I think we all agree on this plain-vanilla case.
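
As a quick sanity check (a sketch, with alpha = 0.5 and t = 3 chosen for illustration), the adjust=False weights do sum to 1 in the NaN-free case:

alpha, t = 0.5, 3
w = [alpha * (1 - alpha) ** i for i in range(t)] + [(1 - alpha) ** t]
print(sum(w))  # 1.0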

The question is what one should do when there is a NaN value in the series like in the example [1, NaN, 5].
Let's do it step by step on our example with ignore_na=False and alpha=0.5:

y[0] = x[0] = 1
y[1] = (1 - alpha) * y[0] + alpha * x[1] = y[0] = 1 since x[1]=NaN

Once again we all agree so far. Now when t=2 we have the following:

y[2] = (1 - alpha) * y[1] + alpha * x[2] = (1 - alpha) * y[1] + alpha * 5

The big question is which value to use for y[1].

If we use the previous output then we should consider that y[1]=1, hence we have:

y[2] = 0.5 * 1 + 0.5 * 5 = 3.

Another possibility is to use the definition of the ewm mean as a weighted average of x :

y[2] = (w[0]x[2] + w[1]x[1] + w[2]x[0]) / (w[0] + w[1] + w[2])
w[0] = alpha = 0.5
w[1] = alpha * (1 - alpha) = 0.25
w[2] = (1 - alpha)^2 = 0.25

Usually, when all x values are non-NaN, the weight sum equals 1, as said earlier. However, here, since x[1] is NaN, we have:

y[2] = (0.5 * 5 + 0 + 0.25 * 1) / (0.5 + 0 + 0.25) = 3.6666666666666665

where we have assumed that we replace w[1]=0.25 by w[1]=0.

If we leave w[1]=0.25, then the sum of weights is 1 and:

y[2] = 0.5 * 5 + 0 + 0.25 * 1 = 2.75

So to sum up: from my original post we see that pandas currently returns 3.6666666666666665, so it uses the definition of the ewm as a weighted average (with zero weights on NaN values) even in the case of adjust=False. However, the documentation suggests that the recursive formula is used when adjust=False. When the series has no NaN values, both definitions coincide; when NaNs arise, they differ, and the output is no longer consistent with the recursive formula. So what do you think the "genuine" output should be in this case: 3, 2.75 or 3.6666666666666665?

@seth-p
Contributor

seth-p commented May 3, 2020

@nhuth500, the values 3, 2.75, and 3.66666667 are all obtainable as follows:

pd.Series([1., np.nan, 5., 3.]).ewm(alpha=0.5, adjust=False).mean().values
array([1., 1., 3.66666667, 3.33333333])  # 3.66666667

pd.Series([1., np.nan, 5., 3.]).fillna(0).ewm(alpha=0.5, adjust=False).mean().values
array([1., 0.5, 2.75, 2.875])  # 2.75

pd.Series([1., np.nan, 5., 3.]).ewm(alpha=0.5, adjust=False, ignore_na=True).mean().values
array([1., 1., 3., 3.])  # 3.

Note that unless the user explicitly wants to replace NaNs with zeros (which they can easily do as above!), the 2.75 value doesn't really make sense, as can be seen by considering the following, where arguably all (non-NaN) result values should be 1.

pd.Series([1., np.nan, 1., 1.]).fillna(0).ewm(alpha=0.5, adjust=False).mean().values
array([1., 0.5, 0.75, 0.875])

Overall it still seems to me that the current implementation is the most logically consistent one for adjust=False, ignore_na=False.

@mroeschke mroeschke added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels May 3, 2020
@nhuth500
Author

nhuth500 commented May 3, 2020

@seth-p if we consider that we have to use the weighted-average formula instead of the recursive formula in the case of a series with NaN values, then I agree with @adament that in the example x=[1, NaN, 5, 3] we have:

y[3] = (w[0] * x[3] + w[1] * x[2] + w[2] * x[1] + w[3] * x[0]) / (w[0] + w[1] + w[2] + w[3])
y[3] = (alpha * 3 + alpha * (1-alpha) * 5 + (1-alpha)**3 * 1) / (alpha + alpha * (1 - alpha) + (1 - alpha)**3)
y[3] = 3.2857142857142856

By the way if you set adjust=True on the same example you end up with:

y[3] = (w[0] * x[3] + w[1] * x[2] + w[2] * x[1] + w[3] * x[0]) / (w[0] + w[1] + w[2] + w[3])
y[3] = (1 * x[3] + (1 - alpha) * x[2] + (1 - alpha)**3 * x[0]) / (1 + (1 - alpha) + (1 - alpha)**3)
y[3] = 3.4615384615384617

which is pd.Series(x).ewm(alpha=alpha, adjust=True).mean().values[3].
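
For completeness, here is a runnable check of that adjust=True computation (a sketch, assuming alpha = 0.5 and x = [1, NaN, 5, 3]):

import numpy as np
import pandas as pd

alpha = 0.5
x = np.array([1.0, np.nan, 5.0, 3.0])
# absolute-position weights with the NaN term dropped, then normalised
manual = (1 * x[3] + (1 - alpha) * x[2] + (1 - alpha) ** 3 * x[0]) / \
         (1 + (1 - alpha) + (1 - alpha) ** 3)
print(manual)                                                       # 3.4615384615384617
print(pd.Series(x).ewm(alpha=alpha, adjust=True).mean().values[3])  # 3.4615384615384617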

Hence I still think there is an issue with adjust=False when the input contains NaN values. I can't find any formula in the documentation (either as a weighted average or as a recursive formula) with which the current behaviour is consistent.

@seth-p
Contributor

seth-p commented May 3, 2020

@nhuth500, you write, "if we consider that we have to use the weighted average formula instead of the recursive formula in the case of a series with nan values...". But we don't. The whole conceptual distinction between adjust=True and adjust=False (which is not at all clear from the name "adjust", but which I tried to make in the documentation) is that the former is defined using a weighted-average formula (where the weights of x[0], ..., x[t] may or may not depend on the presence of NaNs, depending on ignore_na), whereas the latter is defined using a recursive formula (where again the weights of y[t-1] and x[t] may or may not depend on the presence of NaNs, depending on ignore_na). I think one should keep this conceptual distinction in mind when deciding how the code should handle NaNs and ignore_na.

Put another way: feel free to suggest another set of rules for the adjust=False and ignore_na=False case, but I feel reasonably strongly that it should be defined in terms of a rule for y[t] as a weighted sum of y[t-1] and x[t], where the weights depend on the presence of NaNs in x[0], ..., x[t].

By the way, the term adjust predates me, and has to do with the manner in which the code was originally implemented (i.e. back then the code first calculated the adjust=False sums recursively, and then, if adjust=True, "adjusted" the weight of the first term). I find this name completely uninformative. If I were starting from scratch I would use a recursive (= not adjust) flag instead.

I agree that the documentation does not discuss explicitly the case adjust=False and ignore_na=False. I can't submit a PR these days, but I can outline here how I would document things to clarify the current behavior. I'll try to do this later today or tomorrow.

@seth-p
Contributor

seth-p commented May 4, 2020

OK, here's how I would describe the adjust=False behavior. This could be formatted better for the docs.

When adjust=False, y[] is defined recursively.
Without loss of generality, assume x[0] is not NaN.
First, we set y[0] = x[0].
Then, for t = 1, 2, ... we set y[t] = y[t-1] if x[t] is NaN, otherwise we set y[t] = ((1-alpha)^(t-u) * y[t-1] + alpha * x[t]) / ((1-alpha)^(t-u) + alpha), where (i) if ignore_na=True, then u = t-1 (so t-u = 1); and (ii) if ignore_na=False, then u is the largest index <t for which x[u] is not NaN. Note that in the absence of NaNs, t-u = 1 regardless of whether ignore_na is True or False.

The above is as concise as I could describe things. For expository purposes, it may be better to proceed as follows.

When adjust=False, y[] is defined recursively.
First, suppose x[] has no NaNs. Then we define:
y[0] = x[0]; and
y[t] = (1-alpha) * y[t-1] + alpha * x[t] for t = 1, 2, ....
Now suppose, more generally, that x[] does contain NaNs. Without loss of generality, suppose x[0] is not NaN. As above, we start out with y[0] = x[0], and proceed to calculate y[t] recursively in terms of y[t-1] and x[t].
If x[t] is NaN, we set y[t] = y[t-1].
If x[t] is not NaN, there are two possible formulas, depending on the value of ignore_na:
if ignore_na=True, we set y[t] = (1-alpha) * y[t-1] + alpha * x[t], as above; but if ignore_na=False, we set y[t] = ((1-alpha)^(t-u) * y[t-1] + alpha * x[t]) / ((1-alpha)^(t-u) + alpha), where u<t is the largest index for which x[u] is not NaN.
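
A quick numerical check of both branches (a sketch, using alpha = 0.5 on the example series from this thread):

import numpy as np
import pandas as pd

x = pd.Series([1.0, np.nan, 5.0, 3.0])
print(x.ewm(alpha=0.5, adjust=False, ignore_na=True).mean().values)
# [1. 1. 3. 3.]  <- t-u treated as 1: y[2] = 0.5*1 + 0.5*5
print(x.ewm(alpha=0.5, adjust=False, ignore_na=False).mean().values)
# [1. 1. 3.66666667 3.33333333]  <- y[2] = (0.25*1 + 0.5*5) / (0.25 + 0.5)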

@adament

adament commented May 4, 2020

Yes, I agree: the source of the confusion is me insisting on interpreting adjust=False as a weighted average with weights that in some way look exponential, hence why in the ignore_na=False case I end up considering the current implementation weird. Do you want me to attempt to write up your definition, or would you prefer doing it yourself?

Thank you very much for your patience @seth-p.

@nhuth500
Author

nhuth500 commented May 4, 2020

@seth-p the doc you suggest sounds good, as it's very clear on what's done in the current implementation of adjust=False, ignore_na=False. Also I agree that weighting the last non-NaN value of y with weight (1-alpha)^(t-u) is very reasonable. Thanks again!

@nhuth500
Author

nhuth500 commented May 4, 2020

Small comment: even though reasonable, the current behaviour when adjust=False, ignore_na=False might still be up for debate. It can lead to discontinuous outputs when a series has a long run of consecutive NaNs. In this case the output is forward-filled with the last valid output value, and when the NaN streak ends, most of the weight is on x[t], since t-u is large. This can lead to a large jump from y[t-u] to x[t]. That would be dampened if we stuck to the original definition of the recursive formula and returned (1-alpha)*y[t-1] + alpha*x[t].
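
To illustrate the jump (a sketch, with alpha = 0.5 and five consecutive NaNs):

import numpy as np
import pandas as pd

x = pd.Series([1.0] + [np.nan] * 5 + [5.0])
print(x.ewm(alpha=0.5, adjust=False, ignore_na=False).mean().values[-1])
# 4.878787878787879  <- the stale y gets weight 0.5**6 / (0.5**6 + 0.5), about 3%
print(x.ewm(alpha=0.5, adjust=False, ignore_na=True).mean().values[-1])
# 3.0  <- the recursion proceeds as if the NaNs were absent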

@seth-p
Contributor

seth-p commented May 4, 2020

@nhuth500, the whole point of ignore_na=False is that the more NaNs precede the current x[t], the more heavily y[t] is weighted towards x[t]. That's life -- or maybe I should say, that's the data. Your proposal of always using (1-alpha)*y[t-1] + alpha*x[t] is simply the ignore_na=True case.

@adament, I'm afraid I cannot contribute a PR. Feel free to contribute one, adapting what I wrote here as you see fit.

@nhuth500
Author

nhuth500 commented May 4, 2020

@seth-p: I agree that's life, and ignore_na somehow handles that, but in a rather inflexible way. We could introduce an integer parameter that resets the output to x only after a given number (set by this new parameter) of consecutive NaNs accumulates.

@seth-p
Contributor

seth-p commented May 4, 2020

@nhuth500, that strikes me as oddly discontinuous, but I suppose it's an option.

@nhuth500
Author

nhuth500 commented May 5, 2020

@seth-p my point is that with ignore_na=False, you end up with unwanted discontinuities that can be relatively sizeable even after 1 or 2 missing observations, since (1-alpha)^(t-u) decays very quickly with t-u. For "small" NaN streaks like this, which can happen quite often, you might want to use ignore_na=True, which will weight y[t-u] and x[t] as if nothing happened. However, it's true that after a significant NaN run, say t-u=5, i.e. one week in the case of business-day observations, one might consider y[t-5] a bit too outdated and want to decrease its weight in some way, probably as ignore_na=False would do it.
To sum up, as a user I would find useful that sort of hybrid between ignore_na=True/False, with an integer parameter dictating how many NaNs have to accumulate before switching to ignore_na=False. In most cases (one or two consecutive NaNs) it would erase the artificial discontinuities created by ignore_na=False, but for long missing-observation periods it would weigh down y[t-u] (of course, in that case one cannot avoid a discontinuity).

@seth-p
Contributor

seth-p commented May 5, 2020

@nhuth500, note that if you're going to implement that functionality for the adjust=False case, you should probably do it for adjust=True as well. As I mentioned above, it doesn't really appeal to me, but I'm not really involved with Pandas anymore, so that doesn't matter. If you want to pursue it, I'd suggest making a new issue with the proposed new behaviour (presumably backwards-compatible).

@nhuth500
Author

nhuth500 commented May 6, 2020

@seth-p yes I will make a new issue to add this parameter to ewm. Thanks again.

@mroeschke mroeschke added Docs and removed Bug labels Oct 5, 2020