
Bug when computing rolling_mean with extreme value #11645


Closed
julienvienne opened this issue Nov 19, 2015 · 6 comments · Fixed by #36433
Labels
API Design Docs Numeric Operations Arithmetic, Comparison, and Logical operations
Comments

@julienvienne

Hello,
Please consider the following code:

import pandas as pd
import numpy as ny

dates = pd.date_range("2015-01-01", periods=10, freq="D")
ts = pd.TimeSeries(data=range(10), index=dates, dtype=ny.float64)
ts_mean = pd.rolling_mean(ts, 5)
print(ts) 
2015-01-01    0
2015-01-02    1
2015-01-03    2
2015-01-04    3
2015-01-05    4
2015-01-06    5
2015-01-07    6
2015-01-08    7
2015-01-09    8
2015-01-10    9
Freq: D, dtype: float64

print(ts_mean)
2015-01-01   NaN
2015-01-02   NaN
2015-01-03   NaN
2015-01-04   NaN
2015-01-05     2
2015-01-06     3
2015-01-07     4
2015-01-08     5
2015-01-09     6
2015-01-10     7
Freq: D, dtype: float64

For the last date (2015-01-10), you obtain 7, which is the mean of [5, 6, 7, 8, 9].
Now replace the 2015-01-03 value with the extreme value -9e+33.

dates = pd.date_range("2015-01-01", periods=10, freq="D")
ts = pd.TimeSeries(data=range(10), index=dates, dtype=ny.float64)
ts[2] = -9e+33
print(ts)
2015-01-01    0.000000e+00
2015-01-02    1.000000e+00
2015-01-03   -9.000000e+33
2015-01-04    3.000000e+00
2015-01-05    4.000000e+00
2015-01-06    5.000000e+00
2015-01-07    6.000000e+00
2015-01-08    7.000000e+00
2015-01-09    8.000000e+00
2015-01-10    9.000000e+00
Freq: D, dtype: float64

And compute rolling_mean again:

ts_mean = pd.rolling_mean(ts, 5)
print(ts_mean)
2015-01-01             NaN
2015-01-02             NaN
2015-01-03             NaN
2015-01-04             NaN
2015-01-05   -1.800000e+33
2015-01-06   -1.800000e+33
2015-01-07   -1.800000e+33
2015-01-08    0.000000e+00
2015-01-09    1.000000e+00
2015-01-10    2.000000e+00
Freq: D, dtype: float64

As you can see, from 2015-01-08 onward the computation returns incorrect results, i.e. [0, 1, 2] instead of [5, 6, 7]. The extreme value has perturbed the computation for the following dates.
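For readers on current pandas: `pd.TimeSeries` and `pd.rolling_mean` have since been removed, but the report can be reproduced with the `Series.rolling` API. A sketch (the exact tail values depend on your pandas version):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2015-01-01", periods=10, freq="D")
ts = pd.Series(np.arange(10, dtype=np.float64), index=dates)
ts.iloc[2] = -9e33  # inject the extreme value

# Modern spelling of pd.rolling_mean(ts, 5)
ts_mean = ts.rolling(window=5).mean()
print(ts_mean)
# The first four entries are NaN; the windows containing -9e33 average
# to about -1.8e33. Whether the last three windows recover [5, 6, 7]
# depends on the pandas version (see the fix referenced above).
```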

Best regards,

@bashtage
Contributor

This is not a bug but a feature of floating point math. The efficient rolling mean makes use of a rolling sum. Having numbers that differ in magnitude by more than 1/np.finfo(np.double).eps results in truncation. So when you add the big number in, you effectively lose all information in the small numbers, and when that number is finally removed, the rolling sum retains nothing about the small numbers, so it is as if they were 0.
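The mechanism is easy to see outside pandas. Here is a sketch of an O(1)-per-step rolling sum (my own illustration, not pandas's actual code) applied to the data above:

```python
import numpy as np

# Naive O(1)-per-step rolling sum: add the entering value, subtract the
# leaving one. The data mirror the report, with -9e33 at index 2.
values = np.array([0.0, 1.0, -9e33, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
window = 5

running = values[:window].sum()
means = [float(running) / window]
for i in range(window, len(values)):
    # Adding 3, 4, 5, ... into a total of magnitude 9e33 truncates them
    # away, so when -9e33 finally leaves, the small values are gone.
    running += values[i] - values[i - window]
    means.append(float(running) / window)

print(means)  # [-1.8e+33, -1.8e+33, -1.8e+33, 0.0, 1.0, 2.0]
```

This reproduces the reported output exactly: once the extreme value leaves the window, the means restart from 0 as if the earlier small values had never existed.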

@julienvienne
Author

Thank you for your quick answer. I take your point.
However, don't you think the user could be warned about such behavior? Data magnitude could be tested before computation, and a warning raised when extreme values are detected.
In my case -9e+33 was an outlier I had not filtered out beforehand. The good data were small values, and the result was obviously wrong...

Best regards

@bashtage
Contributor

I don't think it is really possible to warn about numeric limits without substantially affecting performance. For example:

x = np.array([2e17]) ** 2 + 1 - np.array([2e17]) ** 2

x is clearly 1 to a human, but is 0 when evaluated.

Also

np.array([2e17]) ** 2 - np.array([2e17]) ** 2 + 1
np.array([2e17]) ** 2 + 1 - np.array([2e17]) ** 2 

should be the same but they aren't, and numpy doesn't provide any warning. I think it is a lot to ask a library to protect the end user from numerical limits.
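Running those two expressions confirms it; both evaluate silently (the variable names here are just my own labels for the two orderings):

```python
import numpy as np

a = np.array([2e17])
cancel_first = a ** 2 - a ** 2 + 1   # the huge terms cancel, then +1 survives
add_one_first = a ** 2 + 1 - a ** 2  # the +1 is truncated into 4e34 and lost

print(cancel_first, add_one_first)  # [1.] [0.]
```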

@kawochen
Contributor

I think it would be fair to add a note in the doc about the implementation. In this example, a user may not know that previous values affect later values even when the window no longer contains those values. The same goes for other algorithms, and info about time/space complexity can be useful too.

@jreback jreback added Numeric Operations Arithmetic, Comparison, and Logical operations API Design labels Nov 19, 2015
@julienvienne
Author

I agree; I did try to find an explanation in the docs before running tests on my own. Some implementation details would have helped.
Thanks for your answers.
Regards,

@jreback jreback added the Docs label Nov 20, 2015
@jreback jreback added this to the Next Major Release milestone Nov 20, 2015
@jreback
Contributor

jreback commented Nov 20, 2015

OK, how about we add this to the docs? @julienvienne, up for a pull request?

Note that #11603 will be merged shortly, so write the docs against the new structure (the content is the same as the original, but the old layout is going to be deprecated, so work on the new files).
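For reference, the fix that eventually closed this issue (#36433) addressed exactly this failure mode with compensated summation. A minimal Neumaier-style sketch of the idea (my own illustration, not pandas's actual implementation):

```python
def rolling_mean_compensated(values, window):
    """Rolling mean with a Neumaier-style compensation term that keeps
    the low-order bits a plain running sum would truncate away."""
    total = 0.0
    comp = 0.0  # accumulated low-order error

    def update(x):
        nonlocal total, comp
        t = total + x
        if abs(total) >= abs(x):
            comp += (total - t) + x  # low bits of x were lost
        else:
            comp += (x - t) + total  # low bits of total were lost
        total = t

    means = []
    for i, x in enumerate(values):
        update(x)
        if i >= window:
            update(-values[i - window])  # evict the leaving value
        if i >= window - 1:
            means.append((total + comp) / window)
    return means

data = [0.0, 1.0, -9e33, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(rolling_mean_compensated(data, 5))
# The small values survive in the compensation term, so the last three
# means come out as 5.0, 6.0, 7.0 rather than 0.0, 1.0, 2.0.
```

The compensation term carries the small values across the window that contains -9e33, so they reappear once the extreme value is evicted.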
