
Bug when computing rolling_mean with extreme value #11645


Closed
julienvienne opened this issue Nov 19, 2015 · 6 comments · Fixed by #36433
Labels
API Design Docs Numeric Operations Arithmetic, Comparison, and Logical operations
Comments

@julienvienne

Hello,
Please consider the following code:

import pandas as pd
import numpy as ny

dates = pd.date_range("2015-01-01", periods=10, freq="D")
ts = pd.TimeSeries(data=range(10), index=dates, dtype=ny.float64)
ts_mean = pd.rolling_mean(ts, 5)
print(ts) 
2015-01-01    0
2015-01-02    1
2015-01-03    2
2015-01-04    3
2015-01-05    4
2015-01-06    5
2015-01-07    6
2015-01-08    7
2015-01-09    8
2015-01-10    9
Freq: D, dtype: float64

print(ts_mean)
2015-01-01   NaN
2015-01-02   NaN
2015-01-03   NaN
2015-01-04   NaN
2015-01-05     2
2015-01-06     3
2015-01-07     4
2015-01-08     5
2015-01-09     6
2015-01-10     7
Freq: D, dtype: float64

For the last date (2015-01-10), you obtain 7, which is the mean of [5, 6, 7, 8, 9].
Now replace the 2015-01-03 value with the extreme value -9e+33.

dates = pd.date_range("2015-01-01", periods=10, freq="D")
ts = pd.TimeSeries(data=range(10), index=dates, dtype=ny.float64)
ts[2] = -9e+33
print(ts)
2015-01-01    0.000000e+00
2015-01-02    1.000000e+00
2015-01-03   -9.000000e+33
2015-01-04    3.000000e+00
2015-01-05    4.000000e+00
2015-01-06    5.000000e+00
2015-01-07    6.000000e+00
2015-01-08    7.000000e+00
2015-01-09    8.000000e+00
2015-01-10    9.000000e+00
Freq: D, dtype: float64

And compute rolling_mean again:

ts_mean = pd.rolling_mean(ts, 5)
print(ts_mean)
2015-01-01             NaN
2015-01-02             NaN
2015-01-03             NaN
2015-01-04             NaN
2015-01-05   -1.800000e+33
2015-01-06   -1.800000e+33
2015-01-07   -1.800000e+33
2015-01-08    0.000000e+00
2015-01-09    1.000000e+00
2015-01-10    2.000000e+00
Freq: D, dtype: float64

As you can see, from 2015-01-08 onward the computation returns incorrect results, i.e. [0, 1, 2] instead of [5, 6, 7]. The extreme value has perturbed the computation for the following dates.
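For readers on current pandas: `pd.TimeSeries` and `pd.rolling_mean` have since been removed, but the report can be reproduced with the `Series.rolling` API. A sketch (the exact tail values depend on your pandas version):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2015-01-01", periods=10, freq="D")
ts = pd.Series(np.arange(10, dtype=np.float64), index=dates)
ts.iloc[2] = -9e33  # inject the extreme value

# Modern spelling of pd.rolling_mean(ts, 5)
ts_mean = ts.rolling(window=5).mean()
print(ts_mean)
# The first four entries are NaN; the windows containing -9e33 average
# to about -1.8e33. Whether the last three windows recover [5, 6, 7]
# depends on the pandas version (see the fix referenced above).
```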

Best regards,

@bashtage
Contributor

This is not a bug but a feature of floating point math. The efficient rolling mean makes use of a rolling sum. Having numbers that differ in magnitude by more than 1/np.finfo(np.double).eps results in truncation. So when you add the big number in, you effectively lose all information in the small numbers, and when that number is finally removed, the rolling sum retains nothing about the small numbers, so it is as if they were 0.
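The mechanism is easy to see outside pandas. Here is a sketch of an O(1)-per-step rolling sum (my own illustration, not pandas's actual code) applied to the data above:

```python
import numpy as np

# Naive O(1)-per-step rolling sum: add the entering value, subtract the
# leaving one. The data mirror the report, with -9e33 at index 2.
values = np.array([0.0, 1.0, -9e33, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
window = 5

running = values[:window].sum()
means = [float(running) / window]
for i in range(window, len(values)):
    # Adding 3, 4, 5, ... into a total of magnitude 9e33 truncates them
    # away, so when -9e33 finally leaves, the small values are gone.
    running += values[i] - values[i - window]
    means.append(float(running) / window)

print(means)  # [-1.8e+33, -1.8e+33, -1.8e+33, 0.0, 1.0, 2.0]
```

This reproduces the reported output exactly: once the extreme value leaves the window, the means restart from 0 as if the earlier small values had never existed.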

@julienvienne
Author

Thank you for your quick answer. I take your point.
However, don't you think the user could be warned about such behavior? Data magnitude could be tested before computation, and a warning raised when extreme values are detected.
In my case -9e+33 was an outlier I had not filtered out beforehand. The good data were small values, and the result was obviously wrong...

Best regards

@bashtage
Contributor

I don't think it is really possible to warn about numeric limits without substantially affecting performance. For example:

x = np.array([2e17]) ** 2 + 1 - np.array([2e17]) ** 2

x is clearly 1 to a human, but is 0 when evaluated.

Also

np.array([2e17]) ** 2 - np.array([2e17]) ** 2 + 1
np.array([2e17]) ** 2 + 1 - np.array([2e17]) ** 2 

should be the same but they aren't, and numpy doesn't provide any warning. I think it is a lot to ask a library to protect the end user from numerical limits.
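Running those two expressions confirms it; both evaluate silently (the variable names here are just my own labels for the two orderings):

```python
import numpy as np

a = np.array([2e17])
cancel_first = a ** 2 - a ** 2 + 1   # the huge terms cancel, then +1 survives
add_one_first = a ** 2 + 1 - a ** 2  # the +1 is truncated into 4e34 and lost

print(cancel_first, add_one_first)  # [1.] [0.]
```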

@kawochen
Contributor

I think it would be fair to add a note in the doc about the implementation. In this example, a user may not know that previous values affect later values even when the window no longer contains those values. The same goes for other algorithms, and info about time/space complexity can be useful too.

@jreback jreback added Numeric Operations Arithmetic, Comparison, and Logical operations API Design labels Nov 19, 2015
@julienvienne
Author

I agree; I did try to find an explanation in the docs before running tests on my own. Some implementation details would have helped.
Thanks for your answers.
Regards,

@jreback jreback added the Docs label Nov 20, 2015
@jreback jreback added this to the Next Major Release milestone Nov 20, 2015
@jreback
Contributor

jreback commented Nov 20, 2015

OK, how about we add this to the docs? @julienvienne, up for a pull request?

Note that #11603 will be merged shortly, so write the docs against the new structure (the content is the same as the original, but the old layout is going to be deprecated, so work on the new files).
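For reference, the fix that eventually closed this issue (#36433) addressed exactly this failure mode with compensated summation. A minimal Neumaier-style sketch of the idea (my own illustration, not pandas's actual implementation):

```python
def rolling_mean_compensated(values, window):
    """Rolling mean with a Neumaier-style compensation term that keeps
    the low-order bits a plain running sum would truncate away."""
    total = 0.0
    comp = 0.0  # accumulated low-order error

    def update(x):
        nonlocal total, comp
        t = total + x
        if abs(total) >= abs(x):
            comp += (total - t) + x  # low bits of x were lost
        else:
            comp += (x - t) + total  # low bits of total were lost
        total = t

    means = []
    for i, x in enumerate(values):
        update(x)
        if i >= window:
            update(-values[i - window])  # evict the leaving value
        if i >= window - 1:
            means.append((total + comp) / window)
    return means

data = [0.0, 1.0, -9e33, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(rolling_mean_compensated(data, 5))
# The small values survive in the compensation term, so the last three
# means come out as 5.0, 6.0, 7.0 rather than 0.0, 1.0, 2.0.
```

The compensation term carries the small values across the window that contains -9e33, so they reappear once the extreme value is evicted.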
