Unexpected behavior for rolling moments when center=True and min_periods < window #8269


Closed
stahlous opened this issue Sep 14, 2014 · 14 comments · Fixed by #8275
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations

@stahlous (Contributor)

I noticed while using the rolling moment functions with center=True and min_periods < window that values at the end of the series were showing up as NaN when they should have been finite floating point values. The reason is that _rolling_moment() concatenates extra NaN values to the end of the input series when center=True, so that roll_generic() can continue past the end of the original series and compute the values that will "come within view" when the result is centered. However, those NaN values land inside the trailing windows, which can corrupt the calculation and turn a result that should be finite into NaN.

This example illustrates the behavior:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series

n = 12
s = Series(np.random.randn(n))
s.plot(color='b')
win = 7
minp = 5
pd.rolling_mean(s, win, min_periods=minp, center=False).plot(color='g')
pd.rolling_mean(s, win, min_periods=minp, center=True).plot(color='r')
ticks = plt.xticks(np.arange(0, n, 1.0))

[plot: the input series in blue, the center=False rolling mean in green, and the center=True rolling mean in red; the red (centered) line drops to NaN near the end of the series]

I was able to mitigate this behavior, for one calculation at least, by modifying the Cython function to strip the trailing NaN values from the windows computed toward the end of the series. This requires passing in the offset calculated in _rolling_moment().

Fixing this is going to touch a lot of functions that depend on _rolling_moment(). Before I go much further with this, I thought I'd raise the issue here and see if anyone else has insight or better ideas about how to fix this.
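To see the failure mode in isolation, here is a small standalone example (the window values are made up, not taken from the issue): a single NaN inside a window poisons np.mean, while the NaN-aware variant ignores it.

```python
import numpy as np

# A trailing window as roll_generic() would see it after _rolling_moment()
# pads the series: the last two slots are the appended NaNs.
window = np.array([0.4, -1.2, 0.7, np.nan, np.nan])

print(np.mean(window))     # nan -- the padding poisons the result
print(np.nanmean(window))  # finite -- the padding is ignored
```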

@stahlous (Contributor, Author)

@seth-p I believe you added the additional NaN values feature to _rolling_moment(). Would you mind taking a look at this?

@jreback (Contributor) commented Sep 14, 2014

@stahlous always include the output of pd.show_versions()

@jreback jreback added the Numeric Operations Arithmetic, Comparison, and Logical operations label Sep 14, 2014
@seth-p (Contributor) commented Sep 14, 2014

Will try to take a look later today or tomorrow.


@stahlous (Contributor, Author)

Ack. I think I might be misleading you by conflating 0.14.1 with the dev version. However, the problem does stand for rolling_apply(), where the aggregating function will (likely) be naive about the presence of trailing NaN values.

My apologies. I first noticed this with rolling_apply() and then I started working with the latest stable and thought I had discovered that the problem was more general.

I'll re-state the issue this evening when I am able to get in front of my dev box again.

@seth-p (Contributor) commented Sep 14, 2014

I think this has been addressed in master (for 0.15.0) in #7934. Let me know if you still see a problem in master.

@seth-p (Contributor) commented Sep 14, 2014

I will try to check tonight/tomorrow whether rolling_apply still needs a fix.

@stahlous (Contributor, Author)

Sorry for the confusion. I did confirm that rolling_mean() does work as expected in the dev version. I should've stuck to what I knew rather than digging further and confusing things. That said, rolling_apply() does suffer from the problem I described:

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(12))
win = 7
minp = 5
print(pd.rolling_apply(s, win, np.mean, min_periods=minp, center=False))
print(pd.rolling_apply(s, win, np.mean, min_periods=minp, center=True))

Outputs:

0          NaN
1          NaN
2          NaN
3          NaN
4     0.321198
5     0.593657
6     0.566094
7     0.481168
8     0.404456
9     0.212539
10    0.229766
11    0.086256
dtype: float64

0          NaN
1     0.321198
2     0.593657
3     0.566094
4     0.481168
5     0.404456
6     0.212539
7     0.229766
8     0.086256
9          NaN
10         NaN
11         NaN
dtype: float64

pd.show_versions()

Outputs:

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1-430-g35a9527
nose: None
Cython: 0.21
numpy: 1.9.0
scipy: 0.14.0
statsmodels: None
IPython: 2.2.0
sphinx: None
patsy: None
scikits.timeseries: None
dateutil: 2.1
pytz: 2014.7
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

I was able to hack together a patch for this in roll_generic(), but it requires passing in offset from _rolling_moment(), which would break its API for all the other functions using it.

n = len(input)

# Initial (possibly partial) windows: slice directly from the input.
for i in range(int_min(win, n)):
    if counts[i] >= minp:
        output[i] = func(input[int_max(i - win + 1, 0) : i + 1], *args,
                         **kwargs)
    else:
        output[i] = NaN

# Full windows: advance the buffer pointer, stopping `offset` short of the
# end so the appended NaN padding never enters a window this way.
for i in range(win, n - offset):
    buf = buf + 1
    bufarr.data = <char*> buf
    if counts[i] >= minp:
        output[i] = func(bufarr, *args, **kwargs)
    else:
        output[i] = NaN

# Trailing windows: slice from the input again, capping each slice at
# n - offset (the end of the unpadded data) so the appended NaNs are excluded.
if offset > 0:
    for j, i in enumerate(range(n - offset, n)):
        if counts[i] >= minp:
            output[i] = func(input[int_max(i - win + 1, 0) : i - j],
                             *args, **kwargs)
        else:
            output[i] = NaN

The easier fix would be to have rolling_apply() perform its own call to the Cython function rather than going through _rolling_moment(), but that's not very DRY. The alternatives I can think of would involve changing the _rolling_moment() API, which might be more trouble than it's worth.
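For reference, here is a plain-Python sketch (my own illustration, not the pandas implementation) of the padding-free behavior I'd expect from a centered rolling apply: windows at both edges are simply truncated to the series bounds, and min_periods is checked against the count of valid observations actually in view.

```python
import numpy as np
import pandas as pd

def centered_rolling_apply(s, win, func, min_periods):
    """Centered rolling apply that truncates edge windows instead of
    padding with NaN (unoptimized reference sketch)."""
    offset = (win - 1) // 2
    out = []
    for i in range(len(s)):
        lo = max(0, i - offset)             # window start, clipped at 0
        hi = min(len(s), i + win - offset)  # window end, clipped at len(s)
        window = s.values[lo:hi]
        valid = np.count_nonzero(~np.isnan(window))
        out.append(func(window) if valid >= min_periods else np.nan)
    return pd.Series(out, index=s.index)
```

With win=7 and min_periods=5 on a 12-element series, this yields finite values everywhere except indices 0 and 11, where only 4 observations are in view.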

@seth-p (Contributor) commented Sep 15, 2014

Ok, I will take a look and fix.


@seth-p (Contributor) commented Sep 15, 2014

OK, I see what the problem is. When center=True, all the rolling_*() functions pad the end with NaN values, so that a window of length window centered on the last entry will be "full". The problem is that np.mean will return NaN if it encounters any NaN values in the input. You won't have this problem if you use np.nanmean instead. (None of the rolling_*() functions besides rolling_apply have a problem, because they don't rely on the numpy functions that return NaN whenever they encounter a NaN.)

The current behavior is asymmetric: no NaN values are prepended at the beginning of the series, so rolling_apply(..., np.mean, center=True) does calculate values at the start before there is a full window, and it should behave the same way at the end. So I guess my implementation of #7934 to fix #7925 wasn't ideal.

Will try to come up with a different implementation that doesn't have this issue.

@seth-p (Contributor) commented Sep 15, 2014

I actually have a bit of a philosophical question about this. For simplicity, let's concentrate on the start rather than the end of the array. Consider:

import numpy as np
import pandas as pd

pd.rolling_apply(pd.Series([5., 6., 7., 8., 9.]), 3,
                 lambda v: np.nansum([(i+1)*x for i, x in enumerate(v)]),
                 min_periods=1, center=True)

It's not obvious to me what we would want the first entry of the result to be. I can see two perfectly plausible possibilities:
(a) np.nansum([1*5., 2*6.]) = 17.0 -- this is the current behavior
(b) np.nansum([1*NaN, 2*5., 3*6.]) = 28.0

Currently (in master) the start of the array is treated as (a), but the end of the array is treated as (b) (with NaN padding the end rather than the beginning of the sub-array, of course). Presumably they should be consistent, but to be honest it's not obvious to me which is "right".

Note that it only makes a difference (a) for functions that don't ignore NaN (like your example, np.mean), and (b) for functions that are sensitive to the absolute position of the arguments, like my example here (or, for example, if one tried to implement a rolling EWMA).

I guess I'm slightly inclined to keep behavior (a) (and make it the behavior at the end as well) for backwards-compatibility reasons, but I'm not sure it's obvious...
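Worked through concretely for the first entry of the example above, the two conventions give:

```python
import numpy as np

f = lambda v: np.nansum([(i + 1) * x for i, x in enumerate(v)])

# (a) truncate: the first centered window holds only the first two values
print(f(np.array([5., 6.])))          # 1*5 + 2*6 = 17.0

# (b) pad: a leading NaN keeps each value at its absolute window position
print(f(np.array([np.nan, 5., 6.])))  # nansum drops 1*NaN: 2*5 + 3*6 = 28.0
```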

@seth-p (Contributor) commented Sep 15, 2014

CCing @nbdanielson, since he raised #7925.

@stahlous (Contributor, Author)

I think option (a) is the more robust choice, but padding with NaNs is the more efficient choice for most use cases. I think the answer is to treat rolling_apply() as a special case and implement a solution that doesn't involve padding with NaNs. Otherwise all the rolling moment functions would have to be edited to handle the end of the series differently.

@seth-p (Contributor) commented Sep 15, 2014

Yes, I am working on a slightly different implementation for rolling_apply(). I just want to make sure no one objects too vehemently to behavior (a).

@nbdanielson

I do see the dilemma here. My first thought is that (b) may be the more intuitive behavior, but I can't actually think of a filter where I would prefer it. So yes, I suppose (a) is a decent solution. This would be worth mentioning in the API documentation as well.

@jreback jreback added the Bug label Sep 16, 2014
@jreback jreback added this to the 0.15.0 milestone Sep 16, 2014