
Bug in rolling_* functions when combining center and min_periods arguments #7925


Closed
nbdanielson opened this issue Aug 4, 2014 · 17 comments · Fixed by #7934
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Numeric Operations Arithmetic, Comparison, and Logical operations
Comments

@nbdanielson

Hello,
I have found incorrect behavior in the pd.rolling_* functions when the 'center' and 'min_periods' arguments are combined. I believe this affects all of the rolling_* functions. Here is an example using rolling_mean:

When using only the 'center' argument, the behavior works as expected:

In [5]: df = pd.DataFrame(range(10)) 
In [6]: df
Out[6]: 
   0
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [7]: pd.rolling_mean(df, window=3, center=True)
Out[7]: 
    0
0 NaN
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9 NaN

When using the 'min_periods' argument, the behavior also works as expected:

In [8]: pd.rolling_mean(df, window=3, min_periods=1)
Out[8]: 
     0
0  0.0
1  0.5
2  1.0
3  2.0
4  3.0
5  4.0
6  5.0
7  6.0
8  7.0
9  8.0

When combining both the 'center' and 'min_periods' arguments, however, the result is incorrect. I believe the last entry should be 8.5 (the mean of the remaining window values [8, 9]).

In [9]: pd.rolling_mean(df, window=3, center=True, min_periods=1)
Out[9]: 
     0
0  0.5
1  1.0
2  2.0
3  3.0
4  4.0
5  5.0
6  6.0
7  7.0
8  8.0
9  NaN
In [10]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-4-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.19.2
numpy: 1.8.0
scipy: 0.13.0
statsmodels: 0.6.0.dev-bed3499
IPython: 2.0.0
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: None
tables: 2.3.1
numexpr: 2.0.1
matplotlib: 1.3.1
openpyxl: 1.5.8
xlrd: 0.6.1
xlwt: 0.7.4
xlsxwriter: None
sqlalchemy: None
lxml: None
bs4: None
html5lib: None
bq: None
apiclient: None
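The expectation described above can be made concrete with a plain-NumPy sketch (not pandas internals; `centered_rolling_mean` is a hypothetical helper, not a pandas function): for each position, average whatever values fall inside the centered window, requiring at least `min_periods` of them.

```python
import numpy as np

def centered_rolling_mean(values, window=3, min_periods=1):
    """Centered rolling mean over a 1-D array, NaN where too few values exist."""
    half = (window - 1) // 2
    n = len(values)
    out = np.full(n, np.nan)
    for i in range(n):
        # Clip the centered window to the array bounds.
        lo, hi = max(0, i - half), min(n, i + half + 1)
        win = values[lo:hi]
        if len(win) >= min_periods:
            out[i] = win.mean()
    return out

print(centered_rolling_mean(np.arange(10.0)))
# The last entry is mean([8.0, 9.0]) = 8.5, as expected above.
```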

jreback commented Aug 4, 2014

cc @seth-p want to have a look

the center keyword IIRC is really fragile (in its definition) and could easily be wrong. but not really sure.

jreback added the Numeric label Aug 4, 2014

seth-p commented Aug 4, 2014

I have to admit I've hardly ever even looked at the center argument. I'll take a look, but others may do better.

CC'ing @changhiskhan, who seems to have worked on _center_window().


seth-p commented Aug 4, 2014

OK, I see what's going on. The way the code works, when you call rolling_*(arg, ..., center=True/False) it first does the rolling calculations with the final window ending on the final element of arg. If center=False it simply labels the rows by the last value of the window, and stops there. If center=True, it relabels the rows based on the center of each window, filling in NaNs where necessary. Thus the center argument affects only the labeling/alignment of results, not their calculation.

So in the example provided, regardless of the value of center, the final calculated value is based on a window on [7.0, 8.0, 9.0], producing a mean of 8.0. With the default center=False, this would lead to row 9 having a value of 8.0. With center=True, this leads to row 8 having a value of 8.0, and row 9 is filled in with NaN.

Now, clearly what you're expecting is that, in the center=True case, the row 9 value be calculated as the mean of a final window [8.0, 9.0, NaN]. But the code isn't set up to do that.

I suppose that in the center=True case we could first pad the input data with an additional window/2 NaNs, do the calculations, and then adjust the indexes. I can't think of any obvious problems, other than performance issues, which probably aren't severe. (Note that most expanding_* functions are implemented in terms of the corresponding rolling_* functions with window=len(arg), so this has the potential to increase the data size by 50% when center=True. Though I suppose you could argue that the extra calculations are justified/required.) Ideally someone with more knowledge of the center parameter (@changhiskhan?) can opine.
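The padding idea can be sketched in plain NumPy (a minimal sketch of the proposed approach, not the actual patch; `trailing_mean` and `centered_mean` are illustrative names): pad the input with `(window - 1) // 2` trailing NaNs, run the ordinary trailing-window calculation, then drop the leading rows to re-center the labels.

```python
import numpy as np

def trailing_mean(values, window, min_periods):
    """Ordinary trailing rolling mean, NaN-aware, as rolling_* computes it."""
    n = len(values)
    out = np.full(n, np.nan)
    for i in range(n):
        win = values[max(0, i - window + 1):i + 1]
        win = win[~np.isnan(win)]  # count only actual observations
        if len(win) >= min_periods:
            out[i] = win.mean()
    return out

def centered_mean(values, window, min_periods):
    offset = (window - 1) // 2
    # Pad so the final windows extend past the end of the data.
    padded = np.concatenate([values, np.full(offset, np.nan)])
    result = trailing_mean(padded, window, min_periods)
    return result[offset:]  # re-center by dropping the leading rows

print(centered_mean(np.arange(10.0), window=3, min_periods=1))
# The final entry becomes mean([8.0, 9.0]) = 8.5 rather than NaN.
```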


seth-p commented Aug 4, 2014

For reference, the only difference between center=True and center=False is that _center_window() is called on the final result if center=True:

def _rolling_moment(arg, window, func, minp, axis=0, freq=None, center=False,
                    how=None, args=(), kwargs={}, **kwds):
    arg = _conv_timerule(arg, freq, how)
    calc = lambda x: func(x, window, minp=minp, args=args, kwargs=kwargs,
                          **kwds)
    return_hook, values = _process_data_structure(arg)
    # actually calculate the moment. Faster way to do this?
    if values.ndim > 1:
        result = np.apply_along_axis(calc, axis, values)
    else:
        result = calc(values)

    rs = return_hook(result)
    if center:
        rs = _center_window(rs, window, axis)
    return rs


def _center_window(rs, window, axis):
    if axis > rs.ndim-1:
        raise ValueError("Requested axis is larger then no. of argument "
                         "dimensions")

    offset = int((window - 1) / 2.)
    if isinstance(rs, (Series, DataFrame, Panel)):
        rs = rs.shift(-offset, axis=axis)
    else:
        rs_indexer = [slice(None)] * rs.ndim
        rs_indexer[axis] = slice(None, -offset)

        lead_indexer = [slice(None)] * rs.ndim
        lead_indexer[axis] = slice(offset, None)

        na_indexer = [slice(None)] * rs.ndim
        na_indexer[axis] = slice(-offset, None)

        rs[tuple(rs_indexer)] = np.copy(rs[tuple(lead_indexer)])
        rs[tuple(na_indexer)] = np.nan
    return rs

I must admit that I don't quite follow all the details in this slicing business.
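To unpack the slicing on the ndarray path: the result is shifted "left" by `offset` along the chosen axis, and the final `offset` positions (which have no centered window) are filled with NaN. A 1-D walkthrough with `offset = 1` (i.e. `window = 3`), mirroring the assignments in `_center_window` above:

```python
import numpy as np

# Trailing-window results, as produced before relabeling.
rs = np.array([np.nan, np.nan, 1.0, 2.0, 3.0])
offset = 1

rs[:-offset] = rs[offset:].copy()  # each row takes the value one step ahead
rs[-offset:] = np.nan              # the last offset rows have no centered value
print(rs)                          # nan, 1.0, 2.0, 3.0, nan
```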


jreback commented Aug 4, 2014

I debugged this a year or so ago. It was so tricky, mainly because I didn't understand what it was supposed to actually do in various cases.

That's why I say it's fragile.


seth-p commented Aug 4, 2014

Well, I think I could modify the code so that, in the case center=True, the rs provided to _center_window() has the extra int((window - 1) / 2.) values that aren't currently being calculated. However someone more knowledgeable than I would need to tweak _center_window()...


jreback commented Aug 4, 2014

I think what we really need is better documentation of what it's supposed to do.

Can you take a look at past PRs and see if you can figure it out?

It's the edge conditions that are tricky.


seth-p commented Aug 4, 2014

For starters, here's a stupid question: In _center_window(), what could rs be other than Series, DataFrame, or Panel?


jreback commented Aug 4, 2014

ahh
it could be an ndarray!

in theory the roll functions can deal with that as well (I think wesm wanted to support that originally)


seth-p commented Aug 4, 2014

OK, I see. I think it should be straightforward.


seth-p commented Aug 4, 2014

OK, I think seth-p@6e2eefc addresses the problem. Note that I still need to update the tests (they expect NaNs where now there are values). But let me know how this looks.


seth-p commented Aug 4, 2014

I've submitted a PR for this. If it looks ok for 0.15.0, I will update the v0.15.0 release notes.

nbdanielson (Author) commented:

@seth-p, @jreback Thanks guys, I appreciate your looking into this.

jreback added the Algos label Aug 5, 2014
jreback added this to the 0.15.0 milestone Aug 5, 2014

seth-p commented Aug 5, 2014

Note that this also affects expanding_* functions, e.g. (in 0.14.1):

In [4]: s = Series(range(10))

In [11]: expanding_mean(s, center=True)
Out[11]:
0    2.0
1    2.5
2    3.0
3    3.5
4    4.0
5    4.5
6    NaN
7    NaN
8    NaN
9    NaN
dtype: float64

In [12]: expanding_mean(s, min_periods=0, center=True)
Out[12]:
0    2.0
1    2.5
2    3.0
3    3.5
4    4.0
5    4.5
6    NaN
7    NaN
8    NaN
9    NaN
dtype: float64


seth-p commented Aug 6, 2014

Come to think of it, the expanding_mean(..., center=True) results look just plain odd, notwithstanding the final NaNs. There's no way in !$# that anyone would want that. I understand why they are the way they are: expanding_*(arg, ...) is calculated as rolling_*(arg, ..., window=len(arg), min_periods=0), so everything is shifted by a fixed (len(arg)-1)/2. I suppose the "right" thing to do is to "halve" the indices, rather than shift them, but it still seems sort of odd.
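The fixed shift can be reproduced in plain NumPy (a sketch of the explanation above, not pandas code): computing the expanding mean directly and then relabeling by `(len(arg) - 1) // 2`, as rolling_* with window=len(arg) and center=True effectively does, yields exactly the odd Out[11] result.

```python
import numpy as np

s = np.arange(10.0)
# Expanding mean computed directly: 0.0, 0.5, 1.0, ...
expanding = np.array([s[:i + 1].mean() for i in range(10)])

# center=True relabels everything by one fixed offset, since the
# "window" has a single constant length of len(s).
offset = (len(s) - 1) // 2  # 4
centered = np.full(10, np.nan)
centered[:-offset] = expanding[offset:]

print(centered)
# 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, then four NaNs -- matching Out[11] above
```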

Perhaps best simply to remove the center argument from all the expanding_* functions? (I'm not sure it's that easy, the way the code is written, but ignore that for now.) Given the odd behavior, I very much doubt anyone is using them with center=True.


seth-p commented Aug 6, 2014

Actually, looks like it's totally straightforward to remove center from all the expanding_* functions.


seth-p commented Aug 7, 2014

I removed center from all the expanding_* functions in #7934.
