API: meaning of min_periods for ewm*() functions? #7977

Closed
seth-p opened this issue Aug 10, 2014 · 9 comments · Fixed by #7926
Labels
API Design Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@seth-p
Contributor

seth-p commented Aug 10, 2014

The interpretation of min_periods in the ewm*() functions seems rather odd to me. For example (in 0.14.1):

In [19]: x
Out[19]:
0     0
1   NaN
2   NaN
3   NaN
4     4
5   NaN
6     6
dtype: float64

In [20]: ewma(x, com=3., min_periods=2)
Out[20]:
0         NaN
1         NaN
2    0.000000
3    0.000000
4    2.285714
5    2.285714
6    3.891892
dtype: float64

It works by finding the first non-NaN value (0 in the example above) and then setting the min_periods result entries starting at that position to NaN (min_periods-1 entries in 0.15.0, per #7898). Does it make any sense that the result has entry 0 set to NaN, but entries 2 and 3 (and 1 in 0.15.0) set to 0.0?
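
To make the rule just described concrete, here is a minimal pure-Python sketch of which result positions get overwritten. It illustrates the description above and is not the pandas source; the helper names `first_valid_index` and `legacy_nan_positions` are hypothetical, and the window widths (min_periods in 0.14.1, min_periods-1 in 0.15.0) are taken from this comment.

```python
import math

nan = float("nan")


def first_valid_index(values):
    """Return the position of the first non-NaN entry, or None if all are NaN."""
    for i, v in enumerate(values):
        if not math.isnan(v):
            return i
    return None


def legacy_nan_positions(values, width):
    """Positions overwritten with NaN: `width` entries starting at the first valid one.

    Per the description above, width is min_periods in 0.14.1 and
    min_periods - 1 in 0.15.0 (hypothetical helper, for illustration only).
    """
    start = first_valid_index(values)
    if start is None:
        return []
    return list(range(start, min(start + width, len(values))))


x = [0.0, nan, nan, nan, 4.0, nan, 6.0]
min_periods = 2

print(legacy_nan_positions(x, min_periods))      # 0.14.1 rule
print(legacy_nan_positions(x, min_periods - 1))  # 0.15.0 rule
```

For the x above, the window covers positions 0 and 1 under the 0.14.1 rule and only position 0 under the 0.15.0 rule, regardless of how many of those positions actually hold observations.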

I would have thought that the values to be explicitly NaNed would be those determined by x.notnull().cumsum() < min_periods. This would be consistent with the meaning of min_periods in the rolling_*() and expanding_*() functions.

CC'ing @snth and @jaimefrio, in case they have opinions.

@seth-p
Contributor Author

seth-p commented Aug 16, 2014

Anyone have an opinion on this?

Specifically, I would replace

        if min_periods > 1:
            first_index = _first_valid_index(v)
            result[first_index: first_index + min_periods - 1] = NaN

with

        if min_periods > 1:
            result[notnull(v).cumsum() < min_periods] = NaN

I suppose another alternative is simply:

        if min_periods > 1:
            result[: min_periods - 1] = NaN

It just seems odd to me that input NaNs are taken into account in determining when to start the NaNing-out window, but not when it ends. I would think the presence of input NaNs should be taken into account in determining both the start and end of the NaNing-out window (i.e. result[notnull(v).cumsum() < min_periods] = NaN, which is consistent with the rolling_*() and expanding_*() functions), or completely ignored (i.e. result[: min_periods - 1] = NaN).

Given that the ewm*() functions are essentially expanding functions, I think it makes most sense for their treatment of min_periods to be consistent with the expanding_*() (and rolling_*()) functions, i.e. result[notnull(v).cumsum() < min_periods] = NaN.
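
As a rough illustration of how the two alternatives differ, the sketch below computes both masks in plain Python for the x from the first comment. The function names `alt1_mask` and `alt2_mask` are made up for this example; only the two indexing expressions they mirror come from the snippets above.

```python
import math
from itertools import accumulate

nan = float("nan")


def alt1_mask(values, min_periods):
    """Mirror of notnull(v).cumsum() < min_periods: True where the result would be NaNed."""
    counts = accumulate(0 if math.isnan(v) else 1 for v in values)
    return [count < min_periods for count in counts]


def alt2_mask(values, min_periods):
    """Mirror of result[: min_periods - 1] = NaN: input NaNs are ignored entirely."""
    return [i < min_periods - 1 for i in range(len(values))]


x = [0.0, nan, nan, nan, 4.0, nan, 6.0]

print(alt1_mask(x, min_periods=2))
print(alt2_mask(x, min_periods=2))
```

With min_periods=2, the count-based mask covers positions 0 through 3, because the second observation only arrives at position 4, while the positional mask covers position 0 alone.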

@jreback
Contributor

jreback commented Aug 16, 2014

@seth-p I think this is right, can you give an example using (your second alternative to set nans)?

@seth-p
Contributor Author

seth-p commented Aug 17, 2014

Here are some examples of data frames showing:
(a) a Series s;
(b) expanding_mean(s, min_periods=2) (just for comparison);
(c) ewma(s, com=3.0, min_periods=2) as produced in v0.14.1;
(d) ewma(s, com=3.0, min_periods=2) as currently produced in master, per #7884;
(e) ewma(s, com=3.0, min_periods=2) as I propose using result[s.notnull().cumsum() < min_periods] = NaN -- Alt 1; and
(f) ewma(s, com=3.0, min_periods=2) using result[:(min_periods-1)] = NaN -- Alt 2.

     s  expanding_mean(s,  ewma(s, com=3.0, min_periods=2)
           min_periods=2)                          v0.14.1   GH 7884     Alt 1     Alt 2
   (a)                (b)                              (c)       (d)       (e)       (f)
0    1                NaN                              NaN       NaN       NaN       NaN
1    2                1.5                              NaN  1.571429  1.571429  1.571429
2    3                2.0                         2.189189  2.189189  2.189189  2.189189
3    4                2.5                         2.851429  2.851429  2.851429  2.851429

     s  expanding_mean(s,  ewma(s, com=3.0, min_periods=2)
           min_periods=2)                          v0.14.1   GH 7884     Alt 1     Alt 2
   (a)                (b)                              (c)       (d)       (e)       (f)
0  NaN                NaN                              NaN       NaN       NaN       NaN
1    1                NaN                              NaN       NaN       NaN  1.000000
2    2                1.5                              NaN  1.571429  1.571429  1.571429
3    3                2.0                         2.189189  2.189189  2.189189  2.189189
4    4                2.5                         2.851429  2.851429  2.851429  2.851429

     s  expanding_mean(s,  ewma(s, com=3.0, min_periods=2)
           min_periods=2)                          v0.14.1   GH 7884     Alt 1     Alt 2
   (a)                (b)                              (c)       (d)       (e)       (f)
0  NaN                NaN                              NaN       NaN       NaN       NaN
1  NaN                NaN                              NaN       NaN       NaN       NaN
2    1                NaN                              NaN       NaN       NaN  1.000000
3    2                1.5                              NaN  1.571429  1.571429  1.571429
4    3                2.0                         2.189189  2.189189  2.189189  2.189189
5    4                2.5                         2.851429  2.851429  2.851429  2.851429

     s  expanding_mean(s,  ewma(s, com=3.0, min_periods=2)
           min_periods=2)                          v0.14.1   GH 7884     Alt 1     Alt 2
   (a)                (b)                              (c)       (d)       (e)       (f)
0  NaN                NaN                              NaN       NaN       NaN       NaN
1  NaN                NaN                              NaN       NaN       NaN       NaN
2    1                NaN                              NaN       NaN       NaN  1.000000
3  NaN                NaN                              NaN  1.000000       NaN  1.000000
4  NaN                NaN                         1.000000  1.000000       NaN  1.000000
5    2                1.5                         1.571429  1.571429  1.571429  1.571429
6    3                2.0                         2.189189  2.189189  2.189189  2.189189
7    4                2.5                         2.851429  2.851429  2.851429  2.851429

Alt 1: result[s.notnull().cumsum() < min_periods] = NaN
Alt 2: result[:(min_periods-1)] = NaN

The following is the code that produced the tables above.

from pandas import DataFrame, MultiIndex, Series, expanding_mean, ewma, options, version
from numpy import NaN


def make_df(s, com, min_periods):
    df = DataFrame(index=s.index,
                   columns = MultiIndex.from_tuples([('s', '', '(a)'),
                                                     ('expanding_mean(s,', ' min_periods={})'.format(min_periods), '(b)'),
                                                     ('ewma(s, com={}, min_periods={})'.format(com, min_periods), 'v0.14.1', '(c)'),
                                                     ('ewma(s, com={}, min_periods={})'.format(com, min_periods), 'GH 7884', '(d)'),
                                                     ('ewma(s, com={}, min_periods={})'.format(com, min_periods), 'Alt 1', '(e)'),
                                                     ('ewma(s, com={}, min_periods={})'.format(com, min_periods), 'Alt 2', '(f)'),
                                                    ]))
    df.iloc[:, 0] = s
    df.iloc[:, 1] = expanding_mean(s, min_periods=min_periods)
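    # On versions after 0.14.1, ewma() NaNs out one fewer entry, so shift
    # min_periods for the two ewma() calls only. Alt 1 and Alt 2 below must
    # keep using the original min_periods, so it is not modified in place.
    shift = 1 if tuple(version.version.split('.')) > ('0', '14', '1') else 0
    df.iloc[:, 2] = ewma(s, com=com, min_periods=min_periods + shift)
    df.iloc[:, 3] = ewma(s, com=com, min_periods=min_periods + shift - 1)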
    y = ewma(s, com=com, min_periods=0)
    alt1 = y.copy()
    alt1[s.notnull().cumsum() < min_periods] = NaN
    alt2 = y.copy()
    alt2[:(min_periods-1)] = NaN
    df.iloc[:, 4] = alt1
    df.iloc[:, 5] = alt2
    return df

options.display.width = 100
for s in [Series([1,2,3,4]),
          Series([None,1,2,3,4]),
          Series([None,None,1,2,3,4]),
          Series([None,None,1,None,None,2,3,4]),
         ]:
    print(make_df(s, com=3., min_periods=2))
    print()

print("Alt 1: result[s.notnull().cumsum() < min_periods] = NaN")
print("Alt 2: result[:(min_periods-1)] = NaN")
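
For readers without a 0.14.1 install, the ewma numbers in the tables can be reproduced with a small pure-Python sketch. It assumes the adjusted weighting implied by the outputs in this thread (decay com/(1+com), NaN inputs skipped without decaying the weights, last value carried forward). That is an inference from the printed results, not a copy of the pandas implementation, and the names `ewma_sketch` and `apply_alt1` are hypothetical.

```python
import math

nan = float("nan")


def ewma_sketch(values, com):
    """Adjusted exponentially weighted mean that skips NaN inputs (assumed behavior)."""
    decay = com / (1.0 + com)
    numer = denom = 0.0
    out = []
    for v in values:
        if not math.isnan(v):
            # Only real observations age the older weights.
            numer = numer * decay + v
            denom = denom * decay + 1.0
        # Before the first observation there is nothing to average.
        out.append(numer / denom if denom > 0 else nan)
    return out


def apply_alt1(values, smoothed, min_periods):
    """Alt 1: NaN out results until min_periods non-NaN inputs have been seen."""
    seen = 0
    out = []
    for v, y in zip(values, smoothed):
        if not math.isnan(v):
            seen += 1
        out.append(y if seen >= min_periods else nan)
    return out


s = [nan, nan, 1.0, nan, nan, 2.0, 3.0, 4.0]
raw = ewma_sketch(s, com=3.0)
alt1 = apply_alt1(s, raw, min_periods=2)

for i, (value, y) in enumerate(zip(s, alt1)):
    print(i, value, round(y, 6))
```

Under these assumptions the sketch matches column (e) of the last table: NaN through position 4, then 1.571429, 2.189189 and 2.851429.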

@seth-p
Contributor Author

seth-p commented Aug 18, 2014

Anyone have any thoughts? As I mentioned above, I favor "Alt 1", for consistency with the rolling/expanding_*() functions.

@jreback
Contributor

jreback commented Aug 18, 2014

I agree on Alt 1 (Alt 2 is a bit greedy).

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Aug 18, 2014
@seth-p
Contributor Author

seth-p commented Aug 18, 2014

OK, I will incorporate this into #7926.

@jreback
Contributor

jreback commented Aug 18, 2014

maybe add something like the above chart into the release notes (but only show the 0.14.1 and the new)

@seth-p
Contributor Author

seth-p commented Aug 18, 2014

Good idea.

@jreback
Contributor

jreback commented Aug 18, 2014

also could be a note in computation.rst (as sort of an edge case example)

@seth-p changed the title from "BUG/API(?) meaning of min_periods for ewm*() functions?" to "API: meaning of min_periods for ewm*() functions?" Aug 21, 2014