Whitelist std and var for use with custom rolling windows #33448

AlexKirko · 2020-04-10T07:21:48Z

xref BUG: rolling window functions don't support custom indexers #32865
0 tests added / 0 passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

The problem

While researching which functions are broken for #32865, I added std and var to the list because their output didn't match the numpy output. I have since discovered that this is because we default to the sample variance formula for all window calculations, while numpy defaults to the population formula (ddof=0).
After closer examination, the algorithm itself turned out to be very robust against custom indexers. It is even resilient against non-monotonic window starts and ends.
There is nothing to do there, so we should revert blacklisting std and var. I don't think tests are necessary, since we aren't changing anything.
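To illustrate the point above, here is a minimal sketch (not part of the PR itself) that compares rolling var on a custom indexer with numpy. It assumes a pandas version that ships `FixedForwardWindowIndexer` in `pandas.api.indexers`; the sample data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

# Illustrative data; any float values would do.
values = np.array([1.0, 4.0, 2.0, 8.0, 5.0, 7.0])
series = pd.Series(values)
window_size = 3

# Forward-looking windows: window i covers values[i : i + window_size].
indexer = FixedForwardWindowIndexer(window_size=window_size)
rolling = series.rolling(indexer, min_periods=window_size)

rolling_sample = rolling.var()          # pandas default: ddof=1
rolling_population = rolling.var(ddof=0)

# Reference values computed window by window with numpy.
# Windows that run past the end of the data are incomplete, hence NaN.
n = len(values)


def reference(ddof):
    return np.array(
        [
            np.var(values[i : i + window_size], ddof=ddof)
            if i + window_size <= n
            else np.nan
            for i in range(n)
        ]
    )


expected_sample = reference(ddof=1)
expected_population = reference(ddof=0)  # numpy's own default

print(np.allclose(rolling_sample.to_numpy(), expected_sample, equal_nan=True))
print(np.allclose(rolling_population.to_numpy(), expected_population, equal_nan=True))
```

Both comparisons should agree, which is consistent with the observation that the mismatch with numpy came from the ddof default and not from the indexer handling.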

Food for thought

Some background information to make our decision here more informed:

I first believed these functions to be broken because using the sample variance formula for sliding windows makes no sense to me from a statistical viewpoint. We use sample variance when the dataset is a sample drawn from a larger population. A window is not a sample. When we calculate sliding window variance, we aren't interested in estimating the variance of some underlying general population of windows; we are interested in computing it correctly for each window, and thus each window is the population.

However, as we discussed with @mroeschke:

Usually users are interested in calculating variance for large windows, and the difference between the two variance formulas is proportional to 1 / (window_size - 1) - 1 / window_size, which shrinks quickly as the window grows.
We use sample variance as a default everywhere else in pandas.
The user can specify rolling.var(ddof=0) to set degrees of freedom to zero and get population variance, if they know what they want and are aware that pandas uses sample variance by default.
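As a rough sketch of the points above (the data and variable names are invented for illustration), the following compares the default rolling var with rolling.var(ddof=0) and checks that the gap between them equals the per-window sum of squared deviations times 1 / (window_size - 1) - 1 / window_size.

```python
import numpy as np
import pandas as pd

window_size = 4
series = pd.Series([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 3.0, 6.0])
rolling = series.rolling(window_size)

sample_var = rolling.var()            # default, ddof=1
population_var = rolling.var(ddof=0)  # opt-in population variance

# Sum of squared deviations from the window mean, computed per window.
# raw=True passes each window to the function as a numpy array.
sum_sq_dev = rolling.apply(lambda w: ((w - w.mean()) ** 2).sum(), raw=True)

# The gap between the two formulas, as described in the list above.
factor = 1 / (window_size - 1) - 1 / window_size
expected_gap = sum_sq_dev * factor
actual_gap = sample_var - population_var

print(np.allclose(actual_gap.to_numpy(), expected_gap.to_numpy(), equal_nan=True))

# The factor itself for a few window sizes, to show how fast it decays.
for n in (2, 10, 100, 1000):
    print(n, 1 / (n - 1) - 1 / n)
```

The factor equals 1 / (n * (n - 1)), so for windows of a few hundred observations the two defaults differ only marginally, while for very short windows the choice of ddof matters a lot.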

So the default doesn't make much sense, but it is consistent with the rest of our software, and the harm is negligible for most use cases. The harm in changing the default would be that people who know the package well might expect the sample variance default.

Apologies for the long read. It's a part of my job to find mistakes in data science models, so I'm sensitive to stuff like this.

AlexKirko · 2020-04-10T08:50:02Z

/azp run

azure-pipelines · 2020-04-10T08:50:06Z

Commenter does not have sufficient privileges for PR 33448 in repo pandas-dev/pandas

AlexKirko · 2020-04-10T08:50:33Z

Well, better restart the old-fashioned way then...

mroeschke · 2020-04-10T16:21:16Z

Could you expand on this test (or create a new one) that checks the result of the std/var operation?

https://github.com/pandas-dev/pandas/pull/33180/files#diff-c0ea38b081d5b0c3cf20dc8646f38cefR106

jreback · 2020-04-10T16:55:52Z

also if you would like to expand the docs of std/var (for windows) would be helpful (e.g. adding explanations like above)

AlexKirko · 2020-04-11T08:18:58Z

@mroeschke
Sure! Restructured the test a bit to accommodate the required flexibility.

AlexKirko · 2020-04-11T08:55:20Z

@jreback
Done. In addition, referenced the new note in the expanding windows section, as the user should also be conscious of what they are doing when working with expanding windows, same as with rolling ones.

Also cleaned up a bit. Bessel-adjusted sample standard deviation is a tautology. All sample estimates of variance are Bessel-adjusted for degrees of freedom.

mroeschke · 2020-04-12T21:48:31Z

doc/source/user_guide/computation.rst

+   Please note that :meth:`~Rolling.std` and :meth:`~Rolling.var` use the sample
+   variance formula by default, i.e. the sum of squared differences is divided by
+   ``window_size - 1`` and not by ``window_size`` during averaging. In statistics,
+   we use sample  when the dataset is drawn from a larger population that we


Nit: Looks like you have 2 spaces between sample and when

mroeschke · 2020-04-12T21:51:04Z

pandas/tests/window/test_base_indexer.py

@@ -97,13 +97,57 @@ def get_window_bounds(self, num_values, min_periods, center, closed):

 @pytest.mark.parametrize("constructor", [Series, DataFrame])
 @pytest.mark.parametrize(
-    "func,alt_func,expected",
+    "func,np_func,expected,pd_kwargs,np_kwargs",


We can remove pd_kwargs for now since it is empty in all these cases?
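For readers following along, a check with the parameters left after dropping pd_kwargs might look roughly like the sketch below. This is not the test from the PR: it uses a plain loop instead of pytest parametrization, and the helper name check_case, the CASES list, and the data are hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

# (pandas rolling method, numpy reference function, kwargs for numpy)
# ddof=1 is needed on the numpy side to match the pandas default.
CASES = [
    ("std", np.std, {"ddof": 1}),
    ("var", np.var, {"ddof": 1}),
    ("max", np.max, {}),
]


def check_case(func, np_func, np_kwargs, window_size=3):
    """Compare a rolling aggregation on forward windows with numpy."""
    values = np.arange(10, dtype=float) ** 2
    series = pd.Series(values)
    indexer = FixedForwardWindowIndexer(window_size=window_size)
    rolling = series.rolling(indexer, min_periods=window_size)
    result = getattr(rolling, func)()

    n = len(values)
    expected = pd.Series(
        [
            np_func(values[i : i + window_size], **np_kwargs)
            if i + window_size <= n
            else np.nan  # incomplete trailing windows
            for i in range(n)
        ]
    )
    # Raises AssertionError if the two series differ beyond tolerance.
    pd.testing.assert_series_equal(result, expected)


for func, np_func, np_kwargs in CASES:
    check_case(func, np_func, np_kwargs)
```

Keeping the numpy keyword arguments as a separate parameter makes the ddof difference between the two libraries explicit in each case.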

AlexKirko · 2020-04-13T07:34:33Z

The numpy dev pipeline is failing tests. This appears to be caused by the dev cython version and was reported in #33507.

AlexKirko · 2020-04-14T05:57:42Z

@mroeschke @jreback
All comments addressed, all green.

AlexKirko · 2020-04-15T08:44:52Z

Okay, moving on to the next function.

@jreback, can we merge this PR?

jreback · 2020-04-17T02:54:59Z

thanks!

…#33448)
* stop throwing NotImplemented on std and var
* DOC: edit whatsnew
* restart checks
* restart checks
* TST: add kwargs to tests
* TST: add tests for std and var
* DOC: expand documentation on sample variance
* CLN: remove trailing whitespace
* CLN: remove double space
* CLN: remove pd_kwargs from the test

AlexKirko added 2 commits April 9, 2020 11:50

stop throwing NotImplemented on std and var

cb76506

DOC: edit whatsnew

5062fb5

AlexKirko added 2 commits April 10, 2020 11:50

restart checks

ada2a7d

restart checks

3c712c9

jreback added Window rolling, ewma, expanding Docs labels Apr 10, 2020

AlexKirko added 2 commits April 11, 2020 10:40

TST: add kwargs to tests

eece9e7

TST: add tests for std and var

2cf13c7

DOC: expand documentation on sample variance

9f3540e

CLN: remove trailing whitespace

2351c4f

mroeschke reviewed Apr 12, 2020

AlexKirko added 2 commits April 13, 2020 08:56

CLN: remove double space

9f286aa

CLN: remove pd_kwargs from the test

0318f88

Merge branch 'master' into rolling-std-var

f4cf6fc

AlexKirko requested a review from mroeschke April 14, 2020 05:57

mroeschke approved these changes Apr 14, 2020

jreback added this to the 1.1 milestone Apr 17, 2020

jreback approved these changes Apr 17, 2020

jreback merged commit 0e37717 into pandas-dev:master Apr 17, 2020

AlexKirko deleted the rolling-std-var branch April 17, 2020 06:27
