Skip to content

Whitelist std and var for use with custom rolling windows #33448

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 17, 2020

Conversation

AlexKirko
Copy link
Member

@AlexKirko AlexKirko commented Apr 10, 2020

The problem

While researching what funcitons are broken for #32865 , I added std and var to the list since their output didn't match numpy output. I have since discovered that this is because we default to the sample variance formula for all window calculations.
After closer examination, the algorithm itself turned out to be very robust against custom indexers. It is even resilient against non-monothonic window starts and ends.
There is nothing to do there, so we should revert blacklisting std and var. I don't think tests are necessary, since we aren't changing anything.

Food for thought

Some background information to make our decision here more informed:

The reason I first believed these functions to be broken is because using the sample variance formula for sliding windows makes no sense to me from a statistical viewpoint. We use sample variance when the dataset is a sample drawn from a larger population. A window is not a sample. When we calculate sliding window variance, we aren't interested in getting the correct variance for some underlying general window, we are interested in computing it correctly for each window, and thus each window is the population.

However, as we discussed with @mroeschke:

  1. Usually users are interested in calculating variance for large windows, and the difference between formulas for variance is proportional to 1 / (window_size - 1) - 1 / window_size
  2. We use sample variance as a default everywhere else in pandas.
  3. The user can specify rolling.var(ddof=0) to set degrees of freedom to zero and get population variance, if they know what they want and are aware that pandas uses sample variance by default.

So the default doesn't make much sense, but it is consistent with the rest of our software, and the harm is negligible for most use cases. The harm in changing the default would be that people who know the package well might expect the sample variance default.

Apologies for the long read. It's a part of my job to find mistakes in data science models, so I'm sensitive to stuff like this.

@AlexKirko
Copy link
Member Author

/azp run

@azure-pipelines
Copy link
Contributor

Commenter does not have sufficient privileges for PR 33448 in repo pandas-dev/pandas

@AlexKirko
Copy link
Member Author

Well, better restart the old-fashioned way then...

@mroeschke
Copy link
Member

Could you expand on this test (or create a new one) that checks the result of the std/var operation?

https://github.com/pandas-dev/pandas/pull/33180/files#diff-c0ea38b081d5b0c3cf20dc8646f38cefR106

@jreback jreback added Window rolling, ewma, expanding Docs labels Apr 10, 2020
@jreback
Copy link
Contributor

jreback commented Apr 10, 2020

also if you would like to expand the docs of std/var (for windows) would be helpful (e.g. adding explanations like above)

@AlexKirko
Copy link
Member Author

@mroeschke
Sure! Restructured the test a bit to accomodate the required flexibility.

@AlexKirko
Copy link
Member Author

AlexKirko commented Apr 11, 2020

@jreback
Done. In addition, referenced the new note in the expanding windows section, as the user should also be conscious of what they are doing when working with expanding wnidows, same as with rolling ones.

Also cleaned up a bit. Bessel-adjusted sample standard deviation is a tautology. All sample estimates of variance are Bessel-adjusted for degrees of freedom.

Please note that :meth:`~Rolling.std` and :meth:`~Rolling.var` use the sample
variance formula by default, i.e. the sum of squared differences is divided by
``window_size - 1`` and not by ``window_size`` during averaging. In statistics,
we use sample when the dataset is drawn from a larger population that we
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Looks like you have 2 spaces between sample and when

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -97,13 +97,57 @@ def get_window_bounds(self, num_values, min_periods, center, closed):

@pytest.mark.parametrize("constructor", [Series, DataFrame])
@pytest.mark.parametrize(
"func,alt_func,expected",
"func,np_func,expected,pd_kwargs,np_kwargs",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove pd_kwargs for now since it empty in all these cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@AlexKirko
Copy link
Member Author

AlexKirko commented Apr 13, 2020

Numpy dev pipeline is failing tests. Appears to be because of dev cython version and was reported in #33507 .

@AlexKirko
Copy link
Member Author

AlexKirko commented Apr 14, 2020

@mroeschke @jreback
All comments addressed, all green.

@AlexKirko AlexKirko requested a review from mroeschke April 14, 2020 05:57
@AlexKirko
Copy link
Member Author

Okay, moving on to the next function.

@jreback, can we merge this PR?

@jreback jreback added this to the 1.1 milestone Apr 17, 2020
@jreback jreback merged commit 0e37717 into pandas-dev:master Apr 17, 2020
@jreback
Copy link
Contributor

jreback commented Apr 17, 2020

thanks!

@AlexKirko AlexKirko deleted the rolling-std-var branch April 17, 2020 06:27
CloseChoice pushed a commit to CloseChoice/pandas that referenced this pull request Apr 20, 2020
…#33448)

* stop throwing NotImplemented on std and var

* DOC: edit whatsnew

* restart checks

* restart checks

* TST: add kwargs to tests

* TST: add tests for std and var

* DOC: expand documentation on sample variance

* CLN: remove trailing whitespace

* CLN: remove double space

* CLN: remove pd_kwargs from the test
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
…#33448)

* stop throwing NotImplemented on std and var

* DOC: edit whatsnew

* restart checks

* restart checks

* TST: add kwargs to tests

* TST: add tests for std and var

* DOC: expand documentation on sample variance

* CLN: remove trailing whitespace

* CLN: remove double space

* CLN: remove pd_kwargs from the test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants