-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Whitelist std and var for use with custom rolling windows #33448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/azp run |
Commenter does not have sufficient privileges for PR 33448 in repo pandas-dev/pandas |
Well, better restart the old-fashioned way then... |
Could you expand on this test (or create a new one) that checks the result of the |
also if you would like to expand the docs of std/var (for windows) would be helpful (e.g. adding explanations like above) |
@mroeschke |
@jreback Also cleaned up a bit. Bessel-adjusted sample standard deviation is a tautology. All sample estimates of variance are Bessel-adjusted for degrees of freedom. |
Please note that :meth:`~Rolling.std` and :meth:`~Rolling.var` use the sample | ||
variance formula by default, i.e. the sum of squared differences is divided by | ||
``window_size - 1`` and not by ``window_size`` during averaging. In statistics, | ||
we use sample when the dataset is drawn from a larger population that we |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Looks like you have 2 spaces between sample
and when
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -97,13 +97,57 @@ def get_window_bounds(self, num_values, min_periods, center, closed): | |||
|
|||
@pytest.mark.parametrize("constructor", [Series, DataFrame]) | |||
@pytest.mark.parametrize( | |||
"func,alt_func,expected", | |||
"func,np_func,expected,pd_kwargs,np_kwargs", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove pd_kwargs
for now since it empty in all these cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Numpy dev pipeline is failing tests. Appears to be because of dev cython version and was reported in #33507 . |
@mroeschke @jreback |
Okay, moving on to the next function. @jreback, can we merge this PR? |
thanks! |
…#33448) * stop throwing NotImplemented on std and var * DOC: edit whatsnew * restart checks * restart checks * TST: add kwargs to tests * TST: add tests for std and var * DOC: expand documentation on sample variance * CLN: remove trailing whitespace * CLN: remove double space * CLN: remove pd_kwargs from the test
…#33448) * stop throwing NotImplemented on std and var * DOC: edit whatsnew * restart checks * restart checks * TST: add kwargs to tests * TST: add tests for std and var * DOC: expand documentation on sample variance * CLN: remove trailing whitespace * CLN: remove double space * CLN: remove pd_kwargs from the test
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
The problem
While researching what funcitons are broken for #32865 , I added
std
andvar
to the list since their output didn't match numpy output. I have since discovered that this is because we default to the sample variance formula for all window calculations.After closer examination, the algorithm itself turned out to be very robust against custom indexers. It is even resilient against non-monothonic window starts and ends.
There is nothing to do there, so we should revert blacklisting
std
andvar
. I don't think tests are necessary, since we aren't changing anything.Food for thought
Some background information to make our decision here more informed:
The reason I first believed these functions to be broken is because using the sample variance formula for sliding windows makes no sense to me from a statistical viewpoint. We use sample variance when the dataset is a sample drawn from a larger population. A window is not a sample. When we calculate sliding window variance, we aren't interested in getting the correct variance for some underlying general window, we are interested in computing it correctly for each window, and thus each window is the population.
However, as we discussed with @mroeschke:
1 / (window_size - 1) - 1 / window_size
rolling.var(ddof=0)
to set degrees of freedom to zero and get population variance, if they know what they want and are aware that pandas uses sample variance by default.So the default doesn't make much sense, but it is consistent with the rest of our software, and the harm is negligible for most use cases. The harm in changing the default would be that people who know the package well might expect the sample variance default.
Apologies for the long read. It's a part of my job to find mistakes in data science models, so I'm sensitive to stuff like this.