Commit 0e37717

Whitelist std and var for use with custom rolling windows (#33448)
* stop throwing NotImplemented on std and var
* DOC: edit whatsnew
* restart checks
* restart checks
* TST: add kwargs to tests
* TST: add tests for std and var
* DOC: expand documentation on sample variance
* CLN: remove trailing whitespace
* CLN: remove double space
* CLN: remove pd_kwargs from the test
1 parent 1834db2 commit 0e37717

4 files changed, +88 -12 lines changed

doc/source/user_guide/computation.rst

+31 -4
@@ -312,15 +312,35 @@ We provide a number of common statistical functions:
     :meth:`~Rolling.median`, Arithmetic median of values
     :meth:`~Rolling.min`, Minimum
     :meth:`~Rolling.max`, Maximum
-    :meth:`~Rolling.std`, Bessel-corrected sample standard deviation
-    :meth:`~Rolling.var`, Unbiased variance
+    :meth:`~Rolling.std`, Sample standard deviation
+    :meth:`~Rolling.var`, Sample variance
     :meth:`~Rolling.skew`, Sample skewness (3rd moment)
     :meth:`~Rolling.kurt`, Sample kurtosis (4th moment)
     :meth:`~Rolling.quantile`, Sample quantile (value at %)
     :meth:`~Rolling.apply`, Generic apply
     :meth:`~Rolling.cov`, Unbiased covariance (binary)
     :meth:`~Rolling.corr`, Correlation (binary)
 
+.. _computation.window_variance.caveats:
+
+.. note::
+
+   Please note that :meth:`~Rolling.std` and :meth:`~Rolling.var` use the sample
+   variance formula by default, i.e. the sum of squared differences is divided by
+   ``window_size - 1`` and not by ``window_size`` during averaging. In statistics,
+   we use sample when the dataset is drawn from a larger population that we
+   don't have access to. Using it implies that the data in our window is a
+   random sample from the population, and we are interested not in the variance
+   inside the specific window but in the variance of some general window that
+   our windows represent. In this situation, using the sample variance formula
+   results in an unbiased estimator and so is preferred.
+
+   Usually, we are instead interested in the variance of each window as we slide
+   it over the data, and in this case we should specify ``ddof=0`` when calling
+   these methods to use population variance instead of sample variance. Using
+   sample variance under the circumstances would result in a biased estimator
+   of the variable we are trying to determine.
+
 .. _stats.rolling_apply:
 
 Rolling apply
@@ -848,15 +868,22 @@ Method summary
     :meth:`~Expanding.median`, Arithmetic median of values
     :meth:`~Expanding.min`, Minimum
     :meth:`~Expanding.max`, Maximum
-    :meth:`~Expanding.std`, Unbiased standard deviation
-    :meth:`~Expanding.var`, Unbiased variance
+    :meth:`~Expanding.std`, Sample standard deviation
+    :meth:`~Expanding.var`, Sample variance
     :meth:`~Expanding.skew`, Unbiased skewness (3rd moment)
     :meth:`~Expanding.kurt`, Unbiased kurtosis (4th moment)
     :meth:`~Expanding.quantile`, Sample quantile (value at %)
     :meth:`~Expanding.apply`, Generic apply
     :meth:`~Expanding.cov`, Unbiased covariance (binary)
     :meth:`~Expanding.corr`, Correlation (binary)
 
+.. note::
+
+   Using sample variance formulas for :meth:`~Expanding.std` and
+   :meth:`~Expanding.var` comes with the same caveats as using them with rolling
+   windows. See :ref:`this section <computation.window_variance.caveats>` for more
+   information.
+
 .. currentmodule:: pandas
 
 Aside from not having a ``window`` parameter, these functions have the same
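
The notes added to computation.rst contrast dividing the sum of squared differences by ``window_size - 1`` (sample variance, the default) with dividing by ``window_size`` (population variance, requested with ``ddof=0``). As a minimal sketch of that arithmetic for a single window, the plain-Python helper below may help; ``window_variance`` is a hypothetical name used only for illustration and is not a pandas function.

```python
from typing import Sequence


def window_variance(window: Sequence[float], ddof: int = 1) -> float:
    """Variance of one window (hypothetical helper, not part of pandas).

    ddof=1 divides by n - 1 (sample variance);
    ddof=0 divides by n (population variance).
    """
    n = len(window)
    if n - ddof <= 0:
        # Too few observations for the requested degrees of freedom.
        return float("nan")
    mean = sum(window) / n
    squared_diffs = sum((x - mean) ** 2 for x in window)
    return squared_diffs / (n - ddof)


window = [1.0, 2.0, 3.0, 4.0]
sample_var = window_variance(window, ddof=1)      # sum of squares 5.0 divided by 3
population_var = window_variance(window, ddof=0)  # sum of squares 5.0 divided by 4
print(sample_var, population_var)
```

The two results differ only in the divisor, which is why the choice matters most for small windows.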

doc/source/whatsnew/v1.1.0.rst

+1 -1
@@ -174,7 +174,7 @@ Other API changes
 - Added :meth:`DataFrame.value_counts` (:issue:`5377`)
 - :meth:`Groupby.groups` now returns an abbreviated representation when called on large dataframes (:issue:`1135`)
 - ``loc`` lookups with an object-dtype :class:`Index` and an integer key will now raise ``KeyError`` instead of ``TypeError`` when key is missing (:issue:`31905`)
-- Using a :func:`pandas.api.indexers.BaseIndexer` with ``std``, ``var``, ``count``, ``skew``, ``cov``, ``corr`` will now raise a ``NotImplementedError`` (:issue:`32865`)
+- Using a :func:`pandas.api.indexers.BaseIndexer` with ``count``, ``skew``, ``cov``, ``corr`` will now raise a ``NotImplementedError`` (:issue:`32865`)
 - Using a :func:`pandas.api.indexers.BaseIndexer` with ``min``, ``max`` will now return correct results for any monotonic :func:`pandas.api.indexers.BaseIndexer` descendant (:issue:`32865`)
 - Added a :func:`pandas.api.indexers.FixedForwardWindowIndexer` class to support forward-looking windows during ``rolling`` operations.
 -

pandas/core/window/common.py

+11 -1
@@ -327,7 +327,17 @@ def func(arg, window, min_periods=None):
 
 def validate_baseindexer_support(func_name: Optional[str]) -> None:
     # GH 32865: These functions work correctly with a BaseIndexer subclass
-    BASEINDEXER_WHITELIST = {"min", "max", "mean", "sum", "median", "kurt", "quantile"}
+    BASEINDEXER_WHITELIST = {
+        "min",
+        "max",
+        "mean",
+        "sum",
+        "median",
+        "std",
+        "var",
+        "kurt",
+        "quantile",
+    }
     if isinstance(func_name, str) and func_name not in BASEINDEXER_WHITELIST:
         raise NotImplementedError(
             f"{func_name} is not supported with using a BaseIndexer "
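
To illustrate the effect of the enlarged whitelist, the standalone sketch below mirrors the check in this hunk. The set contents and the condition are taken from the diff. The end of the error message is cut off in the hunk, so the message here is a placeholder rather than the pandas wording, and ``is_supported`` is a hypothetical convenience wrapper.

```python
from typing import Optional

# Names copied from the new side of the diff.
BASEINDEXER_WHITELIST = {
    "min",
    "max",
    "mean",
    "sum",
    "median",
    "std",
    "var",
    "kurt",
    "quantile",
}


def validate_baseindexer_support(func_name: Optional[str]) -> None:
    # Same condition as in the diff: only string names outside the set are rejected.
    if isinstance(func_name, str) and func_name not in BASEINDEXER_WHITELIST:
        # Placeholder text; the real message is truncated in the diff.
        raise NotImplementedError(f"{func_name}: placeholder 'not supported' message")


def is_supported(func_name: Optional[str]) -> bool:
    """Hypothetical helper: True when validation does not raise."""
    try:
        validate_baseindexer_support(func_name)
    except NotImplementedError:
        return False
    return True


print(is_supported("std"), is_supported("var"), is_supported("skew"))
```

Note that a ``func_name`` of ``None`` passes the check, since the condition only applies to strings.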

pandas/tests/window/test_base_indexer.py

+45 -6
@@ -82,7 +82,7 @@ def get_window_bounds(self, num_values, min_periods, center, closed):
         df.rolling(indexer, win_type="boxcar")
 
 
-@pytest.mark.parametrize("func", ["std", "var", "count", "skew", "cov", "corr"])
+@pytest.mark.parametrize("func", ["count", "skew", "cov", "corr"])
 def test_notimplemented_functions(func):
     # GH 32865
     class CustomIndexer(BaseIndexer):
@@ -97,13 +97,52 @@ def get_window_bounds(self, num_values, min_periods, center, closed):
 
 @pytest.mark.parametrize("constructor", [Series, DataFrame])
 @pytest.mark.parametrize(
-    "func,alt_func,expected",
+    "func,np_func,expected,np_kwargs",
     [
-        ("min", np.min, [0.0, 1.0, 2.0, 3.0, 4.0, 6.0, 6.0, 7.0, 8.0, np.nan]),
-        ("max", np.max, [2.0, 3.0, 4.0, 100.0, 100.0, 100.0, 8.0, 9.0, 9.0, np.nan]),
+        ("min", np.min, [0.0, 1.0, 2.0, 3.0, 4.0, 6.0, 6.0, 7.0, 8.0, np.nan], {},),
+        (
+            "max",
+            np.max,
+            [2.0, 3.0, 4.0, 100.0, 100.0, 100.0, 8.0, 9.0, 9.0, np.nan],
+            {},
+        ),
+        (
+            "std",
+            np.std,
+            [
+                1.0,
+                1.0,
+                1.0,
+                55.71654452,
+                54.85739087,
+                53.9845657,
+                1.0,
+                1.0,
+                0.70710678,
+                np.nan,
+            ],
+            {"ddof": 1},
+        ),
+        (
+            "var",
+            np.var,
+            [
+                1.0,
+                1.0,
+                1.0,
+                3104.333333,
+                3009.333333,
+                2914.333333,
+                1.0,
+                1.0,
+                0.500000,
+                np.nan,
+            ],
+            {"ddof": 1},
+        ),
     ],
 )
-def test_rolling_forward_window(constructor, func, alt_func, expected):
+def test_rolling_forward_window(constructor, func, np_func, expected, np_kwargs):
     # GH 32865
     values = np.arange(10)
     values[5] = 100.0
@@ -124,5 +163,5 @@ def test_rolling_forward_window(constructor, func, alt_func, expected):
     result = getattr(rolling, func)()
     expected = constructor(expected)
     tm.assert_equal(result, expected)
-    expected2 = constructor(rolling.apply(lambda x: alt_func(x)))
+    expected2 = constructor(rolling.apply(lambda x: np_func(x, **np_kwargs)))
     tm.assert_equal(result, expected2)
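
The expected ``var`` and ``std`` values in the parametrization above can be checked by hand. The sketch below assumes a forward-looking window of three values and a minimum of two observations per window; those settings live in the part of the test that this diff does not show, so they are assumptions here. It uses only the standard library rather than pandas: ``statistics.variance`` is the sample variance, which corresponds to ``ddof=1``, and ``forward_sample_variance`` is a hypothetical helper name.

```python
import math
import statistics

# Same input as the test: 0..9 with position 5 replaced by 100.
values = [float(v) for v in range(10)]
values[5] = 100.0

WINDOW_SIZE = 3  # assumption: not visible in the diff
MIN_PERIODS = 2  # assumption: not visible in the diff


def forward_sample_variance(data, window_size, min_periods):
    """Sample variance over windows that start at each position and look forward."""
    result = []
    for start in range(len(data)):
        window = data[start:start + window_size]
        if len(window) < min_periods:
            result.append(math.nan)  # too few observations near the end
        else:
            result.append(statistics.variance(window))
    return result


variances = forward_sample_variance(values, WINDOW_SIZE, MIN_PERIODS)
std_devs = [math.sqrt(v) for v in variances]  # sqrt of NaN stays NaN

print([round(v, 6) for v in variances])
print([round(s, 8) for s in std_devs])
```

Under these assumptions the windows that contain the outlier ``100.0`` (starting at positions 3, 4 and 5) are the ones with large variances, and the final position yields NaN because only one value remains.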
