
BUG: low variance arrays' kurtosis wrongfully set to zero #58176


Closed
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -324,6 +324,7 @@ Bug fixes
- Fixed bug in :meth:`Series.diff` allowing non-integer values for the ``periods`` argument. (:issue:`56607`)
- Fixed bug in :meth:`Series.rank` that doesn't preserve missing values for nullable integers when ``na_option='keep'``. (:issue:`56976`)
- Fixed bug in :meth:`Series.replace` and :meth:`DataFrame.replace` inconsistently replacing matching instances when ``regex=True`` and missing values are present. (:issue:`56599`)
- Fixed bug in :meth:`Series.rolling.kurt` where low variance arrays were zeroed out even when numerically stable (:issue:`57972`)

Categorical
^^^^^^^^^^^
3 changes: 2 additions & 1 deletion pandas/_libs/window/aggregations.pyx
@@ -712,7 +712,8 @@ cdef float64_t calc_kurt(int64_t minp, int64_t nobs,
# if the variance is less than 1e-14, it could be
# treat as zero, here we follow the original
# skew/kurt behaviour to check B <= 1e-14
- if B <= 1e-14:
+ # #57972: for non-zero but low variance arrays the cutoff can be lowered
+ if B <= 1e-281:
Member

What is this constant? The minimum number of significant decimal digits that have guaranteed precision for a double is 15, which I assume is where 1e-14 came from.

Contributor Author

It is the variance of the observations. The e-14 cutoff is too conservative and also sets numerically stable results to NaN, e.g. when the mean of the observations is very low.

I did some more testing after your comment and setting the cutoff as low as in nanops (e-281) prevents the false positives, but it also lets numerically unstable results pass, so I reverted it. Schemes that take into account the mean etc. of the observations in the cutoff were also not really satisfactory.

I'll look into this and make another PR if I find a satisfactory solution for the equation in the .pyx here. Numerically it behaves very differently from the one in nanops.

Member

Since double precision is only guaranteed up to 15 significant decimal digits across implementations, choosing anything smaller than 1e-14 is not going to work.
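
For reference, the 15-digit figure can be read off the float64 parameters (a quick check with numpy):

import numpy as np

print(np.finfo(np.float64).eps)        # ~2.22e-16, relative spacing at 1.0
print(np.finfo(np.float64).precision)  # 15 significant decimal digits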

Contributor Author

The e-14 might make one think that it is about a float's number of significant digits, but B <= e-14 only checks for potential numerical instabilities irrespective of that.

E.g.,

n = 10_000
scale = 1e12
data = np.array([2.3000001*scale, 2.3*scale]*n)

This is numerically unstable, even though the variance is larger than e-14 (the rolling kurt comes out around e15).

On the other hand

n = 10_000
scale = 1e-15
data = np.array([2.4*scale, 2.3*scale]*n)

This is numerically stable, but the variance is smaller than e-14 (rolling kurt = 2.5).

In nanops the equations become unstable only around e-281 in my tests, but here it's more complex.

Example code:

import numpy as np
import pandas as pd
import scipy.stats as st

# numerically unstable case: large scale, tiny relative spread
n = 10_000
scale = 1e12
data = np.array([2.3000001*scale, 2.3*scale]*n)

pdkurt = pd.Series(data).kurt()
scipykurt = st.kurtosis(data, bias=False)
print(pdkurt)
print(scipykurt)
print(pd.Series(data).rolling(10).kurt())

# numerically stable case: tiny scale, variance below the 1e-14 cutoff
n = 10_000
scale = 1e-15
data = np.array([2.4*scale, 2.3*scale]*n)

pdkurt = pd.Series(data).kurt()
scipykurt = st.kurtosis(data, bias=False)
print(pdkurt)
print(scipykurt)
print(pd.Series(data).rolling(10).kurt())
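
A quick way to probe roughly where the bulk kurtosis and scipy start to drift apart is to sweep the scale (a sketch; the exact breakdown exponents depend on the platform and on the pandas version/cutoff):

import numpy as np
import pandas as pd
import scipy.stats as st

base = np.array([2.4, 2.3] * 10_000)
for exp in (-15, -72, -76, -81, -281):
    data = base * 10.0 ** exp
    # compare pandas' nankurt-backed Series.kurt against scipy at each scale
    print(exp, pd.Series(data).kurt(), st.kurtosis(data, bias=False))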

Member
@WillAyd Apr 12, 2024

Is there a whitepaper that lays out what you are trying to accomplish? The problem with your local results is that they depend on your hardware, and floating point implementations can vary.

I don't believe that a statement like # scipy.kurt is nan at e-81 is generally true (nan can be generated from quite a few different patterns, although technically platforms should be choosing one canonical pattern), and the e-72 and e-281 sentinels seem arbitrary.

Member

I also agree adjusting this arbitrary limit is not ideal. IMO we shouldn't have one in the first place. We removed a similar arbitrary limit for var and std a few releases ago in #40505.

I would support just removing this limit and documenting the floating point precision artifacts.
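
For reference, a minimal sketch (not part of this PR) of the kind of cancellation artifact the cutoff currently masks, and that documentation would need to cover if it were removed; the exact residue is platform dependent:

import numpy as np

# constant array: the true variance is exactly zero, so kurtosis is degenerate
data = np.full(1000, 1.0 / 3.0)
n = float(len(data))

# raw power sums, as accumulated by the rolling aggregation code
x = data.sum()
xx = (data ** 2).sum()

# B is the biased variance formed from raw sums; cancellation can leave a tiny
# nonzero residue (or exactly 0.0, depending on platform and array length)
A = x / n
B = xx / n - A * A
print(B)           # e.g. something on the order of 1e-17 instead of 0.0
print(B <= 1e-14)  # True means the current cutoff treats this as zero variance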

Contributor Author

Good points @WillAyd, @mroeschke.

I was setting the limit(s) to better reflect the actual stability range of the calculations, but any such limit is somewhat arbitrary because it depends on the inputs as well as the machine/platform.

For the nanops version I would agree with removing the check, since it was introduced before the equation there was stabilised. From my tests the kurt calculation there is about as stable as the scipy implementation and only becomes unstable in very extreme cases.

From my tests, the kurt implementation in aggregations.pyx here seems comparatively much more unstable. The equations involved have a lot of potential cancellations. I would suggest first stabilising the equations there; that's something I could do.
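
As a rough illustration of what stabilising the equations could look like (a sketch of the standard bias-corrected formula on mean-centred data, not the PR's code, and without the rolling-window update logic):

import numpy as np

def stable_kurt(a):
    # two-pass, mean-centred, bias-corrected excess kurtosis; centring before
    # forming the moments avoids most of the raw power-sum cancellation
    a = np.asarray(a, dtype=np.float64)
    n = a.size
    if n < 4:
        return np.nan
    d = a - a.mean()
    m2 = np.mean(d ** 2)
    m4 = np.mean(d ** 4)
    if m2 == 0:
        return np.nan
    return ((n + 1) * (m4 / (m2 * m2)) - 3 * (n - 1)) * (n - 1) / ((n - 2) * (n - 3))

# should closely match scipy.stats.kurtosis(data, bias=False) on the
# low variance example from above
data = np.array([2.4e-15, 2.3e-15] * 10_000)
print(stable_kurt(data))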

result = NaN
else:
K = (dnobs * dnobs - 1.) * D / (B * B) - 3 * ((dnobs - 1.) ** 2)
16 changes: 10 additions & 6 deletions pandas/core/nanops.py
@@ -1357,9 +1357,13 @@ def nankurt(
# floating point error
#
# #18044 in _libs/windows.pyx calc_kurt follow this behavior
- # to fix the fperr to treat denom <1e-14 as zero
- numerator = _zero_out_fperr(numerator)
- denominator = _zero_out_fperr(denominator)
+ # to fix the fperr to treat denom <1e-14 as zero (default cutoff)
+ # GH-57972: lower the cutoff for low variance arrays to avoid zeroing out
+ # otherwise numerically stable values. scipy.stats.kurtosis and this
+ # implementation start diverging for examples with cutoffs below e-281
+ cutoff = 1e-281
+ numerator = _zero_out_fperr(numerator, cutoff)
+ denominator = _zero_out_fperr(denominator, cutoff)

if not isinstance(denominator, np.ndarray):
# if ``denom`` is a scalar, check these corner cases first before
@@ -1576,12 +1580,12 @@ def check_below_min_count(
return False


- def _zero_out_fperr(arg):
+ def _zero_out_fperr(arg, cutoff=1e-14):
# #18044 reference this behavior to fix rolling skew/kurt issue
if isinstance(arg, np.ndarray):
- return np.where(np.abs(arg) < 1e-14, 0, arg)
+ return np.where(np.abs(arg) < cutoff, 0, arg)
else:
- return arg.dtype.type(0) if np.abs(arg) < 1e-14 else arg
+ return arg.dtype.type(0) if np.abs(arg) < cutoff else arg


@disallow("M8", "m8")
15 changes: 15 additions & 0 deletions pandas/tests/test_nanops.py
@@ -1105,6 +1105,21 @@ def test_nans_skipna(self, samples, actual_kurt):
kurt = nanops.nankurt(samples, skipna=True)
tm.assert_almost_equal(kurt, actual_kurt)

def test_arrays_with_low_variance(self):
# GH-57972
# sample arrays with low variance have a lower threshold for breakdown
# of numerical stability and should be handled accordingly
n = 10_000
n2 = 10
# scipy.kurt is nan at e-81,
# both kurtosis start diverging from each other around e-76
scale = 1e-72
low_var = np.array([-2.3 * scale] * n2 + [-4.1 * scale] * n2 + [0.0] * n)
# calculated with scipy.stats.kurtosis(low_var, bias=False)
scipy_kurt = 632.556235239126
kurt = nanops.nankurt(low_var)
tm.assert_almost_equal(kurt, scipy_kurt)

@property
def prng(self):
return np.random.default_rng(2)