BUG: rolling std can't handle mixture of (relatively) big and small numbers #42064


Closed
2 of 3 tasks
auderson opened this issue Jun 17, 2021 · 9 comments · Fixed by #46431

Labels
Bug Closing Candidate May be closeable, needs more eyeballs Window rolling, ewma, expanding

Comments

@auderson
Contributor

auderson commented Jun 17, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


import pandas as pd
import numpy as np

df = pd.DataFrame(np.zeros(1000))
df.iloc[0] = 1000
df.rolling(10).std()

Out[124]: 
            0
0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
..        ...
995  0.000004
996  0.000004
997  0.000004
998  0.000004
999  0.000004
[1000 rows x 1 columns]

Problem description

If we have a relatively big number at the top and the remaining values are zeros, the rolling result is not zero.

Expected Output

df.iloc[0] = 0
df.rolling(10).std()

Out[125]: 
       0
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
..   ...
995  0.0
996  0.0
997  0.0
998  0.0
999  0.0
[1000 rows x 1 columns]

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-53-generic
Version : #60-Ubuntu SMP Thu May 6 07:46:32 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : None
pytest : 6.2.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : 0.9.3
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.0
IPython : 7.23.1
pandas_datareader: 0.9.0
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 4.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.15
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.53.1

@auderson auderson added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 17, 2021
@attack68
Contributor

attack68 commented Jun 17, 2021

confirmed...

ser = pd.Series(np.zeros(5))
ser.iloc[0] = 1000
ser.rolling(3).std()
0           NaN
1           NaN
2    577.350269
3      0.000008
4      0.000008
dtype: float64

This is actually a result of the rolling.std method using the roll_var method, which returns

[nan nan 3.333e+05 5.821e-11 5.821e-11]

and then taking the numpy square root,

[nan nan 5.774e+02 7.629e-06 7.692e-06]

which is what we see in the result.

This is to do with computer precision and the implementation of roll_var, whose docstring reads:
""" Numerically stable implementation using Welford's method. """. So maybe Welford can be improved here? :)

@attack68 attack68 added Window rolling, ewma, expanding and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 17, 2021
@jreback
Contributor

jreback commented Jun 17, 2021

this was fixed for 1.3

@attack68
Contributor

@jreback I got my results, with the same problem, on 1.4.0.dev0+41.gd67f6a678c.dirty

@mroeschke
Member

Note: this is called out in the user guide in the warning block: https://pandas.pydata.org/docs/user_guide/window.html#overview

And even though 1.3 does have a change in rolling.std/var, we note that there will be numerical imprecision as well: https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.3.0.html#removed-artificial-truncation-in-rolling-variance-and-standard-deviation

There will likely be no definitive fix for this, given the algorithm we're using to calculate rolling aggregations.
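
If exact zeros matter in the meantime, one workaround (a sketch, not an officially supported recipe) is to recompute each window from scratch with rolling.apply, which does a two-pass calculation per window and carries no state across windows, at a substantial performance cost:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros(1000))
df.iloc[0] = 1000

# Each window is computed independently, so an all-equal window yields
# exactly 0.0 -- but this is much slower than the streaming Cython kernel.
exact = df.rolling(10).apply(lambda x: x.std(ddof=1), raw=True)
print(exact.tail())  # all exactly 0.0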

@mroeschke mroeschke added the Closing Candidate May be closeable, needs more eyeballs label Jun 17, 2021
@STOpandthink

Just ran into this bug as well. Kind of nasty, given that you'd totally expect var(same values) to be 0.0 exactly.

@mroeschke
Member

Thanks for the report, but since this is an implementation detail related to our algorithm (and it's been documented), I think this issue can be closed.

@auderson
Contributor Author

@mroeschke
Hi! I came across a way to fix this today and I'll explain it below.
If you think this is helpful enough, I'm willing to make a PR.

Code

The code below is a numba version of rolling var, mostly copied from the existing Cython roll_var implementation (reproduced here for reference):

# Rolling variance

cdef inline float64_t calc_var(int64_t minp, int ddof, float64_t nobs,
                               float64_t ssqdm_x) nogil:
    cdef:
        float64_t result

    # Variance is unchanged if no observation is added or removed
    if (nobs >= minp) and (nobs > ddof):

        # pathological case
        if nobs == 1:
            result = 0
        else:
            result = ssqdm_x / (nobs - <float64_t>ddof)
    else:
        result = NaN

    return result


cdef inline void add_var(float64_t val, float64_t *nobs, float64_t *mean_x,
                         float64_t *ssqdm_x, float64_t *compensation) nogil:
    """ add a value from the var calc """
    cdef:
        float64_t delta, prev_mean, y, t

    # GH#21813, if msvc 2017 bug is resolved, we should be OK with != instead of `isnan`
    if val != val:
        return

    nobs[0] = nobs[0] + 1

    # Welford's method for the online variance-calculation
    # using Kahan summation
    # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
    prev_mean = mean_x[0] - compensation[0]
    y = val - compensation[0]
    t = y - mean_x[0]
    compensation[0] = t + mean_x[0] - y
    delta = t
    if nobs[0]:
        mean_x[0] = mean_x[0] + delta / nobs[0]
    else:
        mean_x[0] = 0
    ssqdm_x[0] = ssqdm_x[0] + (val - prev_mean) * (val - mean_x[0])


cdef inline void remove_var(float64_t val, float64_t *nobs, float64_t *mean_x,
                            float64_t *ssqdm_x, float64_t *compensation) nogil:
    """ remove a value from the var calc """
    cdef:
        float64_t delta, prev_mean, y, t
    if val == val:
        nobs[0] = nobs[0] - 1
        if nobs[0]:
            # Welford's method for the online variance-calculation
            # using Kahan summation
            # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
            prev_mean = mean_x[0] - compensation[0]
            y = val - compensation[0]
            t = y - mean_x[0]
            compensation[0] = t + mean_x[0] - y
            delta = t
            mean_x[0] = mean_x[0] - delta / nobs[0]
            ssqdm_x[0] = ssqdm_x[0] - (val - prev_mean) * (val - mean_x[0])
        else:
            mean_x[0] = 0
            ssqdm_x[0] = 0


def roll_var(const float64_t[:] values, ndarray[int64_t] start,
             ndarray[int64_t] end, int64_t minp, int ddof=1) -> np.ndarray:
    """
    Numerically stable implementation using Welford's method.
    """
    cdef:
        float64_t mean_x, ssqdm_x, nobs, compensation_add,
        float64_t compensation_remove,
        float64_t val, prev, delta, mean_x_old
        int64_t s, e
        Py_ssize_t i, j, N = len(start)
        ndarray[float64_t] output
        bint is_monotonic_increasing_bounds

    minp = max(minp, 1)
    is_monotonic_increasing_bounds = is_monotonic_increasing_start_end_bounds(
        start, end
    )
    output = np.empty(N, dtype=np.float64)

    with nogil:
        for i in range(0, N):
            s = start[i]
            e = end[i]

            # Over the first window, observations can only be added
            # never removed
            if i == 0 or not is_monotonic_increasing_bounds or s >= end[i - 1]:
                mean_x = ssqdm_x = nobs = compensation_add = compensation_remove = 0
                for j in range(s, e):
                    add_var(values[j], &nobs, &mean_x, &ssqdm_x, &compensation_add)
            else:
                # After the first window, observations can both be added
                # and removed

                # calculate deletes
                for j in range(start[i - 1], s):
                    remove_var(values[j], &nobs, &mean_x, &ssqdm_x,
                               &compensation_remove)

                # calculate adds
                for j in range(end[i - 1], e):
                    add_var(values[j], &nobs, &mean_x, &ssqdm_x, &compensation_add)

            output[i] = calc_var(minp, ddof, nobs, ssqdm_x)

            if not is_monotonic_increasing_bounds:
                nobs = 0.0
                mean_x = 0.0
                ssqdm_x = 0.0
                compensation_remove = 0.0

    return output

except that I modified it to work on 2-d inputs:

import timeit
import numpy as np
import pandas as pd
from numba import njit

@njit(nogil=True, inline='always')
def add_var_old(row, nobs, mean_x,
                ssqdm_x, compensation):
    """ add a value from the var calc """
    # cdef:
    #     float64_t delta, prev_mean, y, t
    for j, val in enumerate(row):
        # GH#21813, if msvc 2017 bug is resolved, we should be OK with != instead of `isnan`
        if val != val:
            continue

        nobs[j] += 1

        # Welford's method for the online variance-calculation
        # using Kahan summation
        # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
        prev_mean = mean_x[j] - compensation[j]
        y = val - compensation[j]
        t = y - mean_x[j]
        compensation[j] = t + mean_x[j] - y
        delta = t
        if nobs[j]:
            mean_x[j] = mean_x[j] + delta / nobs[j]
        else:
            mean_x[j] = 0
        ssqdm_x[j] = ssqdm_x[j] + (val - prev_mean) * (val - mean_x[j])


@njit(nogil=True, inline='always')
def calc_var_old(minp, ddof, nobs, ssqdm_x):
    # cdef:
    #     float64_t result

    result = np.empty(len(nobs))

    for i, (_nobs, _ssqdm_x) in enumerate(zip(nobs, ssqdm_x)):
        if (_nobs >= minp) and (_nobs > ddof):
            # pathological case
            if _nobs == 1:
                result[i] = 0
            else:
                result[i] = _ssqdm_x / (_nobs - ddof)
        else:
            result[i] = np.nan

    return result


@njit(nogil=True)
def roll_var_old(values, wind, minp, ddof=1) -> np.ndarray:
    """
    Numerically stable implementation using Welford's method.
    """
    # cdef:
    #     float64_t mean_x, ssqdm_x, nobs, compensation_add,
    #     float64_t compensation_remove,
    #     float64_t val, prev, delta, mean_x_old
    #     int64_t s, e
    #     Py_ssize_t i, j, N = len(start)
    #     ndarray[float64_t] output
    #     bint is_monotonic_increasing_bounds

    minp = max(minp, 1)
    # is_monotonic_increasing_bounds = is_monotonic_increasing_start_end_bounds(
    #     start, end
    # )
    n, m = values.shape
    output = np.empty((n, m), dtype=np.float64)

    mean_x = np.zeros(m)
    ssqdm_x = np.zeros(m)
    nobs = np.zeros(m)
    compensation_add = np.zeros(m)
    compensation_remove = np.zeros(m)

    for i in range(n - wind + 1):

        s = i
        e = wind + s

        # Over the first window, observations can only be added
        # never removed
        if i == 0:
            for j in range(s, e):
                add_var_old(values[j], nobs, mean_x, ssqdm_x, compensation_add)
                output[j] = np.nan
        else:

            # After the first window, observations can both be added
            # and removed

            # calculate deletes
            remove_var(values[s - 1], nobs, mean_x, ssqdm_x, compensation_remove)

            # calculate adds
            add_var_old(values[e - 1], nobs, mean_x, ssqdm_x, compensation_add)

        output[e - 1] = calc_var_old(minp, ddof, nobs, ssqdm_x)

    return output


@njit(nogil=True, inline='always')
def add_var_new(row, nobs, mean_x,
                ssqdm_x, compensation, n_same_value, prev_value):
    """ add a value from the var calc """
    # cdef:
    #     float64_t delta, prev_mean, y, t
    for j, val in enumerate(row):
        # GH#21813, if msvc 2017 bug is resolved, we should be OK with != instead of `isnan`
        if val != val:
            continue

        nobs[j] += 1

        if val == prev_value[j]:
            n_same_value[j] += 1  # incr count
        else:
            n_same_value[j] = 1  # if not same, set count to 1 (include itself)

        prev_value[j] = val  # store prev value

        # Welford's method for the online variance-calculation
        # using Kahan summation
        # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
        prev_mean = mean_x[j] - compensation[j]
        y = val - compensation[j]
        t = y - mean_x[j]
        compensation[j] = t + mean_x[j] - y
        delta = t
        if nobs[j]:
            mean_x[j] = mean_x[j] + delta / nobs[j]
        else:
            mean_x[j] = 0
        ssqdm_x[j] = ssqdm_x[j] + (val - prev_mean) * (val - mean_x[j])

# no change here
@njit(nogil=True, inline='always')
def remove_var(row, nobs, mean_x,
               ssqdm_x, compensation):
    """ remove a value from the var calc """
    # cdef:
    #     float64_t delta, prev_mean, y, t
    for j, val in enumerate(row):
        if val == val:
            nobs[j] -= 1
            if nobs[j]:
                # Welford's method for the online variance-calculation
                # using Kahan summation
                # https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
                prev_mean = mean_x[j] - compensation[j]
                y = val - compensation[j]
                t = y - mean_x[j]
                compensation[j] = t + mean_x[j] - y
                delta = t
                mean_x[j] = mean_x[j] - delta / nobs[j]
                ssqdm_x[j] = ssqdm_x[j] - (val - prev_mean) * (val - mean_x[j])
            else:
                mean_x[j] = 0
                ssqdm_x[j] = 0


@njit(nogil=True, inline='always')
def calc_var_new(minp, ddof, nobs,
                 ssqdm_x, n_same_value):
    # cdef:
    #     float64_t result

    result = np.empty(len(nobs))

    for i, (_nobs, _ssqdm_x, _nsv) in enumerate(zip(nobs, ssqdm_x, n_same_value)):
        if (_nobs >= minp) and (_nobs > ddof):
            # pathological case
            if _nobs == 1 or _nsv >= _nobs:  # if num same value >= num obs
                result[i] = 0
            else:
                result[i] = _ssqdm_x / (_nobs - ddof)
        else:
            result[i] = np.nan

    return result


@njit(nogil=True)
def roll_var_new(values, wind, minp, ddof=1) -> np.ndarray:
    """
    Numerically stable implementation using Welford's method.
    """
    # cdef:
    #     float64_t mean_x, ssqdm_x, nobs, compensation_add,
    #     float64_t compensation_remove,
    #     float64_t val, prev, delta, mean_x_old
    #     int64_t s, e
    #     Py_ssize_t i, j, N = len(start)
    #     ndarray[float64_t] output
    #     bint is_monotonic_increasing_bounds

    minp = max(minp, 1)
    # is_monotonic_increasing_bounds = is_monotonic_increasing_start_end_bounds(
    #     start, end
    # )
    n, m = values.shape
    output = np.empty((n, m), dtype=np.float64)

    mean_x = np.zeros(m)
    ssqdm_x = np.zeros(m)
    nobs = np.zeros(m)
    compensation_add = np.zeros(m)
    compensation_remove = np.zeros(m)
    n_same_value = np.zeros(m, dtype=np.int64)
    prev_value = values[0].copy()  # record first value

    for i in range(n - wind + 1):

        s = i
        e = wind + s

        # Over the first window, observations can only be added
        # never removed
        if i == 0:
            for j in range(s, e):
                add_var_new(values[j], nobs, mean_x, ssqdm_x, compensation_add, n_same_value, prev_value)
                output[j] = np.nan
        else:

            # After the first window, observations can both be added
            # and removed

            # calculate deletes
            remove_var(values[s - 1], nobs, mean_x, ssqdm_x, compensation_remove)

            # calculate adds
            add_var_new(values[e - 1], nobs, mean_x, ssqdm_x, compensation_add, n_same_value, prev_value)

        output[e - 1] = calc_var_new(minp, ddof, nobs, ssqdm_x, n_same_value)

    return output

How it works

I record the previous value in add_var_new and count the number of consecutive identical values (n_same_value). If n_same_value >= nobs, i.e. every observation in the current window is the same value, I set the result to zero in calc_var_new.
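
As a quick sanity check (a sketch, assuming the roll_var_new defined above): on random data, where the same-value shortcut never fires, the new version still agrees with a brute-force two-pass rolling variance:

# Sanity check (sketch): on random data the shortcut never triggers,
# so roll_var_new must match a brute-force two-pass rolling variance.
rng = np.random.default_rng(0)
arr = rng.standard_normal((200, 3))

got = roll_var_new(arr, 10, 10, 1)
for i in range(9, 200):
    assert np.allclose(got[i], arr[i - 9:i + 1].var(axis=0, ddof=1))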

Some examples

First, let's take a df with a big number at the top:

df = pd.DataFrame(np.zeros(10))
df.iloc[0] = 1000
df.rolling(5).var()
Out[32]: 
              0
0           NaN
1           NaN
2           NaN
3           NaN
4  2.000000e+05
5  2.910383e-11
6  2.910383e-11
7  2.910383e-11
8  2.910383e-11
9  2.910383e-11

It's the same as the numba version:

roll_var_old(df.values, 5, 5, 1)
Out[39]: 
array([[           nan],
       [           nan],
       [           nan],
       [           nan],
       [2.00000000e+05],
       [2.91038305e-11],
       [2.91038305e-11],
       [2.91038305e-11],
       [2.91038305e-11],
       [2.91038305e-11]])

My version:

roll_var_new(df.values, 5, 5, 1)
Out[40]: 
array([[    nan],
       [    nan],
       [    nan],
       [    nan],
       [200000.],
       [     0.],
       [     0.],
       [     0.],
       [     0.],
       [     0.]])

Performance impact

array = np.random.randn(1000, 10)

times_old = timeit.repeat('roll_var_old(array, 10, 10, 1)', number=1000, globals=globals())
np.sum(times_old)
# 0.1767864041030407

times_new = timeit.repeat('roll_var_new(array, 10, 10, 1)', number=1000, globals=globals())
np.sum(times_new)
# 0.20479978621006012

@mroeschke
Member

So the 2D input is the reason why there are no floating point artifacts? I can't easily discern what the old vs new difference is based on all the code.

But generally, if different processing removes these floating point artifacts (and passes all the tests), IMO a performance hit is okay.

@auderson
Contributor Author

auderson commented Mar 19, 2022

So in short, the new algorithm records and counts how many consecutive identical values have appeared:

# in `add_var`
if val == prev_value[0] and val != MAXfloat64 and val != MINfloat64:
    n_same_value[0] += 1  # incr count
else:
    n_same_value[0] = 1  # if not same, reset count to 1 (include itself)

prev_value[0] = val  # store prev value

If all values in the window are the same, we use 0 as the result instead of _ssqdm_x / (_nobs - ddof):

# in `calc_var`
if (nobs >= minp) and (nobs > ddof):

    # pathological case & repeatedly same values case
    if nobs == 1 or num_consecutive_same_value >= nobs:
        result = 0
    else:
        result = ssqdm_x / (nobs - <float64_t>ddof)
else:
    result = NaN

The floating point artifacts come from _ssqdm_x / (_nobs - ddof); if we know the values are actually all the same, I think it's safe to use 0 as the result (instead of the 2.91038305e-11 from the example).
