First, thank you all for creating and maintaining this package, and for the obvious care that has gone into the numerical analysis of the rolling methods.
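To make the report self-contained, here is a minimal sketch of the kind of input that triggers the problem (the timestamps and values below are illustrative stand-ins, not my original data): a few points seconds apart, then a final point a full day later, so the last 1-minute window holds exactly one observation.

```python
import pandas as pd

# A few points seconds apart, then a final point one day later, so the
# last 1-minute window contains only that single observation.
idx = pd.to_datetime([
    "2020-01-01 00:00:00",
    "2020-01-01 00:00:15",
    "2020-01-01 00:00:30",
    "2020-01-02 00:00:30",  # jumps ahead by one day
])
s = pd.Series([0.1, 0.2, 0.3, 1.0], index=idx)

print(s.rolling("1min").mean())
# The final mean should be exactly 1.0, but it can come out perturbed in
# the low-order digits by rounding noise left over from earlier windows.
```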
In the example above I show a rolling mean over a 1-minute window, where the final point jumps ahead by one day. Since this is the only point in the final window, I expect the mean there to be exactly the single value in the window. You can see, however, that there is some error in the low-order digits displayed.
I believe this is due to an omission in the function roll_mean_variable, which I am reading in pandas/_libs/window/aggregations.pyx: the function should zero the running sum whenever the number of observations drops to 0 after a "remove" pass. Otherwise, noise from previous periods is left behind in the running sum.
If I apply this patch to my local copy then I get the exact mean.
diff --git a/pandas/_libs/window/aggregations.pyx b/pandas/_libs/window/aggregations.pyx
index 495b436..76ba1dd 100644
--- a/pandas/_libs/window/aggregations.pyx
+++ b/pandas/_libs/window/aggregations.pyx
@@ -344,20 +344,24 @@ def roll_mean_variable(ndarray[float64_t] values, ndarray[int64_t] start,
                     val = values[j]
                     add_mean(val, &nobs, &sum_x, &neg_ct)
 
             else:
 
                 # calculate deletes
                 for j in range(start[i - 1], s):
                     val = values[j]
                     remove_mean(val, &nobs, &sum_x, &neg_ct)
 
+                # Reset sum if nobs goes to 0
+                if nobs == 0:
+                    sum_x = 0
+
                 # calculate adds
                 for j in range(end[i - 1], e):
                     val = values[j]
                     add_mean(val, &nobs, &sum_x, &neg_ct)
 
             output[i] = calc_mean(minp, nobs, neg_ct, sum_x)
 
             if not is_monotonic_bounds:
                 for j in range(s, e):
                     val = values[j]
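To make the failure mode concrete, here is a pure-Python sketch (not pandas code) of the residue a never-reset running sum can retain. The magnitudes are deliberately extreme to force the effect; in practice the residue sits in the low-order bits, as in the report above.

```python
# Adding values to a running sum and then removing them all does not
# necessarily return the sum to exactly 0.0.
sum_x = 0.0
sum_x += 1e16
sum_x += 1.0        # below the rounding granularity near 1e16, so it is lost
sum_x -= 1e16
sum_x -= 1.0
print(sum_x)        # -1.0, not 0.0

# A later single-point window then computes (sum_x + x) / 1 rather than x:
x = 5.0
print((sum_x + x) / 1)  # 4.0 instead of 5.0
```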
I understand that there is a philosophical question here: given that most rolling means are likely not bit-accurate anyway, is it worth checking for this special case? One argument in favor is that for time series like mine, every gap would clear the accumulated error, so the overall accuracy of the procedure would improve.
I notice that the rolling variance calculations do reset the accumulators to 0 when a window is completely depopulated. I am curious, though: is there a numerical-analysis reason for doing the adds before the removes there? I believe that ordering nullifies the resetting effect of clearing the accumulators, because with adds first a window never actually becomes empty. This is the code I'm referring to:
pandas/_libs/window/aggregations.pyx, line 520 (at commit 5da500a)
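To illustrate what I mean, here is a small sketch with made-up window bounds (a gap makes consecutive windows disjoint) showing how the two orderings differ in whether nobs ever reaches zero:

```python
# Window i-1 covers indices [0, 3); window i covers [3, 4): no overlap.
prev_start, prev_end = 0, 3
s, e = 3, 4

# Removes-first (as in roll_mean_variable): nobs touches 0 between the
# two passes, which is exactly when the accumulators can be reset.
nobs = prev_end - prev_start   # 3 points in the old window
nobs -= s - prev_start         # remove 3 -> nobs == 0 here
nobs += e - prev_end           # add 1   -> nobs == 1

# Adds-first (as in the variance code): nobs never reaches 0, so a reset
# keyed on nobs == 0 never fires even though the contents were replaced.
nobs = prev_end - prev_start   # 3
nobs += e - prev_end           # add first -> 4
nobs -= s - prev_start         # remove    -> 1
```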
Thank you,
Bishop Brock
Expected Output
I expect the rolling mean of any time-based window that contains a single data point to be exactly that data point.
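Expressed as a check against the illustrative series s from the sketch at the top of this report, the expectation is:

```python
result = s.rolling("1min").mean()
# The last window holds only the last point, so its mean should equal that
# point bitwise; this is expected to fail before the patch and pass after.
assert result.iloc[-1] == s.iloc[-1]
```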
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.2+0.g7485dbe6f.dirty
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.2.1
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1
numba : 0.45.1