Pandas df.rolling.mean() abnormal behavior for a Series having larger numbers (in the scale of billions) #28305

bobnvk99 · 2019-09-05T21:39:19Z

Code Sample, a copy-pastable example if possible

#Sample residual pcts
residuals_pct =[0.044516001,0.031137117,1.06758E+64,0.003522454,0.065171486,0.06033751,0.01325514,-0.005620799,-0.006225719,0.045713825,0.039280786,0.000531307]

#Creating a dataframe with residual percent
d = pd.DataFrame(residuals, columns=['residual_pct'])
#Shifting the pct 1 row down.
d['residual_pct_shift'] = d.residual_pct.shift(1)
#Calculating the adjusted residual pct.
d['adj_residual_pct'] = d['residual_pct_shift'][3:].rolling(window=3).mean()

Problem description

I am trying to implement adjusted residual percent as the rolling average of the previous three residual percents.
Due to some data issue, model forecasted the target variable in the scale of billions, consequently effecting the residual percentage calculation.
However, this is irrelevant to the current problem, the issue is when doing the rolling average using .....rolling(window=3).mean(), the average for the rows after the abnormal residual percent got affected.
If I do take those same numbers and do the normal way of averaging or use MS Excel, I got it right.

Please see the image below:

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit : None python : 3.7.3.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.4
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : 0.29.13
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.7
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8

Is there anything that I am missing or unaware. I tried to check in the pandas' source code and found nothing. Irrespective of the scale of the numbers, excel and the normal way of averaging works fine.

Please advise.
I greatly appreciate your help.

Thank you.

The text was updated successfully, but these errors were encountered:

bobnvk99 · 2019-09-05T21:44:22Z

Actually, similar behavior is observed with pandas 0.24 version.

WillAyd · 2019-09-05T21:46:17Z

Hmm not sure I follow. Can you provide self-contained code samples to reproduce? Unfortunately screenshots are not very helpful

dsaxton · 2019-09-05T23:04:55Z

Could be the same problem pointed out in this issue: #28244

bobnvk99 · 2019-09-06T17:44:52Z

I think I added the lines that can reproduce this behavior. Not sure if it is missing.

Here it is:
#Sample residual pcts
residuals_pct =[0.044516001,0.031137117,1.06758E+64,0.003522454,0.065171486,0.06033751,0.01325514,-0.005620799,-0.006225719,0.045713825,0.039280786,0.000531307]

#Creating a dataframe with residual percent
d = pd.DataFrame(residuals, columns=['residual_pct'])
#Shifting the pct 1 row down.
d['residual_pct_shift'] = d.residual_pct.shift(1)
#Calculating the adjusted residual pct.
d['adj_residual_pct'] = d['residual_pct_shift'][3:].rolling(window=3).mean()

bobnvk99 · 2019-09-09T15:52:35Z

Thank you @dsaxton
I understand that I should be using rolling.apply() to achieve the expected behavior; tested it and is working fine.
Hopefully, this rolling().mean() issue will be resolved in the next release.

simonjayhawkins · 2020-04-01T14:02:05Z

I think I added the lines that can reproduce this behavior. Not sure if it is missing.

@bobnvk99 Thanks for the report. closing as this doesn't look actionable. We would need a minimal reproducible example to help debug the issue. see https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

the code sample is not executable, raises NameError: name 'residuals' is not defined.

WillAyd added the Needs Info Clarification about behavior needed to assess issue label Sep 5, 2019

simonjayhawkins closed this as completed Apr 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pandas df.rolling.mean() abnormal behavior for a Series having larger numbers (in the scale of billions) #28305

Pandas df.rolling.mean() abnormal behavior for a Series having larger numbers (in the scale of billions) #28305

bobnvk99 commented Sep 5, 2019

bobnvk99 commented Sep 5, 2019

WillAyd commented Sep 5, 2019

dsaxton commented Sep 5, 2019

bobnvk99 commented Sep 6, 2019

bobnvk99 commented Sep 9, 2019

simonjayhawkins commented Apr 1, 2020

Pandas df.rolling.mean() abnormal behavior for a Series having larger numbers (in the scale of billions) #28305

Pandas df.rolling.mean() abnormal behavior for a Series having larger numbers (in the scale of billions) #28305

Comments

bobnvk99 commented Sep 5, 2019

Code Sample, a copy-pastable example if possible

Problem description

Output of pd.show_versions()

bobnvk99 commented Sep 5, 2019

WillAyd commented Sep 5, 2019

dsaxton commented Sep 5, 2019

bobnvk99 commented Sep 6, 2019

bobnvk99 commented Sep 9, 2019

simonjayhawkins commented Apr 1, 2020

Output of `pd.show_versions()`