Skip to content

Pandas df.rolling.mean() abnormal behavior for a Series having larger numbers (in the scale of billions) #28305

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bobnvk99 opened this issue Sep 5, 2019 · 6 comments
Labels
Needs Info Clarification about behavior needed to assess issue

Comments

@bobnvk99
Copy link

bobnvk99 commented Sep 5, 2019

Code Sample, a copy-pastable example if possible

#Sample residual pcts
residuals_pct =[0.044516001,0.031137117,1.06758E+64,0.003522454,0.065171486,0.06033751,0.01325514,-0.005620799,-0.006225719,0.045713825,0.039280786,0.000531307]

#Creating a dataframe with residual percent
d = pd.DataFrame(residuals, columns=['residual_pct'])
#Shifting the pct 1 row down.
d['residual_pct_shift'] = d.residual_pct.shift(1)
#Calculating the adjusted residual pct.
d['adj_residual_pct'] = d['residual_pct_shift'][3:].rolling(window=3).mean()

Problem description

I am trying to implement adjusted residual percent as the rolling average of the previous three residual percents.
Due to some data issue, model forecasted the target variable in the scale of billions, consequently effecting the residual percentage calculation.
However, this is irrelevant to the current problem, the issue is when doing the rolling average using .....rolling(window=3).mean(), the average for the rows after the abnormal residual percent got affected.
If I do take those same numbers and do the normal way of averaging or use MS Excel, I got it right.

Please see the image below:
pandas_rolling_mean_issue

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.7.3.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.4
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : 0.29.13
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.7
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8

Is there anything that I am missing or unaware. I tried to check in the pandas' source code and found nothing. Irrespective of the scale of the numbers, excel and the normal way of averaging works fine.

Please advise.
I greatly appreciate your help.

Thank you.

@bobnvk99
Copy link
Author

bobnvk99 commented Sep 5, 2019

Actually, similar behavior is observed with pandas 0.24 version.

@WillAyd
Copy link
Member

WillAyd commented Sep 5, 2019

Hmm not sure I follow. Can you provide self-contained code samples to reproduce? Unfortunately screenshots are not very helpful

@WillAyd WillAyd added the Needs Info Clarification about behavior needed to assess issue label Sep 5, 2019
@dsaxton
Copy link
Member

dsaxton commented Sep 5, 2019

Could be the same problem pointed out in this issue: #28244

@bobnvk99
Copy link
Author

bobnvk99 commented Sep 6, 2019

I think I added the lines that can reproduce this behavior. Not sure if it is missing.

Here it is:
#Sample residual pcts
residuals_pct =[0.044516001,0.031137117,1.06758E+64,0.003522454,0.065171486,0.06033751,0.01325514,-0.005620799,-0.006225719,0.045713825,0.039280786,0.000531307]

#Creating a dataframe with residual percent
d = pd.DataFrame(residuals, columns=['residual_pct'])
#Shifting the pct 1 row down.
d['residual_pct_shift'] = d.residual_pct.shift(1)
#Calculating the adjusted residual pct.
d['adj_residual_pct'] = d['residual_pct_shift'][3:].rolling(window=3).mean()

@bobnvk99
Copy link
Author

bobnvk99 commented Sep 9, 2019

Thank you @dsaxton
I understand that I should be using rolling.apply() to achieve the expected behavior; tested it and is working fine.
Hopefully, this rolling().mean() issue will be resolved in the next release.

@simonjayhawkins
Copy link
Member

I think I added the lines that can reproduce this behavior. Not sure if it is missing.

@bobnvk99 Thanks for the report. closing as this doesn't look actionable. We would need a minimal reproducible example to help debug the issue. see https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

the code sample is not executable, raises NameError: name 'residuals' is not defined.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Info Clarification about behavior needed to assess issue
Projects
None yet
Development

No branches or pull requests

4 participants