Skip to content

Bug in df.pct_change() #6389

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
JazzFan opened this issue Feb 17, 2014 · 7 comments
Closed

Bug in df.pct_change() #6389

JazzFan opened this issue Feb 17, 2014 · 7 comments

Comments

@JazzFan
Copy link

JazzFan commented Feb 17, 2014

df.pct_change incorrectly calculates a value when it shouldn't.

#!/usr/bin/env python

import numpy as np
import pandas as pd
from datetime import datetime


"""
Notice that the two nonmissing data points are 8 days apart, not 7.
Therefore, any attempt to calculate a 7-day (week-over-week) percent change
should result in NaN, not an actual numeric value.
The bug is that pct_change does calculate an (erroneous) actual numeric value.
(The value it calculates is 414.53/430.30 - 1 = -0.036649.)
"""
idx = pd.DatetimeIndex(start=datetime(2013, 9, 12),
                       end=datetime(2013, 9, 21),
                       freq='D')
df = pd.DataFrame(data={'Series': [430.30,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   414.53,
                                   np.NaN]}, index=idx)
pd.show_versions()
print('\n\nInput data:')
print(df)
print("\n\nThe bug is the pct_change for 09-20 should be NaN, but it isn't:\n")
print(df.pct_change(periods=7))

Here is the output:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Darwin
OS-release: 12.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.19.2
numpy: 1.8.0
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 1.1.0
sphinx: 1.1.3
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2013b
bottleneck: None
tables: 3.0.0
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.6.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
sqlalchemy: 0.8.3
lxml: 3.2.3
bs4: 4.3.1
html5lib: None
bq: None
apiclient: None


Input data:
            Series
2013-09-12  430.30
2013-09-13     NaN
2013-09-14     NaN
2013-09-15     NaN
2013-09-16     NaN
2013-09-17     NaN
2013-09-18     NaN
2013-09-19     NaN
2013-09-20  414.53
2013-09-21     NaN

[10 rows x 1 columns]


The bug is the pct_change for 09-20 should be NaN, but it isn't:

              Series
2013-09-12       NaN
2013-09-13       NaN
2013-09-14       NaN
2013-09-15       NaN
2013-09-16       NaN
2013-09-17       NaN
2013-09-18       NaN
2013-09-19       NaN
2013-09-20 -0.036649
2013-09-21       NaN

[10 rows x 1 columns]
@jreback
Copy link
Contributor

jreback commented Feb 17, 2014

Not a bug, it is pad filling (as the default), which is fill-forward, you can disable by passing fill_method=None)

In [14]: df.pct_change(7,fill_method=None)
Out[14]: 
            Series
2013-09-12     NaN
2013-09-13     NaN
2013-09-14     NaN
2013-09-15     NaN
2013-09-16     NaN
2013-09-17     NaN
2013-09-18     NaN
2013-09-19     NaN
2013-09-20     NaN
2013-09-21     NaN

[10 rows x 1 columns]

In [15]: df.pct_change(7,fill_method='pad')
Out[15]: 
              Series
2013-09-12       NaN
2013-09-13       NaN
2013-09-14       NaN
2013-09-15       NaN
2013-09-16       NaN
2013-09-17       NaN
2013-09-18       NaN
2013-09-19       NaN
2013-09-20 -0.036649
2013-09-21       NaN

[10 rows x 1 columns]

@jreback jreback closed this as completed Feb 17, 2014
@selik
Copy link
Contributor

selik commented Oct 31, 2017

@jreback While pad filling is convenient, I think it's quite dangerous as the default. "Errors should never pass silently" and all that Zen. Pandas often drops NaNs for convenience, but dropping NaNs when plotting (for example) isn't nearly as easy to overlook. In fact, I just told someone a moment ago that pct_change did not fill, which was completely inaccurate. Luckily I tested it and discovered otherwise. Reading the documentation or even checking parameters didn't cross my mind until I was surprised by the result.

Backwards compatibility is important, too, but perhaps there's a chance this method could default to fill_method=None? The existence of this "bug" report suggests no-fill is more intuitive.

For comparison,

lag = series.shift(1)
change = (series - lag) / lag

seems like it would create the same results as series.pct_change(1) but the filling is a subtle difference.

I haven't done a systematic survey of use cases, but in my experience, most time-series analysis would prefer no-fill. When fill-forward is desired, I'd expect that the fill would be done before the pct_change as a more explicit step.

@alarrain
Copy link

alarrain commented Aug 17, 2018

Hi guys, I think this issue should be re-opened. @selik is absolutely right that "Errors should never pass silently" and unless explicitly done so, this is an error. Not technically, as the default is currently fill_method='pad', but that is not an answer. That is just stating the problem. Upgrading this to fill_method=None is in now way different to deprecating inplace=True, or any other relatively major change that pandas has gone or will go through.

@PieCampi
Copy link

PieCampi commented Apr 5, 2019

I agree that this is definitely dangerous as a default value and that this issue should be reopened, if not solved elsewhere.

Using pandas 0.24.1 I realised this only at the end of a complicated ML pipeline.
The default behaviour is still 'pad', which is at least deceiving in my opinion.

@selik
Copy link
Contributor

selik commented Sep 15, 2019

@jreback Can we reopen this issue with the idea of making a deprecation cycle to switch the default behavior?

@jreback
Copy link
Contributor

jreback commented Sep 15, 2019

pls make a new issue

@selik
Copy link
Contributor

selik commented Sep 16, 2019

New issue: #28461 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants