Bug in df.pct_change() #6389

JazzFan · 2014-02-17T22:37:49Z

df.pct_change incorrectly calculates a value when it shouldn't.

#!/usr/bin/env python

import numpy as np
import pandas as pd
from datetime import datetime


"""
Notice that the two nonmissing data points are 8 days apart, not 7.
Therefore, any attempt to calculate a 7-day (week-over-week) percent change
should result in NaN, not an actual numeric value.
The bug is that pct_change does calculate an (erroneous) actual numeric value.
(The value it calculates is 414.53/430.30 - 1 = -0.036649.)
"""
idx = pd.DatetimeIndex(start=datetime(2013, 9, 12),
                       end=datetime(2013, 9, 21),
                       freq='D')
df = pd.DataFrame(data={'Series': [430.30,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   np.NaN,
                                   414.53,
                                   np.NaN]}, index=idx)
pd.show_versions()
print('\n\nInput data:')
print(df)
print("\n\nThe bug is the pct_change for 09-20 should be NaN, but it isn't:\n")
print(df.pct_change(periods=7))

Here is the output:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Darwin
OS-release: 12.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.19.2
numpy: 1.8.0
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 1.1.0
sphinx: 1.1.3
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2013b
bottleneck: None
tables: 3.0.0
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.6.2
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
sqlalchemy: 0.8.3
lxml: 3.2.3
bs4: 4.3.1
html5lib: None
bq: None
apiclient: None


Input data:
            Series
2013-09-12  430.30
2013-09-13     NaN
2013-09-14     NaN
2013-09-15     NaN
2013-09-16     NaN
2013-09-17     NaN
2013-09-18     NaN
2013-09-19     NaN
2013-09-20  414.53
2013-09-21     NaN

[10 rows x 1 columns]


The bug is the pct_change for 09-20 should be NaN, but it isn't:

              Series
2013-09-12       NaN
2013-09-13       NaN
2013-09-14       NaN
2013-09-15       NaN
2013-09-16       NaN
2013-09-17       NaN
2013-09-18       NaN
2013-09-19       NaN
2013-09-20 -0.036649
2013-09-21       NaN

[10 rows x 1 columns]

The text was updated successfully, but these errors were encountered:

jreback · 2014-02-17T22:43:43Z

Not a bug, it is pad filling (as the default), which is fill-forward, you can disable by passing fill_method=None)

In [14]: df.pct_change(7,fill_method=None)
Out[14]: 
            Series
2013-09-12     NaN
2013-09-13     NaN
2013-09-14     NaN
2013-09-15     NaN
2013-09-16     NaN
2013-09-17     NaN
2013-09-18     NaN
2013-09-19     NaN
2013-09-20     NaN
2013-09-21     NaN

[10 rows x 1 columns]

In [15]: df.pct_change(7,fill_method='pad')
Out[15]: 
              Series
2013-09-12       NaN
2013-09-13       NaN
2013-09-14       NaN
2013-09-15       NaN
2013-09-16       NaN
2013-09-17       NaN
2013-09-18       NaN
2013-09-19       NaN
2013-09-20 -0.036649
2013-09-21       NaN

[10 rows x 1 columns]

selik · 2017-10-31T20:06:12Z

@jreback While pad filling is convenient, I think it's quite dangerous as the default. "Errors should never pass silently" and all that Zen. Pandas often drops NaNs for convenience, but dropping NaNs when plotting (for example) isn't nearly as easy to overlook. In fact, I just told someone a moment ago that pct_change did not fill, which was completely inaccurate. Luckily I tested it and discovered otherwise. Reading the documentation or even checking parameters didn't cross my mind until I was surprised by the result.

Backwards compatibility is important, too, but perhaps there's a chance this method could default to fill_method=None? The existence of this "bug" report suggests no-fill is more intuitive.

For comparison,

lag = series.shift(1)
change = (series - lag) / lag

seems like it would create the same results as series.pct_change(1) but the filling is a subtle difference.

I haven't done a systematic survey of use cases, but in my experience, most time-series analysis would prefer no-fill. When fill-forward is desired, I'd expect that the fill would be done before the pct_change as a more explicit step.

alarrain · 2018-08-17T08:25:09Z

Hi guys, I think this issue should be re-opened. @selik is absolutely right that "Errors should never pass silently" and unless explicitly done so, this is an error. Not technically, as the default is currently fill_method='pad', but that is not an answer. That is just stating the problem. Upgrading this to fill_method=None is in now way different to deprecating inplace=True, or any other relatively major change that pandas has gone or will go through.

PieCampi · 2019-04-05T14:38:55Z

I agree that this is definitely dangerous as a default value and that this issue should be reopened, if not solved elsewhere.

Using pandas 0.24.1 I realised this only at the end of a complicated ML pipeline.
The default behaviour is still 'pad', which is at least deceiving in my opinion.

selik · 2019-09-15T05:57:50Z

@jreback Can we reopen this issue with the idea of making a deprecation cycle to switch the default behavior?

jreback · 2019-09-15T12:14:31Z

pls make a new issue

selik · 2019-09-16T15:04:22Z

New issue: #28461 .

jreback closed this as completed Feb 17, 2014

This was referenced Sep 16, 2019

Series 'pct_change' with 'fill_method' broken #19873

Closed

DataFrame.pct_change should default fill_method=None instead of 'pad' #28461

Closed

flexatone mentioned this issue Dec 7, 2020

pct_change functionality static-frame/static-frame#285

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in df.pct_change() #6389

Bug in df.pct_change() #6389

JazzFan commented Feb 17, 2014

jreback commented Feb 17, 2014

selik commented Oct 31, 2017 •

edited

Loading

alarrain commented Aug 17, 2018 •

edited

Loading

PieCampi commented Apr 5, 2019 •

edited

Loading

selik commented Sep 15, 2019

jreback commented Sep 15, 2019

selik commented Sep 16, 2019

Bug in df.pct_change() #6389

Bug in df.pct_change() #6389

Comments

JazzFan commented Feb 17, 2014

jreback commented Feb 17, 2014

selik commented Oct 31, 2017 • edited Loading

alarrain commented Aug 17, 2018 • edited Loading

PieCampi commented Apr 5, 2019 • edited Loading

selik commented Sep 15, 2019

jreback commented Sep 15, 2019

selik commented Sep 16, 2019

selik commented Oct 31, 2017 •

edited

Loading

alarrain commented Aug 17, 2018 •

edited

Loading

PieCampi commented Apr 5, 2019 •

edited

Loading