pandas.DataFrame.interpolate() does not extrapolate even if it is asked to, depending on interpolation method #31949
Comments
I second this. Also, even when it works, it doesn't. The implied meaning of "extrapolate" is that it will continue on the last available trend. However, the observed result is that the last value is repeated. In:
Out:
|
I also stumbled on this bug. The examples in the current documentation are also confusing. Extrapolation is mentioned there ("fill NaNs outside valid values (extrapolate)", https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html):
df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
                   (np.nan, 2.0, np.nan, np.nan),
                   (2.0, 3.0, np.nan, 9.0),
                   (np.nan, 4.0, -4.0, 16.0)],
                  columns=list('abcd'))
df
a b c d
0 0.0 NaN -1.0 1.0
1 NaN 2.0 NaN NaN
2 2.0 3.0 NaN 9.0
3 NaN 4.0 -4.0 16.0
df.interpolate(method='linear', limit_direction='forward', axis=0)
a b c d
0 0.0 NaN -1.0 1.0
1 1.0 2.0 -2.0 5.0
2 2.0 3.0 -3.0 9.0
3 2.0 4.0 -4.0 16.0
... but this is not a linear extrapolation:
a
0 0.0
1 NaN
2 2.0
3 NaN
a
0 0.0
1 1.0
2 2.0
3 2.0 |
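To make the complaint concrete, here is a minimal sketch of what continuing the last available trend would give for column a. The helper name `extend_last_trend` is hypothetical (it is not a pandas API), and the sketch assumes a numeric, strictly increasing index with at least two valid values.

```python
import numpy as np
import pandas as pd


def extend_last_trend(s: pd.Series) -> pd.Series:
    """Hypothetical helper: linear interpolation inside the valid range,
    continuation of the outermost slopes outside it."""
    x = s.index.to_numpy(dtype=float)
    y = s.to_numpy(dtype=float)
    known = ~np.isnan(y)
    xk, yk = x[known], y[known]

    # np.interp interpolates linearly and clamps to the end values outside.
    out = np.interp(x, xk, yk)

    # Replace the clamped values with a continuation of the edge slopes.
    left = x < xk[0]
    right = x > xk[-1]
    slope_left = (yk[1] - yk[0]) / (xk[1] - xk[0])
    slope_right = (yk[-1] - yk[-2]) / (xk[-1] - xk[-2])
    out[left] = yk[0] + slope_left * (x[left] - xk[0])
    out[right] = yk[-1] + slope_right * (x[right] - xk[-1])
    return pd.Series(out, index=s.index, name=s.name)


a = pd.Series([0.0, np.nan, 2.0, np.nan], name="a")
print(extend_last_trend(a).tolist())  # [0.0, 1.0, 2.0, 3.0]
```

With this reading of "extrapolate", the last row of column a becomes 3.0 instead of the repeated 2.0.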
Based on discussions in #8000, it seems we need an argument that specifies whether to extrapolate at the beginning and end of the series. Alternatively, the docs could reflect that such extrapolation is not provided by |
@Dr-Irv Is this still on the project's roadmap? |
We are always open to a PR that would address this issue. For issues like this, we don't put them on a roadmap. We are just open to the community addressing them! |
@Dr-Irv Thanks for the explanation. Unfortunately, I am not familiar enough with pandas to see myself making a real contribution. But here is the solution I am currently working with:
import pandas as pd

def extrapolate_linear(s):
    s = s.copy()
    # Indices of not-nan values
    idx_nn = s.index[~s.isna()]
    # At least two data points needed for trend analysis
    assert len(idx_nn) >= 2
    # Outermost indices
    idx_l = idx_nn[0]
    idx_r = idx_nn[-1]
    # Indices left and right of outermost values
    idx_ll = s.index[s.index < idx_l]
    idx_rr = s.index[s.index > idx_r]
    # Derivative of not-nan indices / values
    v = s[idx_nn].diff()
    # Left- and right-most derivative values (positional access, so this
    # also works when the index labels are not 0..n-1)
    v_l = v.iloc[1]
    v_r = v.iloc[-1]
    f_l = idx_l - idx_nn[1]
    f_r = idx_nn[-2] - idx_r
    # Set values left / right of boundaries
    l_l = lambda idx: (idx_l - idx) / f_l * v_l + s[idx_l]
    l_r = lambda idx: (idx_r - idx) / f_r * v_r + s[idx_r]
    x_l = pd.Series(idx_ll).apply(l_l)
    x_l.index = idx_ll
    x_r = pd.Series(idx_rr).apply(l_r)
    x_r.index = idx_rr
    s[idx_ll] = x_l
    s[idx_rr] = x_r
    return s
Example usage is documented in this notebook. Hopefully this is inspiration enough for more competent people to make a PR. EDIT: I've vectorized the function. |
I don't think that is a correct interpretation. The pandas docs specify that
So the following should work, with the current method signature and no new argument:
import numpy as np
import pandas as pd
# A 1-D Series with missing external values
x = [0.5, 1, 2, 3, 20]
y = [np.nan, 1, 4, 9, np.nan]
s = pd.Series(y, index=x)
# Expected usage
kw = dict(method="quadratic", fill_value="extrapolate")
s.interpolate(**kw)
But this fails. A value is extrapolated on the 'forward' end of
The code simply does not do what is advertised by the docs, so this is clearly (IMO) a bug. (Also a surprising one, since scipy 0.17 was 5 years ago, and one would think that pandas' use of basic features in numpy and scipy was stable and tested.) A slightly more compact workaround than @flxai's:
from scipy.interpolate import interp1d
# A mask indicating where `s` is not null
m = s.notna()
# Construct an interpolator from the non-null values
# NB 'kind' instead of 'method'!
kw = dict(kind="quadratic", fill_value="extrapolate")
f = interp1d(s[m].index, s[m], **kw)
# Apply this to the indices of the nulls; reconstruct a series
s2 = pd.Series(f(s[~m].index), index=s[~m].index)
# Fill `s` using the values from `s2`
result = s.fillna(s2)
# Previous 3 statements combined:
result = s.fillna(
    pd.Series(
        interp1d(s[m].index, s[m], **kw)(s[~m].index),
        index=s[~m].index,
    )
)
This gives the expected result:
|
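As a sanity check on the scipy-based workaround above: the three known points (1, 1), (2, 4) and (3, 9) of the sample series all lie on y = x², so a quadratic fit through them should reproduce x² at the two missing positions as well. The sketch below only verifies that property; the names `expected` and `max_error` are illustrative, not part of the thread.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# The sample series from the thread: NaN at both ends.
x = [0.5, 1, 2, 3, 20]
y = [np.nan, 1, 4, 9, np.nan]
s = pd.Series(y, index=x)

# Fit on the non-null points and evaluate at the null positions.
m = s.notna()
f = interp1d(s[m].index, s[m], kind="quadratic", fill_value="extrapolate")
result = s.fillna(pd.Series(f(s[~m].index), index=s[~m].index))

# Illustrative check (an assumption of this sketch): compare with x**2.
expected = pd.Series(np.square(x), index=x)
max_error = float((result - expected).abs().max())
print(result)
print("largest deviation from x**2:", max_error)
```

The deviation should be at the level of floating-point round-off, which shows that both the leading and the trailing NaN are filled by the fitted parabola rather than by a repeated edge value.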
@khaeru thanks for your code - I think I found a more elegant solution. As you mentioned, the extrapolation only works in the forward direction, so you can just flip the series, apply another extrapolation, and flip it back again. Using your example (just one extrapolation):
gives us this:
But with my solution:
we get this:
|
Sure, that also works! I don't know pandas' internals, so I can't guess whether your workaround, mine, or some other would perform best on large(r) series. People should consider their particular use-cases, and check/test. Since this is a bug, these are only workarounds. The real ‘solution’ will be a PR by someone who knows the internals well enough to make one. |
@khaeru @lyndonchan To extrapolate in both directions, use
import numpy as np
import pandas as pd
# A 1-D Series with missing external values
x = [0.5, 1, 2, 3, 20]
y = [np.nan, 1, 4, 9, np.nan]
s = pd.Series(y, index=x)
# Expected usage
kw = dict(method="quadratic", fill_value="extrapolate", limit_direction="both")
s.interpolate(**kw)
This gives:
|
The method @zhihua-zheng provides unfortunately does not work properly for
Here's an example:
|
Late to the party here, but we've experienced similar issues over at xarray. One very obvious problem is that Lines 504 to 510 in f00efd0
Although One solution to fix this inconsistent behaviour is to either allow
As the |
I second what others have said, the current behaviour is very surprising:
This yields an utter mess:
How is this a mess? Well, it uses the linear method I specified for interpolation, but for extrapolation it uses bfill and ffill instead of linear extrapolation. I can work around this by using the following instead:
with output:
but this workaround was hard to find. Either |
Furthermore, the default to extrapolate forward ( |
Is there a chance to get the changes I suggest (do the right thing for extrapolation and don't extrapolate by default) into 3.0.0? A major release would be required to clean up the API. Is a PR welcome? Has someone already been working on parts of my suggestions? |
I notice that the workaround I mentioned above doesn't work if there are duplicate x-values:
interpolates just fine but doesn't extrapolate:
but delegating to scipy:
raises an error:
|
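One possible way around the duplicate x-value problem, sketched under a loud assumption: it is acceptable to collapse duplicate index labels to their mean before fitting. That choice is this sketch's, not something proposed in the thread, and the helper name `fill_with_duplicates` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d


def fill_with_duplicates(s: pd.Series, kind: str = "linear") -> pd.Series:
    """Hypothetical helper: fit on de-duplicated points, fill only the NaNs."""
    # Assumption: duplicate x-values are averaged so that x becomes unique.
    # groupby(level=0) also returns the groups sorted by index label.
    known = s.dropna().groupby(level=0).mean()
    f = interp1d(
        known.index.to_numpy(dtype=float),
        known.to_numpy(dtype=float),
        kind=kind,
        fill_value="extrapolate",
    )
    out = s.to_numpy(dtype=float).copy()
    missing = np.isnan(out)
    out[missing] = f(s.index.to_numpy(dtype=float)[missing])
    return pd.Series(out, index=s.index, name=s.name)


# x = 1 appears twice; the NaNs sit outside the valid range.
s_dup = pd.Series([np.nan, 1.0, 3.0, 4.0, np.nan], index=[0, 1, 1, 2, 3])
filled = fill_with_duplicates(s_dup)
print(filled)
```

The original values at the duplicated positions are left untouched; only the fit uses the averaged points. Whether averaging is appropriate depends on the data, so this should be treated as a workaround, not a fix.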
Code Sample, a copy-pastable example if possible
Problem description
Some of the offered methods (it seems all of them that are provided by interp1d) are unable to extrapolate over np.nan. However, the limit_area switch for df.interpolate() indicates you can force extrapolation. A combination of limit_area=None and an incompatible method should raise a warning.
There used to be a similar issue (#8000) where extrapolation over trailing NaNs was done unintentionally, so maybe the fix for that overdid it.
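Until such a warning exists, a caller can detect the situation after the fact. The following is a minimal sketch with a hypothetical helper, `edges_unfilled`, that compares a series before and after interpolation and warns when NaNs outside the valid range are still present. The demo uses `limit_area="inside"` only to produce a result whose edges are deliberately left unfilled.

```python
import warnings

import numpy as np
import pandas as pd


def edges_unfilled(before: pd.Series, after: pd.Series) -> bool:
    """Hypothetical helper: True (plus a warning) if leading or trailing
    NaNs of `before` are still NaN in `after`."""
    valid = before.notna().to_numpy()
    if not valid.any():
        return False
    # Positions of the first and last valid value in `before`.
    first = int(valid.argmax())
    last = len(valid) - 1 - int(valid[::-1].argmax())
    outside = np.ones(len(valid), dtype=bool)
    outside[first:last + 1] = False
    unfilled = bool((after.isna().to_numpy() & outside).any())
    if unfilled:
        warnings.warn(
            "NaNs outside the valid range were not extrapolated",
            UserWarning,
            stacklevel=2,
        )
    return unfilled


before = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])
# limit_area="inside" fills only NaNs surrounded by valid values.
after = before.interpolate(method="linear", limit_area="inside")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = edges_unfilled(before, after)
print(flagged, len(caught))
```

The same check can be run on the result of any method and limit_area combination to see whether the requested extrapolation actually happened.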
Expected Output
Extrapolation over the NaNs in the array is expected. Using a different method, such as pchip, achieves this.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None
pandas : 0.25.3 (also tested with 1.0.0)
numpy : 1.15.4
pytz : 2018.9
dateutil : 2.7.5
pip : 20.0.2
setuptools : 41.0.1
Cython : 0.29.15
pytest : None
hypothesis : None
sphinx : 1.8.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.3
matplotlib : 3.0.3
numexpr : None
odfpy : None
openpyxl : 2.5.12
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None