
BUG: interpolate with limit keyword partially fills gaps larger than limit #36352


Open
2 of 3 tasks
rhkarls opened this issue Sep 14, 2020 · 6 comments
Labels
Docs, Missing-data

Comments

@rhkarls
Contributor

rhkarls commented Sep 14, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

# Interpolating "inside" gaps with limit gap size of 3

import pandas as pd
import numpy as np

ts_index = pd.date_range('2016-01-01','2016-01-2',freq='H')
limit_gap_size = 3

df = pd.DataFrame(index=ts_index, data={'raw':np.random.uniform(size=ts_index.size)})

df.iloc[0:3] = np.nan
df.iloc[5:7] = np.nan
df.iloc[10:16] = np.nan
df.iloc[17:20] = np.nan
df.iloc[23:25] = np.nan

df['filled'] = df.interpolate(limit=limit_gap_size,limit_area='inside')

# solution to specific case
# credit functions below: https://stackoverflow.com/a/54512613/752092
def bfill_nan(arr):
    """ Backward-fill NaNs """
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[0]), mask.shape[0]-1)
    idx = np.minimum.accumulate(idx[::-1], axis=0)[::-1]
    out = arr[idx]
    return out

def calc_mask(arr, maxgap):
    """ Mask NaN gaps longer than `maxgap` """
    isnan = np.isnan(arr)
    cumsum = np.cumsum(isnan).astype('float')
    diff = np.zeros_like(arr)
    diff[~isnan] = np.diff(cumsum[~isnan], prepend=0)
    diff[isnan] = np.nan
    diff = bfill_nan(diff)
    return (diff <= maxgap) | ~isnan # <= instead of < compared to SO answer


df['expected'] = df['raw'].interpolate(limit=3,limit_area='inside').where(calc_mask(df['raw'],limit_gap_size))

df

                          raw    filled  expected
2016-01-01 00:00:00       NaN       NaN       NaN
2016-01-01 01:00:00       NaN       NaN       NaN
2016-01-01 02:00:00       NaN       NaN       NaN
2016-01-01 03:00:00  0.781920  0.781920  0.781920
2016-01-01 04:00:00  0.732783  0.732783  0.732783
2016-01-01 05:00:00       NaN  0.545743  0.545743
2016-01-01 06:00:00       NaN  0.358704  0.358704
2016-01-01 07:00:00  0.171664  0.171664  0.171664
2016-01-01 08:00:00  0.689487  0.689487  0.689487
2016-01-01 09:00:00  0.131983  0.131983  0.131983
2016-01-01 10:00:00       NaN  0.140856       NaN
2016-01-01 11:00:00       NaN  0.149729       NaN
2016-01-01 12:00:00       NaN  0.158601       NaN
2016-01-01 13:00:00       NaN       NaN       NaN
2016-01-01 14:00:00       NaN       NaN       NaN
2016-01-01 15:00:00       NaN       NaN       NaN
2016-01-01 16:00:00  0.194093  0.194093  0.194093
2016-01-01 17:00:00       NaN  0.330719  0.330719
2016-01-01 18:00:00       NaN  0.467345  0.467345
2016-01-01 19:00:00       NaN  0.603971  0.603971
2016-01-01 20:00:00  0.740598  0.740598  0.740598
2016-01-01 21:00:00  0.223751  0.223751  0.223751
2016-01-01 22:00:00  0.383625  0.383625  0.383625
2016-01-01 23:00:00       NaN       NaN       NaN
2016-01-02 00:00:00       NaN       NaN       NaN

Problem description

When a limit is passed, it is expected to be respected: gaps larger than the limit should not be interpolated at all. Partially filling the beginning (or end, depending on limit_direction) of such gaps is not reasonable behavior.

Expected Output

See output in example

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2a7d332
python : 3.7.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.1.2
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.6.0.post20200814
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : None
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.19
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

@rhkarls rhkarls added the Bug and Needs Triage labels Sep 14, 2020
@phofl
Member

phofl commented Sep 14, 2020

Hey, thanks for your report.

That is the documented behavior. limit=3 means that at most three consecutive NaNs are filled. limit_area='inside' means that there have to be valid values before and after the NaNs.
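A minimal sketch of the behavior described above, using a made-up series (the values are illustrative and not taken from the report): a gap of four NaNs with limit=3 gets its first three positions filled rather than being skipped, and the trailing NaN is left alone because it is not "inside".

```python
import numpy as np
import pandas as pd

# One interior gap of four NaNs, followed by one trailing NaN.
s = pd.Series([1.0, np.nan, np.nan, np.nan, np.nan, 6.0, np.nan])

# limit=3 caps how many consecutive NaNs are filled (forward by default);
# it does not cause gaps longer than three to be skipped entirely.
filled = s.interpolate(method="linear", limit=3, limit_area="inside")
print(filled)
```

Positions 1 to 3 are interpolated, position 4 stays NaN because the limit is exhausted, and position 6 stays NaN because it has no valid value after it.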

@rhkarls
Contributor Author

rhkarls commented Sep 15, 2020

Thanks! The documentation does indeed say consecutive NaNs, but I still think the behavior most readers would expect from the documentation is that gaps larger than the limit are not filled at all. Take for example a time series with several small and large gaps: it can make sense to fill the smaller gaps by interpolating across them, but larger gaps often do not make sense to interpolate at all. The method currently leaves some strange values at the beginning or end of these large gaps.

Anyway, I see that fillna(), which also has a limit keyword, does elaborate on the behavior in the documentation - perhaps include that in interpolate as well?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html

I also see now that there was some effort adding a max_gap keyword (#25141), hopefully that continues.
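For illustration only, the kind of "skip gaps longer than N" behavior discussed here can be sketched with existing pandas calls. The helper name `interpolate_max_gap` and its `max_gap` argument are hypothetical; this is not the API proposed in #25141, just one way to approximate the idea.

```python
import numpy as np
import pandas as pd

def interpolate_max_gap(s, max_gap, **kwargs):
    """Hypothetical helper (not part of pandas): interpolate interior NaN
    runs of length <= max_gap and leave longer runs entirely as NaN."""
    isna = s.isna()
    # Give every run of consecutive NaN / non-NaN values its own id.
    run_id = isna.ne(isna.shift()).cumsum()
    # Length of the run that each element belongs to.
    run_len = run_id.map(run_id.value_counts())
    too_long = isna & (run_len > max_gap)
    # Interpolate all interior gaps, then blank out the runs that were too long.
    filled = s.interpolate(limit_area="inside", **kwargs)
    return filled.mask(too_long)

series = pd.Series([1, 1, 1, np.nan, np.nan, np.nan, np.nan, 5, 6, np.nan, 7])
result = interpolate_max_gap(series, max_gap=2, method="linear")
print(result)
```

In this sketch the four-NaN gap is left untouched while the single NaN between 6 and 7 is interpolated. Masking after the fact keeps the helper short, at the cost of computing values that are then discarded.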

@phofl
Member

phofl commented Sep 15, 2020

Would you like to contribute a PR to enhance the docs? We have an example showing that behavior; maybe add a comment to elaborate further?

If you would like to discuss the enhancement of the function, you could open an enhancement issue and describe your wishes.

@phofl phofl added the Docs label and removed the Bug and Needs Triage labels Sep 15, 2020
@rhkarls
Contributor Author

rhkarls commented Sep 16, 2020

@phofl I will look at the contributing guide when I have time and hopefully figure out how to do a PR - it will be a good learning experience!

@bgroenks96

This is very surprising behavior, and I would argue that it should not be the default, especially for linear interpolation.

I can see why some might want this behavior for ffill and bfill methods, but with interpolation it's just wrong. If the gap is too big to be filled, partially interpolating using the value from the other side of the gap is just generating garbage data.

Thus, I would argue, for the non-fill cases, this is a bug, not merely an enhancement.

Example:

series = pd.Series([1,1,1,np.nan,np.nan,np.nan,np.nan,5,6,np.nan,7])
series.interpolate(method='linear', limit=2, limit_direction='forward')

output:

0     1.0
1     1.0
2     1.0
3     1.8
4     2.6
5     NaN
6     NaN
7     5.0
8     6.0
9     6.5
10    7.0
dtype: float64

@mroeschke mroeschke added the Missing-data label Aug 13, 2021
@CronJorian

Sorry to bump this issue, but I find this rather frustrating too, as either I misunderstand @phofl or it does not work as he said.

pd.Series([1,2,3,4,np.nan,np.nan,np.nan,np.nan,9,10,11]).interpolate("linear", limit=3, limit_area="inside")

results in:

0      1.0
1      2.0
2      3.0
3      4.0
4      5.0
5      6.0
6      7.0
7      NaN
8      9.0
9     10.0
10    11.0
dtype: float64

If I understand this correctly, the limit_area parameter should prevent interpolation here because the consecutive NaN values do not have valid values on both sides.
The actual behavior is that the interpolation is executed despite the parameter.

My recommendation is to either adjust the docs to reflect this behaviour or to prevent filling gaps larger than the limit.
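A short sketch that may help separate the two parameters, assuming the documented meaning of limit_area (restricting fills to NaNs that lie between valid values, independent of gap length). The first series is made up for illustration; the second is the one from this comment.

```python
import numpy as np
import pandas as pd

# Leading NaN, one interior NaN, two trailing NaNs (illustrative values).
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan, np.nan])

# limit_area="inside" only excludes the leading and trailing NaNs.
inside = s.interpolate(method="linear", limit_area="inside")
print(inside)

# The series from this comment: the four-NaN run sits between 4 and 9,
# so the whole run counts as "inside"; only limit=3 cuts the fill short.
t = pd.Series([1, 2, 3, 4, np.nan, np.nan, np.nan, np.nan, 9, 10, 11])
partial = t.interpolate(method="linear", limit=3, limit_area="inside")
print(partial)
```

Under this reading, limit_area decides where filling is allowed and limit decides how far into a gap it goes, which is why the run above is partially filled rather than skipped.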

5 participants