BUG: interpolate with limit keyword partially fills gaps larger than limit #36352

rhkarls · 2020-09-14T09:38:06Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

# Interpolating "inside" gaps with limit gap size of 3

import pandas as pd
import numpy as np

ts_index = pd.date_range('2016-01-01','2016-01-2',freq='H')
limit_gap_size = 3

df = pd.DataFrame(index=ts_index, data={'raw':np.random.uniform(size=ts_index.size)})

df.iloc[0:3] = np.nan
df.iloc[5:7] = np.nan
df.iloc[10:16] = np.nan
df.iloc[17:20] = np.nan
df.iloc[23:25] = np.nan

df['filled'] = df['raw'].interpolate(limit=limit_gap_size, limit_area='inside')

# solution to specific case
# credit functions below: https://stackoverflow.com/a/54512613/752092
def bfill_nan(arr):
    """ Backward-fill NaNs """
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[0]), mask.shape[0]-1)
    idx = np.minimum.accumulate(idx[::-1], axis=0)[::-1]
    out = arr[idx]
    return out

def calc_mask(arr, maxgap):
    """ Mask NaN gaps longer than `maxgap` """
    isnan = np.isnan(arr)
    cumsum = np.cumsum(isnan).astype('float')
    diff = np.zeros_like(arr)
    diff[~isnan] = np.diff(cumsum[~isnan], prepend=0)
    diff[isnan] = np.nan
    diff = bfill_nan(diff)
    return (diff <= maxgap) | ~isnan # <= instead of < compared to SO answer


df['expected'] = df['raw'].interpolate(limit=3,limit_area='inside').where(calc_mask(df['raw'],limit_gap_size))

df

                          raw    filled  expected
2016-01-01 00:00:00       NaN       NaN       NaN
2016-01-01 01:00:00       NaN       NaN       NaN
2016-01-01 02:00:00       NaN       NaN       NaN
2016-01-01 03:00:00  0.781920  0.781920  0.781920
2016-01-01 04:00:00  0.732783  0.732783  0.732783
2016-01-01 05:00:00       NaN  0.545743  0.545743
2016-01-01 06:00:00       NaN  0.358704  0.358704
2016-01-01 07:00:00  0.171664  0.171664  0.171664
2016-01-01 08:00:00  0.689487  0.689487  0.689487
2016-01-01 09:00:00  0.131983  0.131983  0.131983
2016-01-01 10:00:00       NaN  0.140856       NaN
2016-01-01 11:00:00       NaN  0.149729       NaN
2016-01-01 12:00:00       NaN  0.158601       NaN
2016-01-01 13:00:00       NaN       NaN       NaN
2016-01-01 14:00:00       NaN       NaN       NaN
2016-01-01 15:00:00       NaN       NaN       NaN
2016-01-01 16:00:00  0.194093  0.194093  0.194093
2016-01-01 17:00:00       NaN  0.330719  0.330719
2016-01-01 18:00:00       NaN  0.467345  0.467345
2016-01-01 19:00:00       NaN  0.603971  0.603971
2016-01-01 20:00:00  0.740598  0.740598  0.740598
2016-01-01 21:00:00  0.223751  0.223751  0.223751
2016-01-01 22:00:00  0.383625  0.383625  0.383625
2016-01-01 23:00:00       NaN       NaN       NaN
2016-01-02 00:00:00       NaN       NaN       NaN

Problem description

When a limit is passed, it is expected to be respected: gaps larger than the limit should not be interpolated at all. Partially filling the beginning (or end, depending on limit_direction) of such gaps is not reasonable behavior.

Expected Output

See output in example

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 2a7d332
python : 3.7.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.1.2
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.6.0.post20200814
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : None
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.19
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None


phofl · 2020-09-14T20:31:48Z

Hey, thanks for your report.

That is the documented behavior. limit=3 means that at most three consecutive NaNs are filled. limit_area='inside' means that there have to be valid values before and after the NaNs.
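A minimal sketch of these semantics, on a made-up series (the values are illustrative and not taken from the report). With the default forward direction, the first three NaNs of the five-NaN gap are filled and the rest are left alone, while the leading and trailing NaNs are untouched because of limit_area='inside':

```python
import numpy as np
import pandas as pd

# Illustrative data: one leading NaN, a gap of five NaNs, one trailing NaN.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, np.nan, np.nan, 7.0, np.nan])

out = s.interpolate(limit=3, limit_area="inside")

# Positions 2-4 are filled (2.0, 3.0, 4.0); positions 5-6 stay NaN even
# though the gap is longer than the limit, which is the partial fill
# reported in this issue.
print(out)
```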

rhkarls · 2020-09-15T06:59:23Z

Thanks! The documentation does indeed say consecutive NaNs, but I still think the behavior most readers would expect from the documentation is that gaps larger than the limit are not filled at all. Take for example a time series with several small and large gaps: it can make sense to fill the smaller gaps by interpolating across them, but it often does not make sense to interpolate the larger gaps at all. The method currently leaves some strange values at the beginning or end of these large gaps.

Anyway, I see that fillna(), which also has a limit keyword, does elaborate on the behavior in the documentation - perhaps include that in interpolate as well?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.fillna.html

I also see now that there was some effort adding a max_gap keyword (#25141), hopefully that continues.
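Until something like that exists, the all-or-none behavior can be approximated by measuring the length of each NaN run and masking the interpolated result. This is only a sketch: `interpolate_small_gaps` and its `max_gap` argument are hypothetical names for illustration, not pandas API and not the implementation proposed in #25141.

```python
import numpy as np
import pandas as pd

def interpolate_small_gaps(s, max_gap):
    """Hypothetical helper: fill interior NaN runs only if their length <= max_gap."""
    isna = s.isna()
    # A new run starts wherever the NaN mask changes value.
    run_id = (isna != isna.shift()).cumsum()
    # Length of the run that each element belongs to.
    run_len = isna.groupby(run_id).transform("size")
    # Interpolate every interior gap, then discard values in gaps that are too long.
    filled = s.interpolate(limit_area="inside")
    return filled.where(~isna | (run_len <= max_gap))

series = pd.Series([1, 1, 1, np.nan, np.nan, np.nan, np.nan, 5, 6, np.nan, 7])
result = interpolate_small_gaps(series, max_gap=2)
print(result)
```

With `max_gap=2`, the four-NaN gap is left entirely empty and only the single NaN between 6 and 7 is filled.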

phofl · 2020-09-15T07:52:12Z

Would you like to contribute a PR to enhance the docs? We have an example showing that behavior; maybe add a comment to elaborate further?

If you would like to discuss the enhancement of the function, you could open an enhancement issue and describe your wishes.

rhkarls · 2020-09-16T14:31:57Z

@phofl I will look at the contributing guide when I have time and hopefully figure out how to do a PR - it will be a good learning experience!

bgroenks96 · 2021-07-03T13:09:44Z

This is very surprising behavior, and I would argue that it should not be the default, especially for linear interpolation.

I can see why some might want this behavior for ffill and bfill methods, but with interpolation it's just wrong. If the gap is too big to be filled, partially interpolating using the value from the other side of the gap is just generating garbage data.

Thus, I would argue, for the non-fill cases, this is a bug, not merely an enhancement.

Example:

series = pd.Series([1,1,1,np.nan,np.nan,np.nan,np.nan,5,6,np.nan,7])
series.interpolate(method='linear', limit=2, limit_direction='forward')

output:

0     1.0
1     1.0
2     1.0
3     1.8
4     2.6
5     NaN
6     NaN
7     5.0
8     6.0
9     6.5
10    7.0
dtype: float64
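To see where 1.8 and 2.6 come from: the straight line runs from the value 1 at position 2 to the value 5 at position 7, so each step adds (5 - 1) / 5 = 0.8. A quick check with `np.interp` (used here only to reproduce the arithmetic, not as a claim about how pandas computes it internally):

```python
import numpy as np

# Known points on either side of the gap: (position 2, value 1) and (position 7, value 5).
partial = np.interp([3, 4], [2, 7], [1.0, 5.0])
print(partial)  # the two values that limit=2 lets through
```

Both values depend on the far endpoint 5.0, which is five positions away.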

CronJorian · 2023-10-10T10:08:30Z

Sorry to bump this issue, but I find this rather frustrating too: either I misunderstand @phofl or it does not work as he said.

pd.Series([1,2,3,4,np.NaN,np.NaN,np.NaN,np.NaN,9, 10, 11]).interpolate("linear", limit=3, limit_area="inside")

results in:

0      1.0
1      2.0
2      3.0
3      4.0
4      5.0
5      6.0
6      7.0
7      NaN
8      9.0
9     10.0
10    11.0
dtype: float64

If I understand this correctly, the limit_area parameter should prevent interpolation here due to the consecutive np.NaN not having valid values on both sides.
The actual behavior is that the interpolation is executed despite the parameter.

My recommendation is to either adjust the docs to reflect this behaviour or to prevent filling gaps larger than the limit.

rhkarls added the Bug and Needs Triage labels Sep 14, 2020

phofl added the Docs label and removed the Bug and Needs Triage labels Sep 15, 2020

joAschauer mentioned this issue Jul 2, 2021

ENH: DataFrame.interpolate limit to support all-or-none filling #42291

Open

mroeschke added the Missing-data label Aug 13, 2021
