Enhancement Request: control extrapolation on .interpolate #16284

WBare · 2017-05-08T13:56:34Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

dfMain = pd.DataFrame({
    'a': [0, 1, np.nan, 3, 4],
    'b': [np.nan, np.nan, np.nan, 3, 4],
    'c': [0, 1, 2, 3, np.nan]})

for col in dfMain:
    start = dfMain[col].first_valid_index()
    end = dfMain[col].last_valid_index()
    dfMain.loc[start:end, col] = dfMain.loc[start:end, col].interpolate()

print(dfMain)

Problem description

It would be very nice to have a limit_direction='inside' that would make interpolate only fill values that are surrounded (both in front and behind) with valid values.

This would allow an interpolate to only fill missing values in a series and not extend the series beyond its original limits. The key here is that it is sometimes important to maintain the original range of a series, but still fill in the gaps.

The example shows a simple DataFrame with an 'inside' interpolation.

Expected Output

     a    b    c
0  0.0  NaN  0.0
1  1.0  NaN  1.0
2  2.0  NaN  2.0
3  3.0  3.0  3.0
4  4.0  4.0  NaN
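
For contrast, here is a minimal sketch of what a plain call does with the same data, assuming the default method='linear' and limit_direction='forward'. The variable name default_fill is only for illustration. The trailing NaN in column 'c' gets filled by repeating the last valid value, which is the extension past the original range that this request wants to be able to switch off.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [0, 1, np.nan, 3, 4],
    'b': [np.nan, np.nan, np.nan, 3, 4],
    'c': [0, 1, 2, 3, np.nan]})

# Plain call with the defaults (method='linear', limit_direction='forward').
default_fill = df.interpolate()

# 'a': the interior gap is filled.
# 'b': the leading NaNs stay, because filling only runs forward.
# 'c': the trailing NaN is filled with the last valid value (3.0),
#      so the series is extended beyond its original end.
print(default_fill)
```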

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-75-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 34.4.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: 0.2.1
None


TomAugspurger · 2017-05-08T14:15:43Z

So, this kind of already works when you use the scipy methods, since not filling outside the valid range is the default for scipy when you extrapolate:

In [31]: dfMain.interpolate(method='slinear')
Out[31]:
     a    b    c
0  0.0  NaN  0.0
1  1.0  NaN  1.0
2  2.0  NaN  2.0
3  3.0  3.0  3.0
4  4.0  4.0  NaN

This is an implementation detail that the user shouldn't need to worry about... But I'm not sure that we can make this consistent across methods in a backwards-compatible way.

WBare · 2017-05-08T17:19:44Z

Thank you very much, @TomAugspurger ! I did not know that, and that solves the problem for me, but I agree completely that it would be nice to somehow make this more visible to the user.

I don't know, but I'm guessing method='slinear' will not have an option to respect the limit= on number of NaNs to fill in, otherwise the code could just intercept something like limit_direction='inside' and save a bunch of work.

Thanks again!!

WBare · 2017-05-09T18:09:25Z

This needs to run through the existing interpolate function so that it will respect the limit=n parameter correctly.

TomAugspurger · 2017-05-22T15:23:10Z

Continuing from #16307 (comment)

Hmm I hadn't considered the interaction of limit and extrapolation... I think that's covered by our current limit handling though.

My initial (very rough) thoughts are something like

def interpolate(self, method='linear', axis=0, limit=None, inplace=False,
                limit_direction='forward', downcast=None, extrapolate=None, **kwargs):
    """
    ...
    extrapolate : array-like, two-tuple, "extrapolate", or None, optional

        This is similar to scipy.interpolate.interp1d's `fill_value` keyword,
        with special handling for pandas interpolation methods. By default,
        pandas interpolation methods (...) will extrapolate forward only by
        repeating the last valid observation, while scipy methods will not
        extrapolate (following the default for scipy). To disable extrapolation
        for pandas methods, use `extrapolate=np.nan`.
    """

The difference between pandas and scipy methods is unfortunate, but I don't think it's worth deprecating one or the other (willing to change my mind on this).

I don't think an interpolate=True argument is necessary. @WBare d

WBare · 2017-05-22T17:26:28Z

I don't think an interpolate=True argument is necessary.

@TomAugspurger I agree, in that I personally do not have a case where I would want to extrapolate only, BUT I'm concerned we are arbitrarily eliminating a use case that should get completed while we are in the code.

More generally, as a complete set of options, if we have the ability to interpolate only, why not the ability to extrapolate only? I can imagine a user that would like to extend a series, but not disturb existing NaNs inside the series.

I have not needed that myself, so I'm basing this more on a logically complete set of options than on personal experience.

With that said, if we do choose to have an option for both, I'm changing my mind on the parameter. I like your initial idea better (something like limit_type='interpolate' | 'extrapolate') as opposed to extrapolate=False because the logical flags allow for a combination that would never make sense (i.e. both interpolate=False and extrapolate=False). The single parameter more clearly conveys the idea of limiting to one or the other.

TomAugspurger · 2017-05-22T18:04:55Z

BUT, I'm concerned we are arbitrarily eliminating a use case that should get completed while we are in the code.

Yeah, that makes sense. You'll just have to balance ease of implementation with not breaking existing code :)

WBare · 2017-05-22T18:35:46Z

OK, cool. Any ideas on the actual parameter?

Is everyone cool with

limit_type= ('interpolate' | 'extrapolate' | None=default to both which is current_behavior)

This seems consistent with existing parameters limit=n and limit_direction

naifrec · 2017-05-23T11:32:53Z

hey guys, if you let me jump in. I have the feeling the limit kwarg does not behave as you would expect it to when working with time series. To cite @rhkarls in the issue #1892 :

Say limit=2, if there is a NaN gap of 2 it would be completely filled with interpolated values. If there is a NaN gap of 4 nothing is filled, which is different from the fillna limit where the two first entries would be filled when using forward filling. This is very applicable for time series where it is often valid to interpolate between small gaps, while larger gaps should not be filled.

So lemme write an example:

import numpy as np
import pandas as pd


df = pd.DataFrame(
    index=pd.date_range(
        start='02-01-2017 06:00:00',
        end='02-07-2017 06:00:00'),
    data={'A': range(7)})
df = df.drop(pd.to_datetime('2017-02-02 06:00:00'), axis=0)

df.head()

                     A
2017-02-01 06:00:00  0
2017-02-03 06:00:00  2
2017-02-04 06:00:00  3
2017-02-05 06:00:00  4
2017-02-06 06:00:00  5

Now what I want is to resample and interpolate the time series every 12 hours, but only for the consecutive days, so as not to make too big assumptions on the behavior of the time series for larger time deltas. That is not immediately possible currently, because of how limit works. See below, where putting limit of 2 (i.e. limit of a day) means that if two consecutive values are NaN, please do not fill in:

df.resample(rule='12H',base=6).interpolate('time', limit=2)

                       A
2017-02-01 06:00:00  0.0
2017-02-01 18:00:00  0.5  # I would expect this to be NaN
2017-02-02 06:00:00  1.0  # I would expect this to be NaN
2017-02-02 18:00:00  NaN
2017-02-03 06:00:00  2.0
2017-02-03 18:00:00  2.5
2017-02-04 06:00:00  3.0
2017-02-04 18:00:00  3.5
2017-02-05 06:00:00  4.0
2017-02-05 18:00:00  4.5
2017-02-06 06:00:00  5.0
2017-02-06 18:00:00  5.5
2017-02-07 06:00:00  6.0

To achieve what I want now, I have to use these functions I made:

def interpolate_consecutive(df, frequency):
    """
    Only interpolates values at the requested frequency if the
    values were separated by a day.
    
    Parameters
    ----------
    df : pd.DataFrame
        Dataframe with Time series index
    frequency : basestring
        Frequency to use to resample then interpolate.
        Only expects 'H' or 'T' based rules, but that's
        because I only need to support these in my case.
    
    Returns
    -------
    df : pd.DataFrame
        Resampled and interpolated dataframe.

    """
    base = 6 if 'H' in frequency else 0
    start_indices, end_indices = get_non_consecutive(
        df, pd.Timedelta(days=1))
    df = df.resample(rule=frequency, base=base).interpolate('time')

    indices_to_drop = []
    for start_date, end_date in zip(start_indices, end_indices):
        indices_to_drop.extend(list(df.index[
            np.logical_and(start_date < df.index,
                           df.index < end_date)]))
    df.drop(indices_to_drop, axis=0, inplace=True)
    return df


def get_non_consecutive(df, timedelta):
    """
    Get the tuple start_indices, end_indices of all
    non consecutive period in the dataframe index.
    Two timestamps separated with more than timedelta
    are considered non consecutive.
    
    Parameters
    ----------
    df : pandas.DataFrame
        Dataframe with Time series index
    timedelta : pd.Timedelta
        Time delta.
    
    Returns
    -------
    start_dates : array-like
        List of start dates of non consecutive periods
    end_dates : array-like
        List of end dates of non consecutive periods

    """
    where = np.where(
        df.index[1:] - df.index[:-1] > timedelta)[0]
    return df.index[where], df.index[where + 1]

Using these functions, I now get my desired output:

interpolate_consecutive(df, '12H')

                       A
2017-02-01 06:00:00  0.0
2017-02-03 06:00:00  2.0
2017-02-03 18:00:00  2.5
2017-02-04 06:00:00  3.0
2017-02-04 18:00:00  3.5
2017-02-05 06:00:00  4.0
2017-02-05 18:00:00  4.5
2017-02-06 06:00:00  5.0
2017-02-06 18:00:00  5.5
2017-02-07 06:00:00  6.0

tldr, limit should actually not always do forward filling, but check the length of the NaN gap and not fill in anything if this gap is longer than the limit.

Thank you for taking the time to read this, hope I made myself clear.

TomAugspurger · 2017-05-23T12:22:01Z

@naifrec thanks for the detailed example, I think I understand the behavior you're looking for.

limit currently has the clearly defined behavior of "fill at most this many NaNs in a row", which is useful so we can't change that. We'll have to add another keyword to interpolate.

I think we should add an additional option to limit_direction like consecutive (there's probably a better word. Something that describes "all or nothing").

Could you open up a new issue for this (you can just copy your last message)? This issue is focusing on extrapolation, which is orthogonal to what you're describing.

WBare · 2017-05-23T12:56:28Z

Perhaps max_gap meaning it will only interpolate over gaps up to a given size?
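
To make the "all or nothing" gap idea concrete, here is a rough sketch of a workaround. It assumes a pandas version that supports limit_area (0.23 or later), and the helper name interpolate_small_gaps is hypothetical, not an existing pandas API. The max_gap argument is just the name suggested above. It interpolates interior gaps, then blanks out every gap that was longer than max_gap.

```python
import numpy as np
import pandas as pd


def interpolate_small_gaps(s, max_gap):
    """Fill interior NaN runs of length <= max_gap; leave longer runs untouched.

    Hypothetical helper for illustration only.
    """
    is_na = s.isna()
    # Give every run of consecutive NaN / non-NaN values its own id.
    run_id = is_na.astype(int).diff().ne(0).cumsum()
    # Length of the run each element belongs to.
    run_len = run_id.map(run_id.value_counts())
    too_long = is_na & (run_len > max_gap)
    # Interpolate only NaNs that have valid values on both sides,
    # then put NaN back wherever the gap exceeded max_gap.
    filled = s.interpolate(limit_area='inside')
    return filled.mask(too_long)


s = pd.Series([0, np.nan, 2, np.nan, np.nan, np.nan, 6, np.nan])
result = interpolate_small_gaps(s, max_gap=2)
print(result)
```

The gap of one at position 1 is filled, the gap of three at positions 3 to 5 is left alone, and the trailing NaN is never touched because it is not an interior gap.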

WBare · 2017-05-25T12:37:29Z

I'm going to get started on this.

I think we need to move the naifrec idea of limiting "gap size" or "all or none" to another issue.

I did not get any comments on the suggested parameter, so I will use it if everyone is cool with that.

limit_type= ('interpolate' | 'extrapolate' | None=default to both which is current_behavior)

WBare · 2017-05-25T20:53:06Z

@TomAugspurger , I've got this change ready to go but in writing the docs, I realized I may create a little confusion.

Technically we are limiting the .interpolate method to doing either an interpolation or an extrapolation, but since the name of the method is interpolate, it seems weird, from a documentation perspective, to say we can limit the interpolate method to only extrapolate and not interpolate.

It is easy to describe these values as 'inside' (i.e. NaNs surrounded by valid values, which are interpolated) or 'outside' (beyond any existing valid value). How about if we call it this:

limit_range= ('inside' | 'outside' | None=default to both which is current_behavior)

WBare · 2017-05-25T21:08:42Z

Or, since range has meaning, limit_area =('inside' | 'outside') may be even better. That sort of fits with limit_direction since you move in a direction and move in an area.
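
A short sketch of how the limit_area spelling reads in practice, assuming a pandas release that includes the parameter (the milestone on this issue is 0.23.0) and the default method='linear' with limit_direction='forward'. The variable names inside and outside are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [0, 1, np.nan, 3, 4],
    'b': [np.nan, np.nan, np.nan, 3, 4],
    'c': [0, 1, 2, 3, np.nan]})

# 'inside': only fill NaNs that have valid values on both sides.
# This reproduces the Expected Output from the top of the issue.
inside = df.interpolate(limit_area='inside')

# 'outside': only fill NaNs beyond the valid values. With the default
# forward direction, the trailing NaN in 'c' is filled, while the
# interior NaN in 'a' and the leading NaNs in 'b' are left alone.
outside = df.interpolate(limit_area='outside')

print(inside)
print(outside)
```

Because limit_area is independent of limit and limit_direction, it can be combined with them, for example limit_area='inside' together with limit=1.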

jreback · 2018-02-01T13:44:53Z

thanks @WBare, finally got this in!

closes pandas-dev#16284

WBare · 2018-03-07T19:23:06Z

Hi @TomAugspurger and @jreback,

thanks for getting this done. I just logged into GitHub and I saw you two had to take this over the finish line. I apologize for that. I thought we were totally done last year and I have not been back on GitHub since then. I will be more careful to check the status if I submit again in the future.

TomAugspurger added the API Design, Enhancement, and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels May 8, 2017

TomAugspurger added this to the 0.21.0 milestone May 8, 2017

WBare mentioned this issue May 9, 2017

Fix interpolate -limit Add interpolate limit_direction='inside' #16307

Closed

TomAugspurger changed the title ~~Enhancement Request: interpolate limit_direction='inside'~~ Enhancement Request: control extrapolation on .interpolate May 22, 2017

jreback mentioned this issue May 23, 2017

limit keyword for interpolate #1892

Closed

WBare mentioned this issue May 26, 2017

ENH: interpolate.limit_area() 16284 #16513

Closed

4 tasks

jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017

shoyer mentioned this issue Oct 13, 2017

resample().interpolate() should not fill pre-existing NaNs #17868

Open

jreback modified the milestones: Next Major Release, 0.23.0 Jan 21, 2018

jreback closed this as completed in 35812ea Feb 1, 2018

harisbal pushed a commit to harisbal/pandas that referenced this issue Feb 28, 2018

ENH limit_area added to interpolate1d

a11f48d

closes pandas-dev#16284

Marion-Odette-Solis mentioned this issue Nov 5, 2020

P04 - fillna limit IIC2115/Syllabus-2020-2#135

Closed
