DataFrame.interpolate() is not equivalent to scipy.interpolate.interp1d #8796


Open
cgkanchi opened this issue Nov 12, 2014 · 25 comments
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate


@cgkanchi

When pandas is used to interpolate data, the results are not the same as what you get from scipy.interpolate.interp1d.

With simple data, the differences are small (see images). With real-world data, however, they can be large enough to throw off algorithms that depend on the interpolated values.

In the images, notice two things: first, the two methods do not produce the same results; second, pandas omits the last point. Manually adding the last point fixes the simple sin(x) case, but not the lat/lon case.

I've also tried with method/kind='linear' with much the same results.

Tested on pandas 0.14.1
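Part of the "omitted last point" behaviour likely traces to numpy.arange being half-open: the stop value is never in the grid. This matters more in the pandas path, which reindexes onto the grid before interpolating (so the final sample drops out of the fit entirely), whereas the scipy path fits on all the original data and only evaluates on the grid. A quick check:

```python
import numpy

# arange is half-open: the stop value is never included, so an
# interpolation grid built as arange(min, max, step) does not
# contain the final sample point of the original data.
grid = numpy.arange(0.0, 9.0, 0.0005)
assert grid[-1] < 9.0
assert 9.0 not in grid
```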

To replicate, just run the code below:

import numpy
import pandas
from matplotlib import pyplot
import scipy.interpolate

def pandas_interpolate(df, interp_column, method='cubic'):
    df = df.set_index(interp_column)
    df = df.reindex(numpy.arange(df.index.min(), df.index.max(), 0.0005))
    df = df.interpolate(method=method)
    df = df.reset_index()
    df = df.rename(columns={'index': interp_column})
    return df

def scipy_interpolate(df, interp_column, method='cubic'):
    series = {}
    new_x = numpy.arange(df[interp_column].min(), df[interp_column].max(), 0.0005)

    for column in df:
        if column == interp_column: 
            series[column] = new_x
        else:
            interp_f = scipy.interpolate.interp1d(df[interp_column], df[column], kind=method)
            series[column] = interp_f(new_x)

    return pandas.DataFrame(series)


if __name__ == '__main__':
    df = pandas.read_csv('interp_test.csv')
    pd_interp = pandas_interpolate(df, 'distance_km', 'cubic')
    scipy_interp = scipy_interpolate(df, 'distance_km', 'cubic')

    #pyplot.plot(df['lon'], df['lat'], label='raw data')
    pyplot.plot(pd_interp['lon'], pd_interp['lat'], label='pandas')
    pyplot.plot(scipy_interp['lon'], scipy_interp['lat'], label='scipy interp1d')
    pyplot.legend(loc='best')

    pyplot.figure()
    df2 = pandas.DataFrame({'x': numpy.arange(10), 'sin(x)': numpy.sin(numpy.arange(10))})
    pd_interp2 = pandas_interpolate(df2, 'x', 'cubic')
    scipy_interp2 = scipy_interpolate(df2, 'x', 'cubic')
    pyplot.plot(pd_interp2['x'], pd_interp2['sin(x)'], label='pandas')
    pyplot.plot(scipy_interp2['x'], scipy_interp2['sin(x)'], label='scipy interp1d')
    pyplot.legend(loc='best')

    pyplot.show()

The data file interp_test.csv can be found at https://github.com/cgkanchi/pandas_interpolate_bug

pandas_scipy_lon_lat
pandas_scipy_sin
pandas_scipy_sin_zoom

@jorisvandenbossche
Member

cc @TomAugspurger

@cgkanchi
Author

Also note that DataFrame.interpolate() is far faster than my hand-written loop. This may have something to do with the bug, as the majority of the time in my loop is spent re-creating the interpolation function for each Series.

@TomAugspurger
Contributor

Thanks for the report. I'll check it out this weekend.

@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Nov 14, 2014
@cgkanchi
Author

I got some time to look at this and the bug is definitely in pandas.core.common.interpolate_1d() or one of the functions that calls it. This uses _interpolate_scipy_wrapper() internally, and that function returns exactly equivalent values to scipy.interpolate.interp1d() for both the kind/method = 'linear' and 'cubic' cases.

I'm continuing to dig, but I now suspect that the bug is in the way valid indices are chosen for the interpolation.

@cgkanchi
Author

Further investigation reveals the following:

>>> df = pandas.DataFrame({'x': arange(10), 'y': cos(arange(10))})
>>> df = df.set_index('x')
>>> df = df.reindex(arange(0, 9, 0.01))
>>> df = df.reset_index()
>>> test1 = pandas.core.common.interpolate_1d(df['x'].values, df['y'].values, method='cubic')
>>> test2 = scipy.interpolate.interp1d(arange(10), cos(arange(10)), kind='cubic')(arange(0, 9, 0.01))
>>> all(test1 == test2)
False
>>> len(test1[test1 == test2])
1
>>> # and just to confirm that the inputs are the same
>>> all(df['x'] == arange(0, 9, 0.01))
True
>>>

So the bug is almost definitely in pandas.core.common.interpolate_1d.

EDIT: Sorry. Closed the issue by accident

@cgkanchi cgkanchi reopened this Nov 24, 2014
@cgkanchi
Author

Right. The issue is floating point imprecision when a float index is used.

>>> xs = arange(0, 1, 0.1)
>>> ys = cos(xs)
>>> df = DataFrame({'x': xs, 'y': ys})
>>> df2 = df.set_index('x')
>>> df3 = df2.reindex(arange(0, 1, 0.01))
>>> df3.index[30] == df['x'][3]
False
>>> print df3.index[30], df['x'][3]
0.3 0.3
>>> df3.index[30]
0.29999999999999999
>>> df['x'][3]
0.30000000000000004

In and of itself, this is fine. I'm not the first person to be bitten by floating point imprecision and I won't be the last. The issue might be closed as a PEBKAC error.

However, pandas actively encourages you to use the set_index(), reindex(), interpolate(), rename() pattern. That this may not work for floats is a big drawback (who wants to interpolate ints? ints are boring) and at the very least should be mentioned in the documentation for DataFrame.interpolate() and Series.interpolate().

Alternatively, DataFrame.interpolate() should take a column name to interpolate by, instead of using the index. I'm happy to write such a method and submit a patch, if it has any chance of being accepted.
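The mismatch described above reproduces without any pandas involvement; a minimal sketch of the underlying float behaviour:

```python
import numpy

# Two grids that nominally share the point 0.3: arange accumulates
# rounding error differently depending on the step size.
coarse = numpy.arange(0, 1, 0.1)
fine = numpy.arange(0, 1, 0.01)

# Exact equality fails even though both values print as 0.3 ...
assert coarse[3] != fine[30]
# ... because they differ only in the last couple of bits.
assert abs(coarse[3] - fine[30]) < 1e-15
# numpy.isclose is the right comparison for values like these; an
# exact-match reindex, by contrast, inserts a NaN row here.
assert numpy.isclose(coarse[3], fine[30])
```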

@TomAugspurger
Contributor

@cgkanchi thanks for digging into this.

When I originally implemented this, our CI tests on Travis failed with floating-point precision issues that I couldn't reproduce locally.

Initially I wanted to take a column to interpolate by, but for simplicity I just allowed interpolation on the index. I don't really have an objection to that as an alternative API, but I don't like the idea of

df.interpolate(x='col1', y='col2') != df.set_index('col1')['col2'].interpolate()

If I have a chance I'll see what's going wrong in the construction of the Float64Index.

@cgkanchi
Author

@TomAugspurger

The issue is that what's being passed to the reindex() method is different. The only way we could ensure that

df.interpolate(x='x', new_x=some_iterable) == df.set_index('x').interpolate()

is if we ensure that Float64Index checks for "close" values in the original index and coerces the new index values to those. On the one hand, this seems perfectly reasonable and would prevent people from getting bitten. On the other hand, it may break existing code. The new behaviour could of course be made optional by adding a coerce_index=True or similar argument to the Float64Index constructor/__init__().

Again, I'm quite happy to work on a patch for this, provided you and/or the other maintainers feel that this is the way to go.
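Short of changing Float64Index itself, the coercion described above can be done by the caller: snap the new grid onto the existing index with Index.get_indexer(..., method='nearest', tolerance=...) before reindexing. A sketch; the helper name and the tolerance value are illustrative, not pandas API:

```python
import numpy as np
import pandas as pd

def snap_to_index(new_x, index, tol=1e-9):
    """Replace entries of new_x that lie within tol of an existing
    index value by that exact value, so reindex() matches them."""
    idx = pd.Index(np.asarray(index, dtype=float))
    pos = idx.get_indexer(new_x, method='nearest', tolerance=tol)
    snapped = np.asarray(new_x, dtype=float).copy()
    hit = pos >= 0                      # -1 means "nothing within tol"
    snapped[hit] = idx.values[pos[hit]]
    return snapped

xs = np.arange(0, 1, 0.1)
df = pd.DataFrame({'y': np.cos(xs)}, index=xs)

# The original samples now survive the reindex instead of turning
# into NaN rows next to a nearly-equal grid point.
new_x = snap_to_index(np.arange(0, 0.9, 0.01), df.index)
out = df.reindex(new_x).interpolate(method='index')
```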

@cgkanchi
Author

Any progress on this? Happy to help with the patch if an approach is decided.

@KelSolaar

We were looking at pandas objects as underlying data structures for our colour science API, and one of the features we were evaluating was support for a floating-point index with interpolation, which led me to this issue. I'm wondering the same thing as @cgkanchi: is anything planned to fix the current discrepancies in interpolation results?

@jreback
Contributor

jreback commented Aug 19, 2015

related discussion in #9340

@den-run-ai

Not sure how #9340 is related? Can we keep the two issues independent? I do not see any pull requests here that could conflict.

@stringfellow

Related, I think: scipy and pandas interpolation differ for slinear (in scipy, slinear and linear are equivalent; in pandas they are not): https://gist.github.com/stringfellow/8ae4d3f25ca525e75bb79c01fbda4a24

See comment here http://stackoverflow.com/questions/27698604/what-do-the-different-values-of-the-kind-argument-mean-in-scipy-interpolate-inte/27698894?noredirect=1#comment74381296_27698894

Is this actually a PEBKAC? What is the best (and most computationally efficient) work-around here (other than something like .round(7))?

@TomAugspurger
Contributor

OK, I took another look at this. Going back to the original post, you'll get identical results if you define pandas_interpolate as:

def pandas_interpolate(df, interp_column, method='cubic'):
    df = df.set_index(interp_column)
    # previously it was the next line. Change it to take the union of new and old
    # df = df.reindex(numpy.arange(df.index.min(), df.index.max(), 0.0005))
    at = numpy.arange(df.index.min(), df.index.max(), 0.0005)
    df = df.reindex(df.index | at)
    df = df.interpolate(method=method).loc[at]
    df = df.reset_index()
    df = df.rename(columns={'index': interp_column})
    return df

figure_1
figure_2

The difference came down to passing different values into interp1d, which I think is due to the confusing API. I'll hopefully have a chance to work on #9340 soon.
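A note for later readers: `df.index | at` is the older spelling of Index.union, and newer pandas deprecates `|` as a set operation on Index, so the explicit spelling is safer. A sketch of the same fix with the explicit union (method='cubic' still requires scipy; index-respecting methods such as 'index' also work):

```python
import numpy as np
import pandas as pd

def pandas_interpolate(df, interp_column, method='cubic'):
    df = df.set_index(interp_column)
    at = np.arange(df.index.min(), df.index.max(), 0.0005)
    # Union of the old and new grids: the original sample points
    # survive the reindex, so interpolate() sees the true data, and
    # .loc[at] then keeps only the requested grid.
    df = df.reindex(df.index.union(at))
    df = df.interpolate(method=method).loc[at]
    df = df.reset_index()
    df = df.rename(columns={'index': interp_column})
    return df
```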

@stringfellow

@TomAugspurger thanks for coming back to this. Just wanted to check: in my case I can't see how reindexing helps. I tried it anyway, with the same result: slinear != linear and, critically, slinear introduces very small decimals.

I suppose my issue is slightly different from the OP's; should I open a new issue? (Or possibly I don't understand enough about what's happening under the hood to see what I need to change in my test case.)

@TomAugspurger
Contributor

@stringfellow sorry, meant to follow up on that.

I think that's because of how they treat the index. From the docstring:


    * 'linear': ignore the index and treat the values as equally
      spaced. This is the only method supported on MultiIndexes.
      default

So in general slinear == linear when the index is equally spaced. Otherwise, they'll be unequal.
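The distinction shows up immediately on an unevenly spaced index; a small demonstration using 'index', which respects the spacing like the scipy-backed 'slinear' but needs no scipy:

```python
import numpy as np
import pandas as pd

# Unevenly spaced index: the gap 0 -> 1 is small, 1 -> 10 is large.
s = pd.Series([0.0, np.nan, 10.0], index=[0.0, 1.0, 10.0])

# 'linear' ignores the index and treats points as equally spaced,
# so the NaN lands halfway between its neighbours by position.
assert s.interpolate(method='linear').iloc[1] == 5.0

# 'index' (like the scipy-backed 'slinear') interpolates against the
# index values: x=1 is one tenth of the way from 0 to 10.
assert s.interpolate(method='index').iloc[1] == 1.0
```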

@stringfellow

@TomAugspurger thanks again. I've just gone back to the source of my claim that "slinear and linear are the same in scipy" and made a test; it turns out that was bad information, so sorry for spreading the FUD!

@stringfellow

Sorry, can you link me to that docstring? I can't seem to find it.

@TomAugspurger
Contributor

TomAugspurger commented May 4, 2017 via email

@SPBST

SPBST commented Jun 16, 2018

@TomAugspurger @jreback Hi all, this is a very useful post. I want to do multidimensional interpolation on a DataFrame with the akima method, but I'm facing an issue with limit_direction. Based on CCY_CODE, END_DATE, and STRIKE, I want to interpolate VOLATILITY. I'd appreciate any help.

import pandas as pd
import numpy as np

raw_data = {
    'CCY_CODE': ['SGD','USD','USD','USD','USD','USD','USD','EUR','EUR','EUR','EUR','EUR','EUR','USD'],
    'END_DATE': ['16/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018',
                 '17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018'],
    'STRIKE': [0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05,0.55,0.06,0.065,0.07],
    'VOLATILITY': [np.nan,np.nan,0.3424,np.nan,0.2617,0.2414,np.nan,np.nan,0.215,0.212,0.2103,np.nan,0.2092,np.nan],
}
df_volsurface = pd.DataFrame(raw_data, columns=['CCY_CODE','END_DATE','STRIKE','VOLATILITY'])
df_volsurface['END_DATE'] = pd.to_datetime(df_volsurface['END_DATE'])
df_volsurface.interpolate(method='akima', limit_direction='both')

https://stackoverflow.com/questions/50819549/dateframe-interpolate-not-working-in-panda-multidimensional-interpolation

Also, I'm interested to know how to use scipy.interpolate.Rbf in the same example. Thanks!

@TomAugspurger
Contributor

TomAugspurger commented Jun 20, 2018 via email

@SPBST

SPBST commented Jun 21, 2018

@TomAugspurger, thanks for the reply. Any suggestion on how I should report this issue or get help with it? Thanks!

@TomAugspurger
Contributor

TomAugspurger commented Jun 21, 2018 via email

@martialo12

How can you interpolate only where there are at most 2 consecutive NaN values, and skip runs of more than 2 NaNs?
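One common recipe for this (a sketch, not a pandas built-in): interpolate everything, then re-mask the NaN runs that were longer than the allowed gap. The helper name and the max_gap parameter are illustrative:

```python
import numpy as np
import pandas as pd

def interpolate_small_gaps(s, max_gap=2):
    """Linearly interpolate only NaN runs of length <= max_gap."""
    na = s.isna()
    # Label each run of consecutive equal values of na, then measure
    # how many NaNs each run contains.
    run_id = na.ne(na.shift()).cumsum()
    run_len = na.groupby(run_id).transform('sum')
    filled = s.interpolate(method='linear', limit_area='inside')
    # Put the NaNs back wherever the gap was too long.
    return filled.mask(na & (run_len > max_gap))

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, 8.0])
out = interpolate_small_gaps(s, max_gap=2)
# The 2-NaN gap is filled; the 3-NaN gap stays NaN.
```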

@mroeschke mroeschke added the Bug label Apr 11, 2021
@bbuzz31

bbuzz31 commented Jan 4, 2023

I'm seeing this issue pop up in time series analysis, specifically with irregularly sampled dates. I resample these to daily frequency and then interpolate (I tried linear, time, slinear, and values). I convert the daily frequencies to decimal years (maybe this is where the aforementioned machine precision comes in, but I don't think so...) and use those in interp1d. I can provide data and code if desired.

