DataFrame.interpolate() is not equivalent to scipy.interpolate.interp1d #8796


Open
cgkanchi opened this issue Nov 12, 2014 · 25 comments
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate


@cgkanchi

When pandas is used to interpolate data, the results are not the same as what you get from scipy.interpolate.interp1d.

With simple data, the differences are small (see images). With real-world data, however, they can be large enough to throw off algorithms that depend on the interpolated values.

In the images, notice two things: first, the two methods do not produce the same results; second, pandas omits the last point. Manually adding the last point fixes the simple sin(x) case, but not the lat/lon case.

I've also tried with method/kind='linear' with much the same results.

Tested on pandas 0.14.1
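Part of the "omitted last point" behaviour likely traces to numpy.arange being half-open: the stop value is never in the grid. This matters more in the pandas path, which reindexes onto the grid before interpolating (so the final sample drops out of the fit entirely), whereas the scipy path fits on all the original data and only evaluates on the grid. A quick check:

```python
import numpy

# arange is half-open: the stop value is never included, so an
# interpolation grid built as arange(min, max, step) does not
# contain the final sample point of the original data.
grid = numpy.arange(0.0, 9.0, 0.0005)
assert grid[-1] < 9.0
assert 9.0 not in grid
```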

To replicate, just run the code below:

import numpy
import pandas
from matplotlib import pyplot
import scipy.interpolate

def pandas_interpolate(df, interp_column, method='cubic'):
    df = df.set_index(interp_column)
    df = df.reindex(numpy.arange(df.index.min(), df.index.max(), 0.0005))
    df = df.interpolate(method=method)
    df = df.reset_index()
    df = df.rename(columns={'index': interp_column})
    return df

def scipy_interpolate(df, interp_column, method='cubic'):
    series = {}
    new_x = numpy.arange(df[interp_column].min(), df[interp_column].max(), 0.0005)

    for column in df:
        if column == interp_column: 
            series[column] = new_x
        else:
            interp_f = scipy.interpolate.interp1d(df[interp_column], df[column], kind=method)
            series[column] = interp_f(new_x)

    return pandas.DataFrame(series)


if __name__ == '__main__':
    df = pandas.read_csv('interp_test.csv')
    pd_interp = pandas_interpolate(df, 'distance_km', 'cubic')
    scipy_interp = scipy_interpolate(df, 'distance_km', 'cubic')

    #pyplot.plot(df['lon'], df['lat'], label='raw data')
    pyplot.plot(pd_interp['lon'], pd_interp['lat'], label='pandas')
    pyplot.plot(scipy_interp['lon'], scipy_interp['lat'], label='scipy interp1d')
    pyplot.legend(loc='best')

    pyplot.figure()
    df2 = pandas.DataFrame({'x': numpy.arange(10), 'sin(x)': numpy.sin(numpy.arange(10))})
    pd_interp2 = pandas_interpolate(df2, 'x', 'cubic')
    scipy_interp2 = scipy_interpolate(df2, 'x', 'cubic')
    pyplot.plot(pd_interp2['x'], pd_interp2['sin(x)'], label='pandas')
    pyplot.plot(scipy_interp2['x'], scipy_interp2['sin(x)'], label='scipy interp1d')
    pyplot.legend(loc='best')

    pyplot.show()

The data file interp_test.csv can be found at https://github.com/cgkanchi/pandas_interpolate_bug

pandas_scipy_lon_lat
pandas_scipy_sin
pandas_scipy_sin_zoom

@jorisvandenbossche
Member

cc @TomAugspurger

@cgkanchi
Author

Also note that DataFrame.interpolate() is far faster than my hand-written loop. This may have something to do with the bug, as the majority of the time in my loop is spent re-creating the interpolation function for each Series.

@TomAugspurger
Contributor

Thanks for the report. I'll check it out this weekend.

@jreback jreback added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Nov 14, 2014
@cgkanchi
Author

I got some time to look at this and the bug is definitely in pandas.core.common.interpolate_1d() or one of the functions that calls it. This uses _interpolate_scipy_wrapper() internally, and that function returns exactly equivalent values to scipy.interpolate.interp1d() for both the kind/method = 'linear' and 'cubic' cases.

I'm continuing to dig, but I now suspect that the bug is in the way valid indices are chosen for the interpolation.

@cgkanchi
Author

Further investigation reveals the following:

>>> df = pandas.DataFrame({'x': arange(10), 'y': cos(arange(10))})
>>> df = df.set_index('x')
>>> df = df.reindex(arange(0, 9, 0.01))
>>> df = df.reset_index()
>>> test1 = pandas.core.common.interpolate_1d(df['x'].values, df['y'].values, method='cubic')
>>> test2 = scipy.interpolate.interp1d(arange(10), cos(arange(10)), kind='cubic')(arange(0, 9, 0.01))
>>> all(test1 == test2)
False
>>> len(test1[test1 == test2])
1
>>> # and just to confirm that the inputs are the same
>>> all(df['x'] == arange(0, 9, 0.01))
True
>>>

So the bug is almost definitely in pandas.core.common.interpolate_1d.

EDIT: Sorry. Closed the issue by accident

@cgkanchi cgkanchi reopened this Nov 24, 2014
@cgkanchi
Author

Right. The issue is floating point imprecision when a float index is used.

>>> xs = arange(0, 1, 0.1)
>>> ys = cos(xs)
>>> df = DataFrame({'x': xs, 'y': ys})
>>> df2 = df.set_index('x')
>>> df3 = df2.reindex(arange(0, 1, 0.01))
>>> df3.index[30] == df['x'][3]
False
>>> print df3.index[30], df['x'][3]
0.3 0.3
>>> df3.index[30]
0.29999999999999999
>>> df['x'][3]
0.30000000000000004

In and of itself, this is fine. I'm not the first person to be bitten by floating point imprecision and I won't be the last. The issue might be closed as a PEBKAC error.

However, pandas actively encourages you to use the set_index(), reindex(), interpolate(), rename() pattern. That this may not work for floats is a big drawback (who wants to interpolate ints? ints are boring) and at the very least should be mentioned in the documentation for DataFrame.interpolate() and Series.interpolate().

Alternatively, DataFrame.interpolate() should take a column name to interpolate by, instead of using the index. I'm happy to write such a method and submit a patch, if it has any chance of being accepted.
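The mismatch described above reproduces without any pandas involvement; a minimal sketch of the underlying float behaviour:

```python
import numpy

# Two grids that nominally share the point 0.3: arange accumulates
# rounding error differently depending on the step size.
coarse = numpy.arange(0, 1, 0.1)
fine = numpy.arange(0, 1, 0.01)

# Exact equality fails even though both values print as 0.3 ...
assert coarse[3] != fine[30]
# ... because they differ only in the last couple of bits.
assert abs(coarse[3] - fine[30]) < 1e-15
# numpy.isclose is the right comparison for values like these; an
# exact-match reindex, by contrast, inserts a NaN row here.
assert numpy.isclose(coarse[3], fine[30])
```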

@TomAugspurger
Contributor

@cgkanchi thanks for digging into this.

When I originally implemented this, our CI tests on Travis failed with floating-point precision issues that I couldn't reproduce locally.

Initially I wanted to take a column to interpolate by, but for simplicity I just allowed interpolation on the index. I don't really have an objection to that as an alternative API, but I don't like the idea of

df.interpolate(x='col1', y='col2') != df.set_index('col1')['col2'].interpolate()

If I have a chance I'll see what's going wrong in the construction of the Float64Index.

@cgkanchi
Author

@TomAugspurger

The issue is that what's being passed to the reindex() method is different. The only way we could ensure that

df.interpolate(x='x', new_x=some_iterable) == df.set_index('x').interpolate()

is if we ensure that Float64Index checks for "close" values in the original index and coerces the new index values to those. On the one hand, this seems perfectly reasonable and would prevent people from getting bitten. On the other hand, it may break existing code. The new behaviour could of course be made optional by adding a coerce_index=True or similar argument to the Float64Index constructor/__init__().

Again, I'm quite happy to work on a patch for this, provided you and/or the other maintainers feel that this is the way to go.
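Short of changing Float64Index itself, the coercion described above can be done by the caller: snap the new grid onto the existing index with Index.get_indexer(..., method='nearest', tolerance=...) before reindexing. A sketch; the helper name and the tolerance value are illustrative, not pandas API:

```python
import numpy as np
import pandas as pd

def snap_to_index(new_x, index, tol=1e-9):
    """Replace entries of new_x that lie within tol of an existing
    index value by that exact value, so reindex() matches them."""
    idx = pd.Index(np.asarray(index, dtype=float))
    pos = idx.get_indexer(new_x, method='nearest', tolerance=tol)
    snapped = np.asarray(new_x, dtype=float).copy()
    hit = pos >= 0                      # -1 means "nothing within tol"
    snapped[hit] = idx.values[pos[hit]]
    return snapped

xs = np.arange(0, 1, 0.1)
df = pd.DataFrame({'y': np.cos(xs)}, index=xs)

# The original samples now survive the reindex instead of turning
# into NaN rows next to a nearly-equal grid point.
new_x = snap_to_index(np.arange(0, 0.9, 0.01), df.index)
out = df.reindex(new_x).interpolate(method='index')
```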

@cgkanchi
Author

Any progress on this? Happy to help with the patch if an approach is decided.

@KelSolaar

We were looking at pandas objects as underlying data structures for our colour science API, and one of the features we were evaluating was support for a floating-point index with interpolation, which led me to this issue. I'm wondering the same thing as @cgkanchi: is anything planned to fix the current discrepancies in interpolation results?

@jreback
Contributor

jreback commented Aug 19, 2015

related discussion in #9340

@den-run-ai

Not sure how #9340 is related? Can we keep the two issues independent? I do not see any pull requests here that could conflict.

@stringfellow

Related, I think: scipy and pandas interpolation differ for slinear (in scipy, slinear and linear are equivalent; in pandas they are not): https://gist.github.com/stringfellow/8ae4d3f25ca525e75bb79c01fbda4a24

See comment here http://stackoverflow.com/questions/27698604/what-do-the-different-values-of-the-kind-argument-mean-in-scipy-interpolate-inte/27698894?noredirect=1#comment74381296_27698894

Is this actually a PEBKAC? What is the best (and most computationally efficient) work-around here (other than something like .round(7))?

@TomAugspurger
Contributor

OK, I took another look at this. Going back to the original post, you'll get identical results if you define pandas_interpolate as:

def pandas_interpolate(df, interp_column, method='cubic'):
    df = df.set_index(interp_column)
    # previously it was the next line. Change it to take the union of new and old
    # df = df.reindex(numpy.arange(df.index.min(), df.index.max(), 0.0005))
    at = numpy.arange(df.index.min(), df.index.max(), 0.0005)
    df = df.reindex(df.index | at)
    df = df.interpolate(method=method).loc[at]
    df = df.reset_index()
    df = df.rename(columns={'index': interp_column})
    return df

figure_1
figure_2

The difference came down to passing different values into interp1d, which I think is due to the confusing API. I'll hopefully have a chance to work on #9340 soon.
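A note for later readers: `df.index | at` is the older spelling of Index.union, and newer pandas deprecates `|` as a set operation on Index, so the explicit spelling is safer. A sketch of the same fix with the explicit union (method='cubic' still requires scipy; index-respecting methods such as 'index' also work):

```python
import numpy as np
import pandas as pd

def pandas_interpolate(df, interp_column, method='cubic'):
    df = df.set_index(interp_column)
    at = np.arange(df.index.min(), df.index.max(), 0.0005)
    # Union of the old and new grids: the original sample points
    # survive the reindex, so interpolate() sees the true data, and
    # .loc[at] then keeps only the requested grid.
    df = df.reindex(df.index.union(at))
    df = df.interpolate(method=method).loc[at]
    df = df.reset_index()
    df = df.rename(columns={'index': interp_column})
    return df
```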

@stringfellow

@TomAugspurger thanks for coming back to this. Just wanted to check: in my case I can't see how reindexing helps. I tried it anyway, with the same result: slinear != linear and, critically, slinear introduces very small decimals.

I suppose my issue is slightly different from the OP's; should I open a new issue? (Or possibly I don't understand enough about what's happening under the hood to see what I need to change in my test case.)

@TomAugspurger
Contributor

@stringfellow sorry, meant to follow up on that.

I think that's because of how they treat the index. From the docstring:


    * 'linear': ignore the index and treat the values as equally
      spaced. This is the only method supported on MultiIndexes.
      default

So in general slinear == linear when the index is equally spaced. Otherwise, they'll be unequal.
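The distinction shows up immediately on an unevenly spaced index; a small demonstration using 'index', which respects the spacing like the scipy-backed 'slinear' but needs no scipy:

```python
import numpy as np
import pandas as pd

# Unevenly spaced index: the gap 0 -> 1 is small, 1 -> 10 is large.
s = pd.Series([0.0, np.nan, 10.0], index=[0.0, 1.0, 10.0])

# 'linear' ignores the index and treats points as equally spaced,
# so the NaN lands halfway between its neighbours by position.
assert s.interpolate(method='linear').iloc[1] == 5.0

# 'index' (like the scipy-backed 'slinear') interpolates against the
# index values: x=1 is one tenth of the way from 0 to 10.
assert s.interpolate(method='index').iloc[1] == 1.0
```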

@stringfellow

@TomAugspurger thanks again. I've just gone back to the source of my claim that "slinear and linear are the same in scipy" and made a test; it turns out that was bad information, so sorry for spreading the FUD!

@stringfellow

Sorry, can you link me to that docstring? I can't seem to find it.

@TomAugspurger
Contributor

TomAugspurger commented May 4, 2017 via email

@SPBST

SPBST commented Jun 16, 2018

@TomAugspurger @jreback Hi all, this is a very useful post. I want to do multidimensional interpolation on a DataFrame with the akima method, but I'm facing an issue with limit_direction. Based on CCY_CODE, END_DATE, and STRIKE, I want to interpolate VOLATILITY. I'd appreciate any help.

import pandas as pd
import numpy as np

raw_data = {
    'CCY_CODE': ['SGD','USD','USD','USD','USD','USD','USD','EUR','EUR','EUR','EUR','EUR','EUR','USD'],
    'END_DATE': ['16/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018',
                 '17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018'],
    'STRIKE': [0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05,0.55,0.06,0.065,0.07],
    'VOLATILITY': [np.nan,np.nan,0.3424,np.nan,0.2617,0.2414,np.nan,np.nan,0.215,0.212,0.2103,np.nan,0.2092,np.nan],
}
df_volsurface = pd.DataFrame(raw_data, columns=['CCY_CODE','END_DATE','STRIKE','VOLATILITY'])
df_volsurface['END_DATE'] = pd.to_datetime(df_volsurface['END_DATE'])
df_volsurface.interpolate(method='akima', limit_direction='both')

https://stackoverflow.com/questions/50819549/dateframe-interpolate-not-working-in-panda-multidimensional-interpolation

Also, I'm interested to know how to use scipy.interpolate.Rbf in the same example. Thanks!

@TomAugspurger
Contributor

TomAugspurger commented Jun 20, 2018 via email

@SPBST

SPBST commented Jun 21, 2018

@TomAugspurger, thanks for the reply. Any suggestion on how I should report this issue or get help with it? Thanks!

@TomAugspurger
Contributor

TomAugspurger commented Jun 21, 2018 via email

@martialo12

How can you interpolate only where there are at most 2 consecutive NaN values, and skip runs of more than 2 NaNs?
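One common recipe for this (a sketch, not a pandas built-in): interpolate everything, then re-mask the NaN runs that were longer than the allowed gap. The helper name and the max_gap parameter are illustrative:

```python
import numpy as np
import pandas as pd

def interpolate_small_gaps(s, max_gap=2):
    """Linearly interpolate only NaN runs of length <= max_gap."""
    na = s.isna()
    # Label each run of consecutive equal values of na, then measure
    # how many NaNs each run contains.
    run_id = na.ne(na.shift()).cumsum()
    run_len = na.groupby(run_id).transform('sum')
    filled = s.interpolate(method='linear', limit_area='inside')
    # Put the NaNs back wherever the gap was too long.
    return filled.mask(na & (run_len > max_gap))

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, 8.0])
out = interpolate_small_gaps(s, max_gap=2)
# The 2-NaN gap is filled; the 3-NaN gap stays NaN.
```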

@mroeschke mroeschke added the Bug label Apr 11, 2021
@bbuzz31

bbuzz31 commented Jan 4, 2023

I'm seeing this issue pop up in time series analysis, specifically with irregularly sampled dates. I resample these to daily frequency and then interpolate (I tried linear, time, slinear, and values). I convert the daily frequencies to decimal years (maybe this is where the aforementioned machine precision comes in, but I don't think so...) and use those in interp1d. I can provide data and code if desired.

