DataFrame.interpolate() is not equivalent to scipy.interpolate.interp1d #8796
Comments
Also note that DataFrame.interpolate() is way faster than my hand-written loop. This may have something to do with the bug, as the majority of time in my loop is spent re-creating the interpolation function for each Series.
Thanks for the report. I'll check it out this weekend.
I got some time to look at this, and the bug is definitely on the pandas side. I'm continuing to dig, but I now suspect that the bug is in the way valid indices are chosen for the interpolation.
Further investigation suggests that the bug is almost certainly in how the new float index values are matched against the original index. EDIT: Sorry, closed the issue by accident.
Right. The issue is floating point imprecision when a float index is used.
In and of itself, this is fine. I'm not the first person to be bitten by floating point imprecision and I won't be the last, and the issue could reasonably be closed as a PEBKAC error. However, pandas actively encourages you to interpolate against the index, which makes this very easy to run into. Alternatively, interpolate() could accept a column to interpolate against instead of relying on the index.
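To make the failure mode concrete, here is a toy illustration (my own values, not from the report): an arange-built float index doesn't necessarily reproduce the original labels bit-for-bit, so reindex silently drops "matching" points before interpolation ever runs.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=[0.1, 0.2, 0.3])

# 0.1 + 2*0.1 == 0.30000000000000004, which is a different label than 0.3,
# so the last original point becomes NaN after the reindex.
new_index = np.arange(0.1, 0.35, 0.1)
print(s.reindex(new_index))
```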
@cgkanchi thanks for digging into this. When I originally implemented this, our CI tests on Travis failed with floating-point precision issues that I couldn't reproduce locally. Initially I wanted to take a column to interpolate by, but for simplicity I just allowed interpolation on the index. I don't really have an objection to that as an alternative API, but I don't like the idea of …
If I have a chance I'll see what's going wrong in the construction of the Float64 index.
The issue is that what's being passed to the underlying interpolation routine isn't quite the same in the two cases. One way to fix this is to ensure that Float64Index checks for "close" values in the original index and coerces the new index values to those. On the one hand, this seems perfectly reasonable and would prevent people getting bitten. On the other hand, it may break existing code. The new behaviour could of course be made optional via a keyword argument. Again, I'm quite happy to work on a patch for this, provided you and/or the other maintainers feel that this is the way to go.
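For what it's worth, a rough sketch of what that "snap close values to existing labels" idea could look like, written against the public Index API rather than as an actual Float64Index patch (get_indexer with method='nearest' needs a sorted index, and the tolerance here is a made-up default):

```python
import numpy as np
import pandas as pd

def snap_to_existing(old_index, new_values, tol=1e-9):
    """Return new_values with anything within tol of an existing label
    replaced by that label, so reindexing keeps the original points."""
    old = pd.Index(old_index)
    pos = old.get_indexer(new_values, method='nearest', tolerance=tol)
    snapped = np.asarray(new_values, dtype=float).copy()
    hit = pos != -1                       # -1 means nothing within tolerance
    snapped[hit] = old.values[pos[hit]]
    return pd.Index(snapped)
```

Reindexing onto snap_to_existing(df.index, at) (or its union with df.index) then keeps the original data points wherever the new grid passes within tol of them.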
Any progress on this? Happy to help with the patch if an approach is decided.
We were taking a look at pandas objects as underlying data structures for our colour science API, and one of the features we were looking at was support for a floating-point index along with data interpolation, which led me to this issue. I'm wondering the same thing as @cgkanchi: is there anything planned to fix the current discrepancies in interpolation results?
related discussion in #9340
not sure how #9340 is related? can we keep the 2 issues independent? i do not see any pull requests here that could conflict.
Related I think - SciPy and pandas interpolation give different results in my case as well. Is this actually a PEBKAC? What is the best (and most computationally efficient) work-around here (other than something like calling scipy by hand)?
Ok, I took another look at this. Going back to the original post, you'll get identical results if you make the new index include the original index values, e.g.:

```python
import numpy

def pandas_interpolate(df, interp_column, method='cubic'):
    df = df.set_index(interp_column)
    # previously this was df.reindex(numpy.arange(df.index.min(), df.index.max(), 0.0005));
    # change it to take the union of the new and old index values
    at = numpy.arange(df.index.min(), df.index.max(), 0.0005)
    df = df.reindex(df.index | at)  # index union; df.index.union(at) on newer pandas
    df = df.interpolate(method=method).loc[at]
    df = df.reset_index()
    df = df.rename(columns={'index': interp_column})
    return df
```

The difference came down to passing different values into the underlying scipy interpolator: with the union, the original points are retained, so pandas and scipy see the same data, and .loc[at] then selects the requested grid.
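For reference, the scipy-side helper from the original post isn't quoted in this extract; a reconstruction along the lines the OP describes (column-by-column interp1d on the same grid) would look roughly like:

```python
import numpy
import pandas
from scipy.interpolate import interp1d

def scipy_interpolate(df, interp_column, method='cubic'):
    # build the same evaluation grid as the pandas version
    new_x = numpy.arange(df[interp_column].min(), df[interp_column].max(), 0.0005)
    out = {interp_column: new_x}
    for column in df.columns:
        if column == interp_column:
            continue
        # one interp1d per remaining column, evaluated on the shared grid
        f = interp1d(df[interp_column], df[column], kind=method)
        out[column] = f(new_x)
    return pandas.DataFrame(out)
```

Applied to the same frame as the union-based pandas_interpolate above, the two should now agree to within floating-point noise on the shared grid.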
@TomAugspurger thanks for coming back to this - just wanted to check: in my case I can't see how reindexing helps (I tried anyway - same result). I suppose my issue is slightly different to that of the OP - should I open a new issue? (Or possibly I don't understand enough about what's happening under the hood to see what I need to change in my test case.)
@stringfellow sorry, meant to follow up on that. I think that's because of how the different methods treat the index. From the docstring, the default 'linear' method ignores the index and treats the values as equally spaced, whereas 'index'/'values' uses the actual numerical values of the index. So in general you only get interp1d-like behaviour from the methods that actually look at the index.
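A toy example of that difference (my own, not from the thread): the default 'linear' method ignores the index spacing entirely, while 'index'/'values' interpolates against the actual index, which is closer to what interp1d does when you hand it the index as x.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, np.nan, 10.0], index=[0.0, 1.0, 10.0])

print(s.interpolate(method='linear'))  # midpoint -> 5.0, index spacing ignored
print(s.interpolate(method='index'))   # x=1 between (0, 0) and (10, 10) -> 1.0
```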
@TomAugspurger thanks again - I've just gone back to the source of my info on that claim.
Sorry - can you link me to that docstr? Can't seem to find it.
@TomAugspurger, @jreback Hi all, this is a very useful post. I want to do multi-dimensional interpolation on a DataFrame with the 'akima' method, but I'm facing an issue with limit_direction. Based on CCY_CODE, END_DATE and STRIKE I want to interpolate VOLATILITY - appreciate any help you can give.

```python
import pandas as pd
import numpy as np

raw_data = {'CCY_CODE': ['SGD','USD','USD','USD','USD','USD','USD','EUR','EUR','EUR','EUR','EUR','EUR','USD'],
            'END_DATE': ['16/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018',
                         '17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018','17/03/2018'],
            'STRIKE': [0.005,0.01,0.015,0.02,0.025,0.03,0.035,0.04,0.045,0.05,0.55,0.06,0.065,0.07],
            'VOLATILITY': [np.nan,np.nan,0.3424,np.nan,0.2617,0.2414,np.nan,np.nan,0.215,0.212,0.2103,np.nan,0.2092,np.nan]}
df_volsurface = pd.DataFrame(raw_data, columns=['CCY_CODE','END_DATE','STRIKE','VOLATILITY'])
df_volsurface['END_DATE'] = pd.to_datetime(df_volsurface['END_DATE'])
df_volsurface.interpolate(method='akima', limit_direction='both')
```

https://stackoverflow.com/questions/50819549/dateframe-interpolate-not-working-in-panda-multidimensional-interpolation

Also, I'm interested to know how to use scipy.interpolate.Rbf in the same example. Thanks!

@SPBST it doesn't look like that's related to this issue.
@TomAugspurger, thanks for the reply. Any suggestion on how I should report this issue or where I can get help with it? Thanks!
For using Rbf? You could make a new issue. If it has the same signature as the others, it should be straightforward. Otherwise, we recommend Stack Overflow for usage questions and GitHub issues for bugs.
How can you interpolate only where a gap is 2 NaN values, and leave gaps of more than 2 NaN values un-interpolated?
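For what it's worth, a sketch of one way to do that with made-up data (interpolate(limit=2) alone won't do it, since it partially fills longer gaps rather than skipping them): measure the length of each consecutive-NaN run, interpolate everything, and only accept the filled values where the gap is at most 2.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, np.nan, np.nan, 8.0, 9.0])

is_na = s.isna()
gap_id = (~is_na).cumsum()                        # constant within each NaN run
gap_len = is_na.groupby(gap_id).transform('sum')  # length of the gap each point sits in

filled = s.interpolate()
# accept interpolated values only in gaps of at most 2 NaNs; longer gaps stay NaN
result = s.mask(is_na & (gap_len <= 2), filled)
print(result)
```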
I'm seeing this issue pop up in time series analysis, specifically with irregularly sampled dates. I resample these to daily frequency and then interpolate (I've tried 'linear', 'time', 'slinear' and 'values'). I then convert the daily dates to decimal years (maybe that's where the aforementioned machine precision comes in, but I don't think so...) and use those in interp1d. I can provide data and code if desired.
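A rough sketch of that workflow as I read it (toy data; the decimal-year conversion is a naive 365.25-day approximation), in case it helps pin down where the discrepancy enters:

```python
import pandas as pd
from scipy.interpolate import interp1d

# Irregularly sampled observations, resampled to daily frequency.
obs = pd.Series([1.0, 2.5, 2.0],
                index=pd.to_datetime(['2020-01-01', '2020-01-10', '2020-02-01']))
daily = obs.resample('D').asfreq()

# pandas route: interpolate against the datetime index.
pandas_daily = daily.interpolate(method='time')

# scipy route: convert dates to decimal years and feed them to interp1d.
def to_years(idx):
    return idx.year + idx.dayofyear / 365.25

scipy_daily = interp1d(to_years(obs.index), obs.values)(to_years(daily.index))
```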
When pandas is used to interpolate data, the results are not the same as what you get from scipy.interpolate.interp1d.
With simple data, the differences are small (see images). However, with real-world data, the differences can be large enough to throw off algorithms that depend on the interpolated values.
In the images, notice two things: first, that the results are not the same between the two methods, and second, that pandas omits the last point. Manually adding the last point fixes the simple sin(x) case, but not the lat/lon case.
I've also tried method/kind='linear', with much the same results.
Tested on pandas 0.14.1
To replicate, just run the code below:
The data file interp_test.csv can be found at https://github.com/cgkanchi/pandas_interpolate_bug
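The OP's script isn't reproduced in this extract; a sketch of the kind of comparison being described (the column names 'lat'/'lon' are placeholders guessed from the discussion, and the interpolation column is assumed to be strictly increasing) would be roughly:

```python
import numpy
import pandas
from scipy.interpolate import interp1d

df = pandas.read_csv('interp_test.csv')  # data file from the repo linked above

# pandas route: float index built with numpy.arange, then DataFrame.interpolate
pdf = df.set_index('lat')
pdf = pdf.reindex(numpy.arange(pdf.index.min(), pdf.index.max(), 0.0005))
pdf = pdf.interpolate(method='cubic')

# scipy route: interp1d evaluated on the same grid
new_lat = numpy.arange(df['lat'].min(), df['lat'].max(), 0.0005)
sdf = pandas.DataFrame({'lat': new_lat,
                        'lon': interp1d(df['lat'], df['lon'], kind='cubic')(new_lat)})

# the two frames should match, but (as reported above) they don't
```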