
BUG: interpolate method with 'index' and 'slinear' methods give inconsistent results when there are duplicates in x-axis #42585

Open
2 tasks done
scd75 opened this issue Jul 17, 2021 · 5 comments
Labels
Bug; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@scd75

scd75 commented Jul 17, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

# Example 1: x is ascending, it works with both methods (we expect the NaN value to be filled with "-50")
df = pd.DataFrame(
    {
        'x': [-79, -79, 0, 61, 2783],
        'y': [-200, -50, np.nan, -50, 1000],
    }
).set_index(['x'])

print('Example 1')
print(df.head())

df1 = df.interpolate('index')
df2 = df.interpolate('slinear')

print(df1.head())
print(df2.head())


# Example 2: x is descending, it fails with both methods (it seems the point at position 1 is not taken into account)
df = pd.DataFrame(
    {
        'x': [79, 79, 0, -61, -2783],
        'y': [-200, -50, np.nan, -50, 1000],
    }
).set_index(['x'])

print('Example 2')
print(df.head())

df1 = df.interpolate('index')
df2 = df.interpolate('slinear')

print(df1.head())
print(df2.head())


# Example 3: same kind of df as Example 2 (descending x values), but with 19 elements. It works with 'index', not with 'slinear'
df = pd.DataFrame(
    {
        'x': [79, 79, 0, -61, -61, -61, -61, -61, -61, -61, -61, -61, -61, -61, -61, -61, -61, -61, -2783],
        'y': [-200, -50, np.nan, -50, -50, -50, -50, -50, -50, -50, -50, -50, -50, -50, -50, -50, -50, -50, 1000]
    }
).set_index(['x'])

print('Example 3')
print(df.head())

df1 = df.interpolate('index')
df2 = df.interpolate('slinear')

print(df1.head())
print(df2.head())


# Example 4: same as Example 3, with just the elements at positions -2 and -3 removed from both x and y. It fails with both methods
df = pd.DataFrame(
    {
        'x': [79, 79, 0, -61, -61, -61, -61, -61, -61, -61, -61, -61, -61, -61, -61, -61, -2783],
        'y': [-200, -50, np.nan, -50, -50, -50, -50, -50, -50, -50, -50, -50, -50, -50, -50, -50, 1000]
    }
).set_index(['x'])

print('Example 4')
print(df.head())

df1 = df.interpolate('index')
df2 = df.interpolate('slinear')

print(df1.head())
print(df2.head())

Output:

Example 1
            y
x
-79    -200.0
-79     -50.0
 0        NaN
 61     -50.0
 2783  1000.0
            y
x
-79    -200.0
-79     -50.0
 0      -50.0
 61     -50.0
 2783  1000.0
            y
x
-79    -200.0
-79     -50.0
 0      -50.0
 61     -50.0
 2783  1000.0
Example 2
            y
x
 79    -200.0
 79     -50.0
 0        NaN
-61     -50.0
-2783  1000.0
                 y
x
 79    -200.000000
 79     -50.000000
 0     -115.357143
-61     -50.000000
-2783  1000.000000
                 y
x
 79    -200.000000
 79     -50.000000
 0     -115.357143
-61     -50.000000
-2783  1000.000000
Example 3
         y
x
 79 -200.0
 79  -50.0
 0     NaN
-61  -50.0
-61  -50.0
         y
x
 79 -200.0
 79  -50.0
 0   -50.0
-61  -50.0
-61  -50.0
              y
x
 79 -200.000000
 79  -50.000000
 0  -115.357143
-61  -50.000000
-61  -50.000000
Example 4
         y
x
 79 -200.0
 79  -50.0
 0     NaN
-61  -50.0
-61  -50.0
              y
x
 79 -200.000000
 79  -50.000000
 0  -115.357143
-61  -50.000000
-61  -50.000000
              y
x
 79 -200.000000
 79  -50.000000
 0  -115.357143
-61  -50.000000
-61  -50.000000

Problem description

I am trying to use pandas to interpolate "curves" that can have "steps" (e.g. two consecutive points with the same x-coordinate but different y-coordinates, or the other way around). There can be an arbitrary number of NaN values between two points.
I think this is quite a common use case, and it would be cool to have pandas able to perform this kind of "point-to-point" interpolation.
I have been using either the "index" or the "slinear" method to do so. While it seems to work in most cases, there seem to be inconsistencies around:

  • whether the x-axis is sorted in ascending order (it works, see Example 1) or descending order (it fails, see Example 2; more precisely, it seems to ignore the second of the two points that share the duplicated x value)
  • in some cases, for a descending x-axis, 'slinear' and 'index' do not yield the same result (it works with 'index' while failing with 'slinear' in Example 3, and fails with both in Example 4), but I really cannot figure out the root cause, apart from the length of the dataframe. Weird.
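The value -115.357143 reported for Example 2 is consistent with a straight line drawn from (79, -200) to (-61, -50), that is, with the point (79, -50) being skipped. The arithmetic can be checked with the minimal sketch below. `lerp` is a hypothetical helper written for this illustration, not a pandas function, and the sketch only reproduces the number; it does not confirm the root cause.

```python
def lerp(x, x0, y0, x1, y1):
    """Hypothetical helper: linear interpolation between (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) / (x1 - x0) * (y1 - y0)


# Line through the FIRST point at x=79 (y=-200) and the point at x=-61:
# this reproduces the value observed in Example 2.
skipped = lerp(0, 79, -200, -61, -50)

# Line through the SECOND point at x=79 (y=-50) and the point at x=-61:
# this is the value the report expects.
expected = lerp(0, 79, -50, -61, -50)

print(round(skipped, 6))  # -115.357143
print(expected)           # -50.0
```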

Expected Output

I would like all NA values to be linearly interpolated locally, based on the 'previous' and 'next' points (side note: for extrapolation, a 'nearest'-like approach seems to be the most sensible outcome).
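The behaviour requested above can be sketched without `DataFrame.interpolate`. The function below is a hypothetical helper (`interpolate_point_to_point` is not part of pandas) that assumes a numeric index: each NaN is filled from the positionally previous and next valid rows, and leading or trailing gaps take the nearest valid value. It illustrates the expectation and is not a proposed fix.

```python
import numpy as np
import pandas as pd


def interpolate_point_to_point(s: pd.Series) -> pd.Series:
    """Hypothetical helper: fill NaNs from the positionally adjacent valid points."""
    x = s.index.to_numpy(dtype=float)
    y = s.to_numpy(dtype=float)
    out = y.copy()
    valid = np.flatnonzero(~np.isnan(y))  # positions holding a value
    if valid.size == 0:
        return s.copy()
    for i in np.flatnonzero(np.isnan(y)):
        pos = np.searchsorted(valid, i)  # count of valid positions before i
        if pos == 0:
            out[i] = y[valid[0]]   # leading gap: nearest valid value
        elif pos == valid.size:
            out[i] = y[valid[-1]]  # trailing gap: nearest valid value
        else:
            lo, hi = valid[pos - 1], valid[pos]
            if x[hi] == x[lo]:
                out[i] = y[lo]     # vertical step: avoid dividing by zero
            else:
                t = (x[i] - x[lo]) / (x[hi] - x[lo])
                out[i] = y[lo] + t * (y[hi] - y[lo])
    return pd.Series(out, index=s.index, name=s.name)


# Data from Example 2 (descending x with a duplicated first value).
df = pd.DataFrame(
    {"x": [79, 79, 0, -61, -2783], "y": [-200, -50, np.nan, -50, 1000]}
).set_index("x")
filled = interpolate_point_to_point(df["y"])
print(filled)
```

Because the neighbours are chosen by row position rather than by sorted x value, the result does not depend on whether the index is ascending or descending, and the row at x=0 is filled with -50.0.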

Output of pd.show_versions()

commit : f00ed8f
python : 3.7.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : None.None

pandas : 1.3.0
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

scd75 added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Jul 17, 2021
@rhshadrach
Member

Thanks for the report! For the expected output, can you provide the expected values for your examples?

rhshadrach added the Numeric Operations (arithmetic, comparison, and logical operations) label Jul 18, 2021
@scd75
Author

scd75 commented Jul 18, 2021

Hi,

In all examples above, the expected interpolated value is "-50":

Example 1
            y
x
-79    -200.0
-79     -50.0
 0      -50.0
 61     -50.0
 2783  1000.0

Example 2
            y
x
 79    -200.0
 79     -50.0
 0      -50.0
-61     -50.0
-2783  1000.0

Example 3
         y
x
 79 -200.0
 79  -50.0
 0   -50.0
-61  -50.0
-61  -50.0

Example 4
         y
x
 79 -200.0
 79  -50.0
 0   -50.0
-61  -50.0
-61  -50.0


@scd75
Author

scd75 commented Jul 27, 2021

Hi,
Could you find an explanation for this behaviour? Actually, I have also found other examples where the "index" method does not behave as expected while "slinear" does, but I am not able to produce a simple example here.
All in all, it seems that "slinear" gives the most consistent results when applied to a df with an x-axis sorted in ascending order.

Please let me know if you need something else,

Thanks!

@bkurtz

bkurtz commented Aug 14, 2021

Ran into this separately today with method="time" as well. Think I've found approximately the root cause and opened a PR, though looking through this I'm realizing that it's not clear my fix will address the issue when the index is in decreasing order.

@bkurtz

bkurtz commented Aug 14, 2021

If I'm correct about the root cause, it would've been introduced by #29943

simonjayhawkins removed the Needs Triage (issue that has not been reviewed by a pandas team member) label Aug 16, 2021
mroeschke added the Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) label and removed the Numeric Operations (arithmetic, comparison, and logical operations) label Aug 21, 2021
5 participants