resample interpolate gives unexpected results in 0.18.1 #14297

dershow · 2016-09-25T13:03:43Z

A small, complete example of the issue

import pandas as pd

a = pd.Series([1., 3., 4., 3., 5., 6., 7., 8.],
              ['2016-05-25 00:00:35', '2016-05-25 00:00:50', '2016-05-25 00:01:05',
               '2016-05-25 00:01:35', '2016-05-25 00:02:05', '2016-05-25 00:03:00',
               '2016-05-25 00:04:00', '2016-05-25 00:06:00'])

In [79]: a
Out[79]: 
2016-05-25 00:00:35    1.0
2016-05-25 00:00:50    3.0
2016-05-25 00:01:05    4.0
2016-05-25 00:01:35    3.0
2016-05-25 00:02:05    5.0
2016-05-25 00:03:00    6.0
2016-05-25 00:04:00    7.0
2016-05-25 00:06:00    8.0
dtype: float64

In [80]: a.index = pd.to_datetime(a.index)

In [81]: a.resample('15S', base=5).interpolate()

Expected Output

I expect to get valid interpolated values based on the input at 2:05 and later. Instead, it appears that the data after 2:05 is ignored.

Output of `pd.show_versions()`

In [146]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.2
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.4.1
html5lib: 1.0b3
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Out[81]:
2016-05-25 00:00:35 1.0
2016-05-25 00:00:50 3.0
2016-05-25 00:01:05 4.0
2016-05-25 00:01:20 3.5
2016-05-25 00:01:35 3.0
2016-05-25 00:01:50 4.0
2016-05-25 00:02:05 5.0
2016-05-25 00:02:20 5.0
2016-05-25 00:02:35 5.0
2016-05-25 00:02:50 5.0
2016-05-25 00:03:05 5.0
2016-05-25 00:03:20 5.0
2016-05-25 00:03:35 5.0
2016-05-25 00:03:50 5.0
2016-05-25 00:04:05 5.0
2016-05-25 00:04:20 5.0
2016-05-25 00:04:35 5.0
2016-05-25 00:04:50 5.0
2016-05-25 00:05:05 5.0
2016-05-25 00:05:20 5.0
2016-05-25 00:05:35 5.0
2016-05-25 00:05:50 5.0
Freq: 15S, dtype: float64

I was told that this error does not show up in 0.18.0, but I have not confirmed that.

This comes from my attempts to interpolate some irregular data as shown in this question: https://stackoverflow.com/questions/39599192/fill-in-time-data-in-pandas

chris-b1 · 2016-09-25T13:51:00Z

This isn't exactly a regression from version 0.18.0 - in that version interpolate didn't exist as a resample method, so what you were actually getting was:

a.resample('15s', base=5).mean().interpolate()

In 0.18.1 interpolate was added as a resample method, but it does an asfreq upsample before interpolating. That upsample only keeps exact matches (times that fall exactly on a bin edge), so some of the data isn't used. There's definitely a case for exposing something like the old behavior.

a.resample('15s', base=5).asfreq()
Out[52]: 
2016-05-25 00:00:35    1.0
2016-05-25 00:00:50    3.0
2016-05-25 00:01:05    4.0
2016-05-25 00:01:20    NaN
2016-05-25 00:01:35    3.0
2016-05-25 00:01:50    NaN
2016-05-25 00:02:05    5.0
2016-05-25 00:02:20    NaN
2016-05-25 00:02:35    NaN
2016-05-25 00:02:50    NaN
2016-05-25 00:03:05    NaN
2016-05-25 00:03:20    NaN
2016-05-25 00:03:35    NaN
2016-05-25 00:03:50    NaN
2016-05-25 00:04:05    NaN
2016-05-25 00:04:20    NaN
2016-05-25 00:04:35    NaN
2016-05-25 00:04:50    NaN
2016-05-25 00:05:05    NaN
2016-05-25 00:05:20    NaN
2016-05-25 00:05:35    NaN
2016-05-25 00:05:50    NaN
Freq: 15S, dtype: float64
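
A quick way to see which input points the asfreq step discards is to test each timestamp against the 15-second grid. The sketch below uses only the standard library and assumes the bin edges are anchored at midnight plus 5 seconds, which is consistent with the output above; the names `FREQ`, `OFFSET` and `on_grid` are made up for illustration.

```python
from datetime import datetime

FREQ = 15    # bin width in seconds, as in resample('15s', ...)
OFFSET = 5   # grid shift in seconds, as in base=5

stamps = [
    "2016-05-25 00:00:35", "2016-05-25 00:00:50", "2016-05-25 00:01:05",
    "2016-05-25 00:01:35", "2016-05-25 00:02:05", "2016-05-25 00:03:00",
    "2016-05-25 00:04:00", "2016-05-25 00:06:00",
]

def on_grid(ts: datetime) -> bool:
    """True if ts falls exactly on a bin edge (assumed: midnight + OFFSET + k * FREQ)."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    return (seconds - OFFSET) % FREQ == 0

times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps]
dropped = [t.strftime("%H:%M:%S") for t in times if not on_grid(t)]
print(dropped)  # ['00:03:00', '00:04:00', '00:06:00']
```

The three samples after 2:05 all sit 10 seconds past a bin edge, so none of them survives the upsample and the interpolation has nothing to work with beyond 5.0.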

dershow · 2016-09-26T13:09:02Z

Is there any way to get Pandas to give me a "best guess" at each resampled point in time where it doesn't have an exact value? I was hoping that interpolating this way would provide that "best guess", but it is not doing what I had hoped.

chris-b1 · 2016-09-26T13:34:56Z

One way to do this now is to do an "aggregating upsample" first, and then interpolate, which is what was happening in 0.18.

a.resample('15s', base=5).first().interpolate()

Out[76]: 
2016-05-25 00:00:35    1.000000
2016-05-25 00:00:50    3.000000
2016-05-25 00:01:05    4.000000
2016-05-25 00:01:20    3.500000
2016-05-25 00:01:35    3.000000
2016-05-25 00:01:50    4.000000
2016-05-25 00:02:05    5.000000
2016-05-25 00:02:20    5.333333
2016-05-25 00:02:35    5.666667
2016-05-25 00:02:50    6.000000
2016-05-25 00:03:05    6.250000
2016-05-25 00:03:20    6.500000
2016-05-25 00:03:35    6.750000
2016-05-25 00:03:50    7.000000
2016-05-25 00:04:05    7.125000
2016-05-25 00:04:20    7.250000
2016-05-25 00:04:35    7.375000
2016-05-25 00:04:50    7.500000
2016-05-25 00:05:05    7.625000
2016-05-25 00:05:20    7.750000
2016-05-25 00:05:35    7.875000
2016-05-25 00:05:50    8.000000
Freq: 15S, dtype: float64

dershow · 2016-09-26T16:54:09Z

But that shifts the data by 10s in the above example. We know that the "exact" value at 4:00 is 7.0, and at 3:50 it should be a little bit less (interpolating the input in time gives 6.8333). That's what I was hoping to get, and then to interpolate from there to get 4:20 etc.
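
The 10s shift and the 6.8333 figure can both be checked by hand. The following is a minimal standard-library sketch, not pandas code: times are written as seconds after 00:00:00, and `bin_label` and `interp_at` are hypothetical helper names. It assumes left-labelled bins anchored at midnight plus 5 seconds.

```python
from bisect import bisect_left

FREQ, OFFSET = 15, 5  # 15 s bins shifted by 5 s

# (seconds after 00:00:00, value) for the eight input samples
points = [(35, 1.0), (50, 3.0), (65, 4.0), (95, 3.0),
          (125, 5.0), (180, 6.0), (240, 7.0), (360, 8.0)]
xs = [x for x, _ in points]

def bin_label(t):
    """Left edge of the bin that contains time t."""
    return OFFSET + ((t - OFFSET) // FREQ) * FREQ

def interp_at(t):
    """Linear interpolation in time between the two neighbouring samples."""
    if not xs[0] <= t <= xs[-1]:
        raise ValueError("t lies outside the sampled range")
    i = bisect_left(xs, t)
    if xs[i] == t:
        return points[i][1]
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (t - x0) / (x1 - x0)

print(bin_label(240))            # 230: the 4:00 sample gets the 3:50 label
print(round(interp_at(230), 4))  # 6.8333
print(round(interp_at(245), 4))  # 7.0417
```

So an aggregating upsample reports 7.0 at 3:50 because it relabels the 4:00 sample, while interpolating in time gives 6 + 50/60 at 3:50.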

chris-b1 · 2016-09-26T17:31:56Z

Try this (maybe this is what interpolate should do by default, interpolating before re-sampling?)

from scipy.interpolate import interp1d

# fit the interpolation in integer ns-space
f = interp1d(a.index.asi8, a.values)

# generate the target bin timestamps
dates = a.resample('15s', base=5).first().index

# apply
pd.Series(f(dates.asi8), dates)
Out[122]: 
2016-05-25 00:00:35    1.000000
2016-05-25 00:00:50    3.000000
2016-05-25 00:01:05    4.000000
2016-05-25 00:01:20    3.500000
2016-05-25 00:01:35    3.000000
2016-05-25 00:01:50    4.000000
2016-05-25 00:02:05    5.000000
2016-05-25 00:02:20    5.272727
2016-05-25 00:02:35    5.545455
2016-05-25 00:02:50    5.818182
2016-05-25 00:03:05    6.083333
2016-05-25 00:03:20    6.333333
2016-05-25 00:03:35    6.583333
2016-05-25 00:03:50    6.833333
2016-05-25 00:04:05    7.041667
2016-05-25 00:04:20    7.166667
2016-05-25 00:04:35    7.291667
2016-05-25 00:04:50    7.416667
2016-05-25 00:05:05    7.541667
2016-05-25 00:05:20    7.666667
2016-05-25 00:05:35    7.791667
2016-05-25 00:05:50    7.916667
Freq: 15S, dtype: float64

dershow · 2016-09-26T18:35:29Z

Yes, that is just what I was expecting it would do, and is just what I was looking for. It sure seems more obvious to me that "interpolating at a given frequency" should mean "fill in the missing data points using interpolation", which is just what the example above does.
Thank you!

kdebrab · 2017-03-09T17:55:08Z

FYI: one obtains the same correct result by interpolating on the union of old and new index before resampling to the new index:

# first obtain the desired new index
newindex = a.resample('15S', base=5).asfreq().index

# interpolate on union of old and new index
a_union = a.reindex(a.index.union(newindex)).interpolate(method='time')

# reindex to the new index
a_union.reindex(newindex)

Out[41]: 
2016-05-25 00:00:35    1.000000
2016-05-25 00:00:50    3.000000
2016-05-25 00:01:05    4.000000
2016-05-25 00:01:20    3.500000
2016-05-25 00:01:35    3.000000
2016-05-25 00:01:50    4.000000
2016-05-25 00:02:05    5.000000
2016-05-25 00:02:20    5.272727
2016-05-25 00:02:35    5.545455
2016-05-25 00:02:50    5.818182
2016-05-25 00:03:05    6.083333
2016-05-25 00:03:20    6.333333
2016-05-25 00:03:35    6.583333
2016-05-25 00:03:50    6.833333
2016-05-25 00:04:05    7.041667
2016-05-25 00:04:20    7.166667
2016-05-25 00:04:35    7.291667
2016-05-25 00:04:50    7.416667
2016-05-25 00:05:05    7.541667
2016-05-25 00:05:20    7.666667
2016-05-25 00:05:35    7.791667
2016-05-25 00:05:50    7.916667
Freq: 15S, dtype: float64
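
The same union-reindex recipe can be wrapped in a small helper. This is only a sketch under a few assumptions: `interpolate_then_resample` is a hypothetical name, not a pandas function, and `offset="5s"` stands in for `base=5`, since newer pandas releases no longer accept the `base` argument to resample.

```python
import pandas as pd

a = pd.Series(
    [1., 3., 4., 3., 5., 6., 7., 8.],
    index=pd.to_datetime([
        "2016-05-25 00:00:35", "2016-05-25 00:00:50", "2016-05-25 00:01:05",
        "2016-05-25 00:01:35", "2016-05-25 00:02:05", "2016-05-25 00:03:00",
        "2016-05-25 00:04:00", "2016-05-25 00:06:00",
    ]),
)

def interpolate_then_resample(s, freq="15s", offset="5s"):
    """Interpolate in time on the union of old and new index, then keep the new index."""
    # target grid; offset="5s" is assumed to play the role of base=5
    new_index = s.resample(freq, offset=offset).asfreq().index
    # keep every original sample so none is lost before interpolating
    union = s.index.union(new_index)
    return s.reindex(union).interpolate(method="time").reindex(new_index)

result = interpolate_then_resample(a)
print(result)
```

Because the original samples stay in the series until after the interpolation, the off-grid points at 3:00, 4:00 and 6:00 still shape the values on the new grid.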

jreback · 2017-03-09T18:18:38Z

@kdebrab

So the reason this happens is that the series is first upsampled to the new time buckets via reindexing, and only then does interpolation happen. Without filling (which doesn't happen here), the points that are not on the new grid are dropped.

Since the point of interpolation is neither filling nor dropping (rather, it's interpolation), this is not correct.

If you made this change what breaks in our current test suite?

chris-b1 added API Design Resample resample method labels Sep 25, 2016

jreback added this to the Next Major Release milestone Mar 9, 2017

jreback added Difficulty Intermediate Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Mar 9, 2017

den-run-ai mentioned this issue Aug 6, 2017

pandas resample weekly and interpolate - wrong results #16381

Open

Make42 mentioned this issue Nov 10, 2017

BUG: (linear) interpolation after resampling #18189

Closed

jbrockmendel removed Effort Medium labels Oct 21, 2019

mroeschke added the Bug label May 11, 2020

mroeschke removed the API Design label May 1, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

kopytjuk mentioned this issue Mar 25, 2023

DOC warn user about potential information loss in Resampler.interpolate #52198

Merged
