resample interpolate gives unexpected results in 0.18.1 #14297

Open
dershow opened this issue Sep 25, 2016 · 8 comments
Labels: Bug, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Resample (resample method)

Comments

@dershow

dershow commented Sep 25, 2016

A small, complete example of the issue

import pandas as pd

a = pd.Series([1.,3.,4.,3.,5.,6.,7.,8.],
              ['2016-05-25 00:00:35','2016-05-25 00:00:50','2016-05-25 00:01:05','2016-05-25 00:01:35',
               '2016-05-25 00:02:05','2016-05-25 00:03:00','2016-05-25 00:04:00','2016-05-25 00:06:00'])

In [79]: a
Out[79]: 
2016-05-25 00:00:35    1.0
2016-05-25 00:00:50    3.0
2016-05-25 00:01:05    4.0
2016-05-25 00:01:35    3.0
2016-05-25 00:02:05    5.0
2016-05-25 00:03:00    6.0
2016-05-25 00:04:00    7.0
2016-05-25 00:06:00    8.0
dtype: float64

In [80]: a.index = pd.to_datetime(a.index)

In [81]: a.resample('15S', base=5).interpolate()

Expected Output

I expected to get valid interpolated values based on the input at 2:05 and later. Instead, it appears that the data after 2:05 is ignored.

Output of pd.show_versions()

In [146]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.2
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.4.1
html5lib: 1.0b3
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Actual output of In [81] (every value after 00:02:05 stays flat at 5.0):

Out[81]:
2016-05-25 00:00:35 1.0
2016-05-25 00:00:50 3.0
2016-05-25 00:01:05 4.0
2016-05-25 00:01:20 3.5
2016-05-25 00:01:35 3.0
2016-05-25 00:01:50 4.0
2016-05-25 00:02:05 5.0
2016-05-25 00:02:20 5.0
2016-05-25 00:02:35 5.0
2016-05-25 00:02:50 5.0
2016-05-25 00:03:05 5.0
2016-05-25 00:03:20 5.0
2016-05-25 00:03:35 5.0
2016-05-25 00:03:50 5.0
2016-05-25 00:04:05 5.0
2016-05-25 00:04:20 5.0
2016-05-25 00:04:35 5.0
2016-05-25 00:04:50 5.0
2016-05-25 00:05:05 5.0
2016-05-25 00:05:20 5.0
2016-05-25 00:05:35 5.0
2016-05-25 00:05:50 5.0
Freq: 15S, dtype: float64

I was told that this error does not show up in 0.18.0, but I have not confirmed that.

This comes from my attempts to interpolate some irregular data as shown in this question: https://stackoverflow.com/questions/39599192/fill-in-time-data-in-pandas

@chris-b1
Contributor

This isn't exactly a regression from version 0.18.0 - in that version interpolate didn't exist as a resample method, so what you were actually getting was:

a.resample('15s', base=5).mean().interpolate()

In 0.18.1 interpolate was added as a resample method, but it does an asfreq upsample before interpolating. That upsample only picks up exact matches (times that fall exactly on a bin edge), so some of the data isn't used. There's definitely a case to be made that something like the old behavior should be exposed.

a.resample('15s', base=5).asfreq()
Out[52]: 
2016-05-25 00:00:35    1.0
2016-05-25 00:00:50    3.0
2016-05-25 00:01:05    4.0
2016-05-25 00:01:20    NaN
2016-05-25 00:01:35    3.0
2016-05-25 00:01:50    NaN
2016-05-25 00:02:05    5.0
2016-05-25 00:02:20    NaN
2016-05-25 00:02:35    NaN
2016-05-25 00:02:50    NaN
2016-05-25 00:03:05    NaN
2016-05-25 00:03:20    NaN
2016-05-25 00:03:35    NaN
2016-05-25 00:03:50    NaN
2016-05-25 00:04:05    NaN
2016-05-25 00:04:20    NaN
2016-05-25 00:04:35    NaN
2016-05-25 00:04:50    NaN
2016-05-25 00:05:05    NaN
2016-05-25 00:05:20    NaN
2016-05-25 00:05:35    NaN
2016-05-25 00:05:50    NaN
Freq: 15S, dtype: float64
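
The effect described above can be reproduced without going through resample at all. The following is a minimal sketch, assuming the 15-second grid is built by hand with `pd.date_range` (the original call used `base=5`, which is not available in newer pandas releases) and that a plain `reindex` stands in for the asfreq upsample. It lists the input timestamps that never reach the interpolation step.

```python
import pandas as pd

# Input data from the report.
a = pd.Series(
    [1., 3., 4., 3., 5., 6., 7., 8.],
    index=pd.to_datetime([
        '2016-05-25 00:00:35', '2016-05-25 00:00:50',
        '2016-05-25 00:01:05', '2016-05-25 00:01:35',
        '2016-05-25 00:02:05', '2016-05-25 00:03:00',
        '2016-05-25 00:04:00', '2016-05-25 00:06:00',
    ]),
)

# Assumption: hand-built grid matching the bin labels shown in the thread.
grid = pd.date_range('2016-05-25 00:00:35', '2016-05-25 00:05:50', freq='15s')

# Reindexing keeps a value only where an input timestamp equals a grid point.
on_grid = a.reindex(grid)

# Input timestamps that are not on the grid are silently dropped.
dropped = a.index.difference(grid)
print(dropped)

# Interpolating afterwards can only use the surviving points, so the
# trailing NaNs are filled from the last surviving value.
filled = on_grid.interpolate()
print(filled.tail())
```

Only five of the eight input points survive the reindex; the samples at 00:03:00, 00:04:00 and 00:06:00 are gone before interpolation starts, which is why the tail of the result is constant.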

@chris-b1 chris-b1 added the API Design and Resample labels Sep 25, 2016
@dershow
Author

dershow commented Sep 26, 2016

Is there any way to get Pandas to give me a "best guess" at each resampled point in time where it doesn't have an exact value? I was hoping that interpolating this way would provide that "best guess", but it is not doing what I had hoped.

@chris-b1
Contributor

One way to do this now is to do an "aggregating upsample" first, and then interpolate, which is what was happening in 0.18.0.

a.resample('15s', base=5).first().interpolate()

Out[76]: 
2016-05-25 00:00:35    1.000000
2016-05-25 00:00:50    3.000000
2016-05-25 00:01:05    4.000000
2016-05-25 00:01:20    3.500000
2016-05-25 00:01:35    3.000000
2016-05-25 00:01:50    4.000000
2016-05-25 00:02:05    5.000000
2016-05-25 00:02:20    5.333333
2016-05-25 00:02:35    5.666667
2016-05-25 00:02:50    6.000000
2016-05-25 00:03:05    6.250000
2016-05-25 00:03:20    6.500000
2016-05-25 00:03:35    6.750000
2016-05-25 00:03:50    7.000000
2016-05-25 00:04:05    7.125000
2016-05-25 00:04:20    7.250000
2016-05-25 00:04:35    7.375000
2016-05-25 00:04:50    7.500000
2016-05-25 00:05:05    7.625000
2016-05-25 00:05:20    7.750000
2016-05-25 00:05:35    7.875000
2016-05-25 00:05:50    8.000000
Freq: 15S, dtype: float64

@dershow
Author

dershow commented Sep 26, 2016

But that shifts the data by 10s in the above example. We know that the "exact" value at 4:00 is 7.0, so at 3:50 it should be a little bit less (interpolating the input in time gives 6.8333). That's what I was hoping to get, and then to interpolate from there to get 4:20 etc.
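
Both numbers in this comment can be checked by hand with the standard library. The sketch below is illustrative only: `bin_label` and `lerp_in_time` are hypothetical helper names, and the binning rule (15-second bins anchored at midnight plus 5 seconds, labeled by their left edge) is an assumption chosen to match the labels in the output above.

```python
from datetime import datetime, timedelta


def bin_label(ts, width_s=15, offset_s=5):
    """Left edge of the bin containing ts (bins anchored at midnight + offset_s)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = (ts - midnight).total_seconds() - offset_s
    return midnight + timedelta(seconds=offset_s + (elapsed // width_s) * width_s)


def lerp_in_time(t, t0, v0, t1, v1):
    """Linear interpolation between (t0, v0) and (t1, v1), weighted by elapsed time."""
    frac = (t - t0).total_seconds() / (t1 - t0).total_seconds()
    return v0 + frac * (v1 - v0)


t_300 = datetime(2016, 5, 25, 0, 3, 0)   # sample with value 6.0
t_400 = datetime(2016, 5, 25, 0, 4, 0)   # sample with value 7.0
t_350 = datetime(2016, 5, 25, 0, 3, 50)  # grid point of interest

# The aggregating upsample reports the 4:00 sample under its bin label.
label = bin_label(t_400)
shift = (t_400 - label).total_seconds()

# Time-weighted value at 3:50, using the two neighbouring input samples.
value = lerp_in_time(t_350, t_300, 6.0, t_400, 7.0)

print(label.time(), shift, round(value, 4))
```

The 4:00 sample lands on the 3:50 label, a 10-second shift, while interpolating in time between the 3:00 and 4:00 samples gives about 6.8333 at 3:50.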

@chris-b1
Contributor

chris-b1 commented Sep 26, 2016

Try this (maybe this is what interpolate should do by default, interpolating before re-sampling?)

from scipy.interpolate import interp1d

# fit a linear interpolant in integer-nanosecond space (asi8 gives the index as int64)
f = interp1d(a.index.asi8, a.values)

# build the target 15-second bin labels
dates = a.resample('15s', base=5).first().index

# evaluate the interpolant at the target timestamps
pd.Series(f(dates.asi8), dates)
Out[122]: 
2016-05-25 00:00:35    1.000000
2016-05-25 00:00:50    3.000000
2016-05-25 00:01:05    4.000000
2016-05-25 00:01:20    3.500000
2016-05-25 00:01:35    3.000000
2016-05-25 00:01:50    4.000000
2016-05-25 00:02:05    5.000000
2016-05-25 00:02:20    5.272727
2016-05-25 00:02:35    5.545455
2016-05-25 00:02:50    5.818182
2016-05-25 00:03:05    6.083333
2016-05-25 00:03:20    6.333333
2016-05-25 00:03:35    6.583333
2016-05-25 00:03:50    6.833333
2016-05-25 00:04:05    7.041667
2016-05-25 00:04:20    7.166667
2016-05-25 00:04:35    7.291667
2016-05-25 00:04:50    7.416667
2016-05-25 00:05:05    7.541667
2016-05-25 00:05:20    7.666667
2016-05-25 00:05:35    7.791667
2016-05-25 00:05:50    7.916667
Freq: 15S, dtype: float64
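
The same interpolate-then-sample idea can be sketched without SciPy. This variant is a substitution, not the code from the comment above: it uses `numpy.interp` and works in elapsed seconds rather than integer nanoseconds, and it assumes the target grid is built by hand with `pd.date_range`.

```python
import numpy as np
import pandas as pd

# Input data from the report.
a = pd.Series(
    [1., 3., 4., 3., 5., 6., 7., 8.],
    index=pd.to_datetime([
        '2016-05-25 00:00:35', '2016-05-25 00:00:50',
        '2016-05-25 00:01:05', '2016-05-25 00:01:35',
        '2016-05-25 00:02:05', '2016-05-25 00:03:00',
        '2016-05-25 00:04:00', '2016-05-25 00:06:00',
    ]),
)

# Assumption: hand-built grid matching the bin labels shown in the thread.
grid = pd.date_range('2016-05-25 00:00:35', '2016-05-25 00:05:50', freq='15s')

# Express both sets of timestamps as seconds elapsed since the first sample.
t0 = a.index[0]
x_known = np.asarray((a.index - t0).total_seconds())
x_new = np.asarray((grid - t0).total_seconds())

# Piecewise-linear interpolation of the known samples at the grid points.
result = pd.Series(np.interp(x_new, x_known, a.values), index=grid)
print(result)
```

Working in elapsed seconds keeps the arithmetic independent of how the index stores its timestamps internally. Note that `numpy.interp` clamps to the end values outside the sampled range, which does not matter here because the grid lies inside it.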

@dershow
Author

dershow commented Sep 26, 2016

Yes, that is just what I was expecting it would do, and is just what I was looking for. It sure seems more obvious to me that "interpolating at a given frequency" should mean "fill in the missing data points using interpolation", which is just what the example above now does.
Thank you!

@kdebrab
Contributor

kdebrab commented Mar 9, 2017

FYI: one obtains the same correct result by interpolating on the union of the old and new index and then reindexing to the new index:

# first obtain the desired new index
newindex = a.resample('15S', base=5).asfreq().index

# interpolate on union of old and new index
a_union = a.reindex(a.index.union(newindex)).interpolate(method='time')

# reindex to the new index
a_union.reindex(newindex)

Out[41]: 
2016-05-25 00:00:35    1.000000
2016-05-25 00:00:50    3.000000
2016-05-25 00:01:05    4.000000
2016-05-25 00:01:20    3.500000
2016-05-25 00:01:35    3.000000
2016-05-25 00:01:50    4.000000
2016-05-25 00:02:05    5.000000
2016-05-25 00:02:20    5.272727
2016-05-25 00:02:35    5.545455
2016-05-25 00:02:50    5.818182
2016-05-25 00:03:05    6.083333
2016-05-25 00:03:20    6.333333
2016-05-25 00:03:35    6.583333
2016-05-25 00:03:50    6.833333
2016-05-25 00:04:05    7.041667
2016-05-25 00:04:20    7.166667
2016-05-25 00:04:35    7.291667
2016-05-25 00:04:50    7.416667
2016-05-25 00:05:05    7.541667
2016-05-25 00:05:20    7.666667
2016-05-25 00:05:35    7.791667
2016-05-25 00:05:50    7.916667
Freq: 15S, dtype: float64
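
The three steps above can be wrapped in a small reusable function. This is a sketch under stated assumptions: `interpolate_onto` is a hypothetical helper name, not a pandas API, and the target grid is built by hand with `pd.date_range`.

```python
import pandas as pd


def interpolate_onto(series, new_index):
    """Time-weighted linear interpolation of `series` at `new_index`.

    Hypothetical helper: interpolate on the union of both indexes so that
    every original sample contributes, then keep only the requested points.
    """
    combined = series.index.union(new_index)
    return series.reindex(combined).interpolate(method='time').reindex(new_index)


# Input data from the report.
a = pd.Series(
    [1., 3., 4., 3., 5., 6., 7., 8.],
    index=pd.to_datetime([
        '2016-05-25 00:00:35', '2016-05-25 00:00:50',
        '2016-05-25 00:01:05', '2016-05-25 00:01:35',
        '2016-05-25 00:02:05', '2016-05-25 00:03:00',
        '2016-05-25 00:04:00', '2016-05-25 00:06:00',
    ]),
)

# Assumption: hand-built grid matching the bin labels shown in the thread.
grid = pd.date_range('2016-05-25 00:00:35', '2016-05-25 00:05:50', freq='15s')

out = interpolate_onto(a, grid)
print(out)
```

With pandas' default settings, grid points before the first sample stay NaN, so the helper is best used on a grid that lies within the sampled range, as it does here.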

@jreback
Contributor

jreback commented Mar 9, 2017

@kdebrab

So the reason this happens is that the series is first reindexed to the new time buckets (upsampled), and only then does interpolation happen. Without filling (which doesn't happen here), the points that are not on the new interval are dropped.

Since the point of interpolation is neither filling nor dropping (rather, it's interpolation), this is not correct.

If you made this change what breaks in our current test suite?

@jreback jreback added this to the Next Major Release milestone Mar 9, 2017
@jreback jreback added the Difficulty Intermediate and Missing-data labels Mar 9, 2017
@mroeschke mroeschke added the Bug label May 11, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022