resample interpolate gives unexpected results in 0.18.1 #14297
Comments
This isn't exactly a regression from version 0.18.0. There, the call was effectively a.resample('15s', base=5).mean().interpolate(). In 0.18.1 the interpolation starts from a.resample('15s', base=5).asfreq():
Out[52]:
2016-05-25 00:00:35 1.0
2016-05-25 00:00:50 3.0
2016-05-25 00:01:05 4.0
2016-05-25 00:01:20 NaN
2016-05-25 00:01:35 3.0
2016-05-25 00:01:50 NaN
2016-05-25 00:02:05 5.0
2016-05-25 00:02:20 NaN
2016-05-25 00:02:35 NaN
2016-05-25 00:02:50 NaN
2016-05-25 00:03:05 NaN
2016-05-25 00:03:20 NaN
2016-05-25 00:03:35 NaN
2016-05-25 00:03:50 NaN
2016-05-25 00:04:05 NaN
2016-05-25 00:04:20 NaN
2016-05-25 00:04:35 NaN
2016-05-25 00:04:50 NaN
2016-05-25 00:05:05 NaN
2016-05-25 00:05:20 NaN
2016-05-25 00:05:35 NaN
2016-05-25 00:05:50 NaN
Freq: 15S, dtype: float64
Is there any way to get pandas to give me a "best guess" at each resampled point in time where it doesn't have an exact value? I was hoping that interpolating this way would provide that "best guess", but it is not doing what I had hoped.
One way to do this now is to do an "aggregating upsample" first and then interpolate, which is what was happening in 0.18.0.
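A minimal sketch of that "aggregate, then interpolate" approach. The series `a` here is hypothetical: the original input is not reproduced in this thread, so the values are chosen only to agree with the numbers quoted in the comments. `offset="5s"` is the newer spelling of `base=5` for a 15 s frequency.

```python
import pandas as pd

# Hypothetical irregular series (an assumption, not the reporter's exact data).
a = pd.Series(
    [1.0, 3.0, 4.0, 3.0, 5.0, 6.0, 7.0, 8.0],
    index=pd.to_datetime([
        "2016-05-25 00:00:35", "2016-05-25 00:00:50", "2016-05-25 00:01:05",
        "2016-05-25 00:01:35", "2016-05-25 00:02:05", "2016-05-25 00:03:00",
        "2016-05-25 00:04:00", "2016-05-25 00:06:00",
    ]),
)

# Step 1: aggregate into 15 s bins starting at :05, :20, :35, :50.
# Empty bins come out as NaN; off-grid points are pulled to their bin label.
binned = a.resample("15s", offset="5s").mean()

# Step 2: fill the empty bins by linear interpolation between neighbouring bins.
filled = binned.interpolate()
print(filled)
```

Note that the point observed at 00:04:00 ends up labelled 00:03:50, because the mean is attached to the left edge of its bin.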
But that shifts the data by 10s in the above example. We know that the "exact" value at 4:00 is 7.0, and at 3:50 it should be a little bit less (interpolating the input in time gives 6.8333). That's what I was hoping to get, and then to interpolate from that to get 4:20 etc.
Try this (maybe this is what interpolate should do by default, interpolating before re-sampling?)
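The snippet that went with this comment is not preserved here. One way to interpolate before resampling, as a sketch only: upsample to a 1 s grid (fine enough that every input point lies on it), interpolate there, then pick out the 15 s points. The series `a` is the same hypothetical data as before, and the 1 s step is an assumption that fits inputs with whole-second timestamps.

```python
import pandas as pd

# Hypothetical irregular series (an assumption, not the reporter's exact data).
a = pd.Series(
    [1.0, 3.0, 4.0, 3.0, 5.0, 6.0, 7.0, 8.0],
    index=pd.to_datetime([
        "2016-05-25 00:00:35", "2016-05-25 00:00:50", "2016-05-25 00:01:05",
        "2016-05-25 00:01:35", "2016-05-25 00:02:05", "2016-05-25 00:03:00",
        "2016-05-25 00:04:00", "2016-05-25 00:06:00",
    ]),
)

# Upsample to 1 s so no input point is lost, then interpolate linearly.
fine = a.resample("1s").asfreq().interpolate()

# Now take the values that fall on the 15 s grid (:05, :20, :35, :50).
coarse = fine.resample("15s", offset="5s").asfreq()
print(coarse)
```

This costs memory proportional to the span in seconds, so it is only reasonable for short series.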
Yes, that is just what I was expecting it would do, and is just what I was looking for. It sure seems more obvious to me that "interpolating at a given frequency" should mean "fill in the missing data points using interpolation", which is just what the example above now does.
FYI: one obtains the same correct result by interpolating on the union of old and new index before resampling to the new index:
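The code for this comment is also missing, so the following is a sketch of the idea rather than the commenter's exact snippet. It reindexes onto the union of the original timestamps and the target grid, interpolates by time, and then keeps only the grid points. The series `a` and the explicit `pd.date_range` grid are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical irregular series (an assumption, not the reporter's exact data).
a = pd.Series(
    [1.0, 3.0, 4.0, 3.0, 5.0, 6.0, 7.0, 8.0],
    index=pd.to_datetime([
        "2016-05-25 00:00:35", "2016-05-25 00:00:50", "2016-05-25 00:01:05",
        "2016-05-25 00:01:35", "2016-05-25 00:02:05", "2016-05-25 00:03:00",
        "2016-05-25 00:04:00", "2016-05-25 00:06:00",
    ]),
)

# Target 15 s grid, written out explicitly instead of going through resample.
grid = pd.date_range("2016-05-25 00:00:35", "2016-05-25 00:05:50", freq="15s")

# Keep every original point while adding the grid points as NaN placeholders.
combined = a.reindex(a.index.union(grid))

# method="time" weights by the actual time gaps, which matters here
# because the combined index is irregular.
result = combined.interpolate(method="time").reindex(grid)
print(result)
```

Unlike the 1 s upsample, this only materialises the original points plus the grid points.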
The reason this happens is that the series is first reindexed to the new time buckets (upsampled), and only then interpolated. So without filling (which doesn't happen here), the points that are not on the new grid are dropped. Since the point of interpolation is neither filling nor dropping, this is not correct. If you made this change, what breaks in our current test suite?
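To make the mechanism concrete, here is a sketch that mimics "reindex first, interpolate second" with plain `reindex`, on the same hypothetical series `a` (an assumption; the reporter's exact input is not shown in this thread). The points at 3:00, 4:00 and 6:00 are off the grid, so they disappear before interpolation ever sees them.

```python
import pandas as pd

# Hypothetical irregular series (an assumption, not the reporter's exact data).
a = pd.Series(
    [1.0, 3.0, 4.0, 3.0, 5.0, 6.0, 7.0, 8.0],
    index=pd.to_datetime([
        "2016-05-25 00:00:35", "2016-05-25 00:00:50", "2016-05-25 00:01:05",
        "2016-05-25 00:01:35", "2016-05-25 00:02:05", "2016-05-25 00:03:00",
        "2016-05-25 00:04:00", "2016-05-25 00:06:00",
    ]),
)

grid = pd.date_range("2016-05-25 00:00:35", "2016-05-25 00:05:50", freq="15s")

# Reindexing keeps only timestamps that are exactly on the grid.
on_grid = a.reindex(grid)
print(on_grid.notna().sum(), "of", len(a), "input points survive")

# Interpolating afterwards has nothing to work with past 2:05,
# so the trailing NaNs are just filled with the last surviving value.
naive = on_grid.interpolate()
print(naive)
```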
A small, complete example of the issue
Expected Output
I expect that I would get valid values, based on the input at 2:05 and later. It appears that the data after 2:05 is ignored.
Output of pd.show_versions()
In [146]: pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.2
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.4.1
html5lib: 1.0b3
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
Out[81]:
2016-05-25 00:00:35 1.0
2016-05-25 00:00:50 3.0
2016-05-25 00:01:05 4.0
2016-05-25 00:01:20 3.5
2016-05-25 00:01:35 3.0
2016-05-25 00:01:50 4.0
2016-05-25 00:02:05 5.0
2016-05-25 00:02:20 5.0
2016-05-25 00:02:35 5.0
2016-05-25 00:02:50 5.0
2016-05-25 00:03:05 5.0
2016-05-25 00:03:20 5.0
2016-05-25 00:03:35 5.0
2016-05-25 00:03:50 5.0
2016-05-25 00:04:05 5.0
2016-05-25 00:04:20 5.0
2016-05-25 00:04:35 5.0
2016-05-25 00:04:50 5.0
2016-05-25 00:05:05 5.0
2016-05-25 00:05:20 5.0
2016-05-25 00:05:35 5.0
2016-05-25 00:05:50 5.0
Freq: 15S, dtype: float64
I was told that this error does not show up in 0.18.0, but I have not confirmed that.
This comes from my attempts to interpolate some irregular data as shown in this question: https://stackoverflow.com/questions/39599192/fill-in-time-data-in-pandas