pandas resample weekly and interpolate - wrong results #16381

Open
den-run-ai opened this issue May 17, 2017 · 11 comments
Labels: Bug, Datetime, Missing-data, Resample

Comments

@den-run-ai

den-run-ai commented May 17, 2017

import pandas as pd
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 2.9.2
pip: 8.1.2
setuptools: 34.4.1
Cython: 0.24.1
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.6.1
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.7.2
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
pd.date_range("1900/01/01","1900/12/31",freq='M')
DatetimeIndex(['1900-01-31', '1900-02-28', '1900-03-31', '1900-04-30',
               '1900-05-31', '1900-06-30', '1900-07-31', '1900-08-31',
               '1900-09-30', '1900-10-31', '1900-11-30', '1900-12-31'],
              dtype='datetime64[ns]', freq='M')
pdtest=pd.DataFrame(data=list(range(12,0,-1)),index=pd.date_range("1900/01/01","1900/12/31",freq='M'))
pdtest
0
1900-01-31 12
1900-02-28 11
1900-03-31 10
1900-04-30 9
1900-05-31 8
1900-06-30 7
1900-07-31 6
1900-08-31 5
1900-09-30 4
1900-10-31 3
1900-11-30 2
1900-12-31 1
pdtest.resample('D').interpolate()[:15]
0
1900-01-31 12.000000
1900-02-01 11.964286
1900-02-02 11.928571
1900-02-03 11.892857
1900-02-04 11.857143
1900-02-05 11.821429
1900-02-06 11.785714
1900-02-07 11.750000
1900-02-08 11.714286
1900-02-09 11.678571
1900-02-10 11.642857
1900-02-11 11.607143
1900-02-12 11.571429
1900-02-13 11.535714
1900-02-14 11.500000
pdtest.resample('W-MON').interpolate()
0
1900-02-05 NaN
1900-02-12 NaN
1900-02-19 NaN
1900-02-26 NaN
1900-03-05 NaN
1900-03-12 NaN
1900-03-19 NaN
1900-03-26 NaN
1900-04-02 NaN
1900-04-09 NaN
1900-04-16 NaN
1900-04-23 NaN
1900-04-30 9.000000
1900-05-07 8.771429
1900-05-14 8.542857
1900-05-21 8.314286
1900-05-28 8.085714
1900-06-04 7.857143
1900-06-11 7.628571
1900-06-18 7.400000
1900-06-25 7.171429
1900-07-02 6.942857
1900-07-09 6.714286
1900-07-16 6.485714
1900-07-23 6.257143
1900-07-30 6.028571
1900-08-06 5.800000
1900-08-13 5.571429
1900-08-20 5.342857
1900-08-27 5.114286
1900-09-03 4.885714
1900-09-10 4.657143
1900-09-17 4.428571
1900-09-24 4.200000
1900-10-01 3.971429
1900-10-08 3.742857
1900-10-15 3.514286
1900-10-22 3.285714
1900-10-29 3.057143
1900-11-05 2.828571
1900-11-12 2.600000
1900-11-19 2.371429
1900-11-26 2.142857
1900-12-03 1.914286
1900-12-10 1.685714
1900-12-17 1.457143
1900-12-24 1.228571
1900-12-31 1.000000
@jreback
Contributor

jreback commented May 17, 2017

Please replace the top of the issue with a copy-pastable example and the output of pd.show_versions(), as indicated on the issue request page.

@den-run-ai
Author

@jreback done!

@jreback
Contributor

jreback commented May 18, 2017

Can you remove the rendered frames? Simply run this in IPython and paste the results.

@den-run-ai
Author

@jreback what is wrong with the frames? I don't work with pandas in the IPython terminal.

@TomAugspurger
Contributor

TomAugspurger commented May 18, 2017

@denfromufa can you post your expected output?

I think you're getting tripped up by the endpoints. When you do pdtest.resample('W-MON').interpolate(), the data is first upsampled onto the new weekly grid:

In [42]: pdtest.resample("W-MON")._upsample(None).head()
Out[42]:
             0
1900-02-05 NaN
1900-02-12 NaN
1900-02-19 NaN
1900-02-26 NaN
1900-03-05 NaN

and then interpolated.

Since the original left endpoint (1900-01-31, a Wednesday) doesn't align with a W-MON frequency, the upsampled frame starts with NaN, and .interpolate leaves NaN for everything before the first valid (upsampled) observation, which here is 1900-04-30.

If you resample at a frequency that does align with your original first point, I think you'll get what you expect:

In [43]: pdtest.resample("W-WED")._upsample(None).head()
Out[43]:
               0
1900-01-31  12.0
1900-02-07   NaN
1900-02-14   NaN
1900-02-21   NaN
1900-02-28  11.0

In [45]: pdtest.resample("W-WED")._upsample(None).interpolate().head()
Out[45]:
                0
1900-01-31  12.00
1900-02-07  11.75
1900-02-14  11.50
1900-02-21  11.25
1900-02-28  11.00
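
The two steps described above (upsample, then interpolate) can also be reproduced with public API only. The sketch below assumes that `.resample(rule).asfreq()` produces the same NaN-padded grid as the private `_upsample(None)` call used in this comment. It builds the month-end index from an offset object instead of a frequency alias. Neither detail comes from the original report.

```python
import pandas as pd

# Same data as in the report; the index is built from an offset object
# (an assumption made here so the snippet does not rely on a frequency alias).
idx = pd.date_range("1900-01-31", periods=12, freq=pd.offsets.MonthEnd())
pdtest = pd.DataFrame({0: list(range(12, 0, -1))}, index=idx)

# Step 1: upsampling keeps an original value only where its timestamp
# coincides with a label of the new weekly grid; everything else is NaN.
up_mon = pdtest.resample("W-MON").asfreq()
up_wed = pdtest.resample("W-WED").asfreq()

# Step 2: default linear interpolation fills forward from the first valid
# point, so leading NaNs stay NaN.
interp_mon = up_mon.interpolate()

first_valid_mon = up_mon[0].first_valid_index()
first_valid_wed = up_wed[0].first_valid_index()
print("first valid W-MON label:", first_valid_mon)
print("first valid W-WED label:", first_valid_wed)
```

Comparing the two first valid labels shows why the W-WED result looks right while the W-MON result starts with a long run of NaN.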

@den-run-ai
Author

den-run-ai commented May 18, 2017

@TomAugspurger this is a good explanation, but I expected interpolation to work even for misaligned data. I think the safest option for weekly interpolation is something like this:

pdtest.resample('D').interpolate()[::7]

But the top (and only) upvoted answer on SO suggests what I did originally:
http://stackoverflow.com/a/14531149/2230844

Anyway, I'm having an even bigger problem with the original weekly interpolation method; let me open another issue for it.
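
One detail of the daily-then-slice approach is worth spelling out: `[::7]` keeps the weekday of the first observation, so with this data it yields Wednesdays, not the Mondays of `W-MON`. A minimal sketch, assuming the same data as in the report (the index is built from an offset object here, which is not from the original), that selects by weekday instead:

```python
import pandas as pd

idx = pd.date_range("1900-01-31", periods=12, freq=pd.offsets.MonthEnd())
pdtest = pd.DataFrame({0: list(range(12, 0, -1))}, index=idx)

# Every original timestamp lies on the daily grid, so no observation is
# dropped before interpolating.
daily = pdtest.resample("D").asfreq().interpolate()

# Every 7th row starting from the first date: the weekday of 1900-01-31.
every_7th = daily[::7]

# Explicit weekday selection (Monday == 0) gives the W-MON labels instead.
mondays = daily[daily.index.dayofweek == 0]

print(every_7th.head(3))
print(mondays.head(3))
```

The weekday filter costs one extra line and does not depend on which day the series happens to start.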

@TomAugspurger
Contributor

Agreed that it's a surprising output, unless you're familiar with how it's implemented. I'm not sure there's much we can do, though... Potentially we could fill the endpoints of the upsampled DataFrame with the original endpoints?

# would have to handle DataFrames properly, but this is the main idea
In [35]: w = pdtest.resample("W")

In [36]: up = w._upsample(None)

In [37]: up.squeeze().fillna({up.index[0]: pdtest.iloc[0, 0]}).interpolate().head()
Out[37]:
1900-02-04    12.000000
1900-02-11    11.764706
1900-02-18    11.529412
1900-02-25    11.294118
1900-03-04    11.058824
Freq: W-SUN, Name: 0, dtype: float64

We would want to look at whether that breaks anything in upsampling.
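
A rough sketch of how the endpoint-fill idea might look for a whole DataFrame, using only public API. The helper name `interpolate_with_seeded_start` is hypothetical, `.asfreq()` stands in for the private `_upsample(None)` call, and the index is built from an offset object; none of this is an existing pandas feature or a proposed patch.

```python
import pandas as pd


def interpolate_with_seeded_start(df, rule):
    """Hypothetical helper: upsample `df` to `rule`, seed the first new label
    with the first original row where it is NaN, then interpolate linearly."""
    up = df.resample(rule).asfreq()
    # Column-wise fill, so each column of a DataFrame gets its own start value.
    up.iloc[0] = up.iloc[0].fillna(df.iloc[0])
    return up.interpolate()


idx = pd.date_range("1900-01-31", periods=12, freq=pd.offsets.MonthEnd())
pdtest = pd.DataFrame({0: list(range(12, 0, -1))}, index=idx)

seeded = interpolate_with_seeded_start(pdtest, "W")
print(seeded.head())
```

Note that this assigns the 1900-01-31 value to the 1900-02-04 label, so the early part of the curve is shifted by a few days compared with a time-based interpolation through the original points.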

@den-run-ai
Author

den-run-ai commented May 18, 2017

@TomAugspurger I think one problem here is that the keyword syntax fill_method='interpolate' is deprecated in .resample(). But the resampling can depend on the method, as in this case.

@den-run-ai
Author

related?

#14297

@eromoe

eromoe commented Nov 12, 2018

I ran into this problem too... when can it be fixed?

toobaz added the Datetime and Resample labels on Jan 7, 2019
@den-run-ai
Author

@eromoe here is a workaround:

https://stackoverflow.com/a/44053092/2230844
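
Independently of the linked answer (its contents are not reproduced here), one way to get values at misaligned weekly labels with public API is to interpolate over the union of the original and target indexes, then keep only the target labels. This is a sketch under the same data assumptions as the report, with the index built from an offset object; the variable names are illustrative.

```python
import pandas as pd

idx = pd.date_range("1900-01-31", periods=12, freq=pd.offsets.MonthEnd())
pdtest = pd.DataFrame({0: list(range(12, 0, -1))}, index=idx)

# Target grid: every Monday inside the observed date range.
target = pd.date_range(pdtest.index[0], pdtest.index[-1], freq="W-MON")

# Reindex onto the union so the original observations stay in place as
# anchors, interpolate by elapsed time, then keep only the Monday labels.
combined = pdtest.reindex(pdtest.index.union(target))
weekly = combined.interpolate(method="time").loc[target]

print(weekly.head())
```

Because the original month-end points remain in the frame during interpolation, no weekly label before the first aligned observation is left as NaN.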

mroeschke added the Bug label on Mar 31, 2020
mroeschke added the Missing-data label on Jun 12, 2021