Upsampling a time-series is missing an option to properly deal with the end #10449

Open
filmor opened this issue Jun 26, 2015 · 10 comments
Labels
Enhancement Period Period data type Resample resample method

Comments

@filmor
Contributor

filmor commented Jun 26, 2015

Consider the following:

index = pd.date_range("2015-01-01", "2015-01-01 02:00", freq="1H", closed="left")
s = pd.Series(1, index=index)
s.resample("15T", fill_method="ffill", closed="left")

The result looks like this

2015-01-01 00:00:00    1
2015-01-01 00:15:00    1
2015-01-01 00:30:00    1
2015-01-01 00:45:00    1
2015-01-01 01:00:00    1
Freq: 15T, dtype: int64

What I actually want is

2015-01-01 00:00:00    1
2015-01-01 00:15:00    1
2015-01-01 00:30:00    1
2015-01-01 00:45:00    1
2015-01-01 01:00:00    1
2015-01-01 01:15:00    1
2015-01-01 01:30:00    1
2015-01-01 01:45:00    1
Freq: 15T, dtype: int64

Currently it seems that you always have to do the resampling yourself by creating a new index of the new frequency from the same begin and end values, reindexing, and forward filling.

Actually, I would have expected closed to work like this. Any hints on a reasonable parameter so I can try to prepare a PR for this?
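For readers on a current pandas, the manual workaround described above can be sketched as follows. This is only an illustration, assuming pandas 1.4 or later, where `date_range` takes `inclusive="left"` in place of the older `closed="left"` and `resample(...).ffill()` replaces `fill_method="ffill"`. The frequency strings `"h"` and `"15min"` are the newer spellings of `"1H"` and `"15T"`.

```python
import pandas as pd

# Hourly series over the half-open range [00:00, 02:00).
start, end = "2015-01-01", "2015-01-01 02:00"
index = pd.date_range(start, end, freq="h", inclusive="left")
s = pd.Series(1, index=index)

# Plain upsampling stops at the last existing timestamp (01:00).
upsampled = s.resample("15min").ffill()

# Manual workaround: build the target index from the same begin and end
# values at the new frequency, then reindex and forward-fill.
target = pd.date_range(start, end, freq="15min", inclusive="left")
filled = s.reindex(target, method="ffill")

print(len(upsampled), upsampled.index[-1])  # 5 2015-01-01 01:00:00
print(len(filled), filled.index[-1])        # 8 2015-01-01 01:45:00
```

The drawback raised in this issue is visible here: `start` and `end` have to be carried around separately, because the series itself no longer knows about 02:00.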

@kawochen
Contributor

Ah. I misspoke earlier. The two closed="left" arguments don't seem to mean the same thing for the right endpoint. The first one (in date_range) drops it, and the second one (in resample) preserves it. Maybe just make them consistent, instead of adding a parameter?

@dagru

dagru commented Jun 30, 2015

I don't think consistency is what you want here. That would mean that you shorten the Series even more, ending at 2015-01-01 00:45:00 instead of the expected 2015-01-01 01:45:00.

What I would expect a priori from pd.date_range(closed = "left") is a representation of the interval, excluding its right endpoint, like
x___x___x___
such that an upsampling without any further options would lead to
x_x_x_x_x_x_
I don't know if that is easily possible, but at least you should have the possibility to remind it that the interval is right-open when you resample.

What date_range [or Series?] actually does is forget not only the endpoint, but the complete last interval
x___x___x___ (what I think it should do)
x___x___x (what it does)
such that an upsampling, no matter if closed or open leads to
x_x_x_x_x

As far as I can see, you can only get it to go one step over the old endpoint if the new frequency is not a multiple of the old one, like resample("45T") gets you
x___x___x (original)
x__x__x__x (closed = None)
x__x__x (closed = "left")

But if the new frequency is a multiple of the old one, it will always end at the last endpoint, no matter which option I tried.

@filmor
Contributor Author

filmor commented Jun 30, 2015

That is exactly what I mean :)

In our context (energy), the timestamps almost always represent the beginning of an interval. In other cases they represent the end, but I think that fails in the same way.

@jreback jreback added the Resample resample method label Jun 30, 2015
@jreback
Contributor

jreback commented Jun 30, 2015

@filmor I am not sure why you think this should go to the 2 hour ever. It's not included in the sample.

you can simply do this:

In [20]: s.reindex(pd.date_range("2015-01-01", "2015-01-01 02:00", freq='15T'), method='ffill')
Out[20]: 
2015-01-01 00:00:00    1
2015-01-01 00:15:00    1
2015-01-01 00:30:00    1
2015-01-01 00:45:00    1
2015-01-01 01:00:00    1
2015-01-01 01:15:00    1
2015-01-01 01:30:00    1
2015-01-01 01:45:00    1
2015-01-01 02:00:00    1
Freq: 15T, dtype: int64

@filmor
Contributor Author

filmor commented Jun 30, 2015

Yeah, for that I have to remember the end-points. But by having an hourly timeseries with values for each point in time I already indicate (and that's what forward-fill does correctly for all points but the last one) that each timestamp represents the start of a one-hour interval.

@dagru

dagru commented Jul 1, 2015

"I am not sure why you think this should go to the 2 hour ever. "

Because I expect(ed) pd.date_range("2015-01-01", "2015-01-01 02:00", freq="1H", closed="left") to be an object that models the range from 00:00 until 02:00 (excluded), meaning that all timestamps until 02:00-\eps are in the range. If you sample with 1H frequency, that happens to stop at 01:00, but that doesn't mean that 01:45 is not in the range I intended to cover. Therefore I expected to get timestamps between 01:00 and 02:00 back if I resample to a higher frequency, or at least to have an option for that in .resample, which I don't seem to have.

As far as I can tell,
pd.date_range("2015-01-01", "2015-01-01 02:00", freq="1H", closed="left")
is exactly the same as
pd.date_range("2015-01-01", "2015-01-01 01:00", freq="1H")

but you usually have your reasons for going one step further with an open end, as opposed to stopping one step earlier with a closed end.
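The equivalence claimed above is easy to check. A small sketch, again assuming pandas 1.4 or later (so `inclusive="left"` stands in for the `closed="left"` used in this thread):

```python
import pandas as pd

# Half-open range [00:00, 02:00) sampled hourly.
half_open = pd.date_range("2015-01-01", "2015-01-01 02:00", freq="h", inclusive="left")

# Closed range [00:00, 01:00] sampled hourly.
closed = pd.date_range("2015-01-01", "2015-01-01 01:00", freq="h")

# Both are the same two timestamps; nothing in the index records
# that the first range was meant to extend up to 02:00.
same = half_open.equals(closed)
print(same)           # True
print(half_open[-1])  # 2015-01-01 01:00:00
```

Since the resulting index holds only the points 00:00 and 01:00, any later resample has no way to recover the intended right boundary from the index alone.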

@jreback
Contributor

jreback commented Jul 1, 2015

pd.date_range("2015-01-01", "2015-01-01 02:00", freq="1H", closed="left")

stops at 1, so not sure how that should somehow magically go to 2. A closed/open right hand interval would generally include/exclude a single right hand point (e.g. the 1 in this case).

If you want to upsample to 2, then simply reindex it in the first place. Resample is already quite magical; this would add another layer.

All that said if you think that you can find a reasonable api that preserves back-compat. go for it.

@decatur

decatur commented Jun 9, 2017

It took me hours to land here. My understanding is that
index = pd.date_range("2015-01-01", "2015-01-01 01:00", freq="1H")
represents len(index)=2 hourly slots, even more so if you replace date_range with period_range.
So I did expect that
s = pd.Series([1,2], index=index)
s.resample("15T").pad()
has 2*4=8 slots, not 5. Having to do
s.reindex((pd.date_range(index[0], index[-1]+1, freq='15T')-1)[1:], method='pad')
to get
2015-01-01 00:00:00    1
2015-01-01 00:15:00    1
2015-01-01 00:30:00    1
2015-01-01 00:45:00    1
2015-01-01 01:00:00    2
2015-01-01 01:15:00    2
2015-01-01 01:30:00    2
2015-01-01 01:45:00    2
is pretty much a workaround. Btw I am also into energy.
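The reindex expression above relies on integer arithmetic on timestamps (`index[-1]+1`), which newer pandas versions no longer accept. A minimal sketch of the same idea, assuming pandas 1.4 or later and a regularly spaced index with at least two points; `upsample_slots` is a hypothetical helper name, not a pandas function:

```python
import pandas as pd

def upsample_slots(s: pd.Series, new_freq: str) -> pd.Series:
    """Hypothetical helper: treat each timestamp as the start of a slot
    and forward-fill every slot, including the last one."""
    index = s.index
    # Width of one original slot, taken from the first two timestamps
    # (assumes the index is regularly spaced).
    step = index[1] - index[0]
    # Extend the range by one slot, then exclude the new right endpoint.
    target = pd.date_range(index[0], index[-1] + step, freq=new_freq, inclusive="left")
    return s.reindex(target, method="ffill")

index = pd.date_range("2015-01-01", "2015-01-01 01:00", freq="h")
s = pd.Series([1, 2], index=index)

result = upsample_slots(s, "15min")
print(result.tolist())   # [1, 1, 1, 1, 2, 2, 2, 2]
print(result.index[-1])  # 2015-01-01 01:45:00
```

This avoids having to remember the original end point, at the cost of assuming that the last slot is as wide as the first.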

@winklerand
Contributor

@decatur
I think this basically boils down to the difference between DatetimeIndex and PeriodIndex - and one current shortcoming of resampling PeriodIndex.

"My understanding is that index = pd.date_range("2015-01-01", "2015-01-01 01:00", freq="1H") represents len(index)=2 hourly slots"

In my understanding, the DatetimeIndex represents two points in time which are 1h apart, but it does not carry the notion of a time span, duration or "slots". Therefore, the last datetime in the index ("2015-01-01 01:00") is just a time instant, which is not going to be upsampled/extended to 4 "sub-periods".

PeriodIndex would be the right fit to represent time spans - currently, resampling does not work properly when called with frequency multiples such as freq='15T' (freq='T' works fine). I opened issue #15944 and tried to fix it in PR #16153. Unfortunately, I was busy with other stuff - trying to pick up on that one again in the coming days.

Btw: energy data here as well ;-)
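To illustrate the distinction drawn above: a Period knows where its span ends, so the right boundary can be read off the index instead of being remembered. The sketch below is one possible approach under that assumption, not the fix from the referenced PR. It deliberately avoids calling resample on a PeriodIndex (the path reported above as problematic for frequency multiples) and instead uses `Period.start_time`, `Period.end_time` and `Series.to_timestamp`.

```python
import pandas as pd

# Two hourly periods: each one is a span, not an instant.
pidx = pd.period_range("2015-01-01 00:00", "2015-01-01 01:00", freq="h")
s = pd.Series([1, 2], index=pidx)

# The last period ends just before 02:00, so the target range
# naturally covers the final hour as well.
start = pidx[0].start_time
stop = pidx[-1].end_time
target = pd.date_range(start, stop, freq="15min")

# Convert to timestamps at the start of each period, then forward-fill.
expanded = s.to_timestamp(how="start").reindex(target, method="ffill")

print(expanded.tolist())   # [1, 1, 1, 1, 2, 2, 2, 2]
print(expanded.index[-1])  # 2015-01-01 01:45:00
```

The result is indexed by timestamps rather than periods; converting back with `to_period` is left out here to keep the sketch short.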

@jreback jreback added Difficulty Intermediate Frequency DateOffsets Period Period data type labels Jun 9, 2017
@jreback
Contributor

jreback commented Jun 9, 2017

Yeah, @winklerand's solution is the right one here (along with some docs) for doing this.

@jreback jreback added this to the Next Major Release milestone Jun 9, 2017
@mroeschke mroeschke added Enhancement and removed Frequency DateOffsets labels May 11, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022