Skip to content

API: Add Calendar Day Frequency ("CD") #22274

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mroeschke opened this issue Aug 10, 2018 · 31 comments · Fixed by #22288
Closed

API: Add Calendar Day Frequency ("CD") #22274

mroeschke opened this issue Aug 10, 2018 · 31 comments · Fixed by #22288
Labels
API Design Blocker Blocking issue or pull request for an upcoming release Frequency DateOffsets
Milestone

Comments

@mroeschke
Copy link
Member

This is a proposal to replace #20633.

offsets.Day ('D') is documented to represent calendar day; however, Day arithmetic and usage of the 'D' offset alias (e.g. df.resample('D'), timedelta_range(..., freq='D'), etc.) currently respects absolute time (i.e. Day acts like Timedelta(days=1)). This usage is ingrained pretty deeply in a lot of methods and operations and is fairly difficult to walk back to respect the notion of calendar day as stated in the docs.

Instead, I propose to keep Day as is (a timedelta-like frequency like Hour ('H')) and add a new frequency CalendarDay, 'CD'. It would act very similarly to DateOffset(day=1) but act like a frequency. Thoughts?

@jbrockmendel jbrockmendel changed the title API: Add Calandar Day Frequency ("CD") API: Add Calendar Day Frequency ("CD") Aug 11, 2018
@chris-b1
Copy link
Contributor

Only partially related, but I've also thought it would be nice to have frequencies for other DateOffset cases, e.g. a calendar year like the below.

In [7]: pd.date_range('2014-01-15', periods=10, freq=pd.DateOffset(years=1))
Out[7]: 
DatetimeIndex(['2014-01-15', '2015-01-15', '2016-01-15', '2017-01-15',
               '2018-01-15', '2019-01-15', '2020-01-15', '2021-01-15',
               '2022-01-15', '2023-01-15'],
              dtype='datetime64[ns]', freq='<DateOffset: kwds={'years': 1}>')

So, maybe a whole set of 'CD', 'CM', 'CA'?

@jreback jreback added this to the 0.24.0 milestone Aug 22, 2018
@jorisvandenbossche
Copy link
Member

e.g. df.resample('D') ... currently respects absolute time

@mroeschke can you give an example of that?

For example the below has a DST transition, but it keeps calendar days:

In [17]: pd.__version__
Out[17]: '0.23.4'

In [18]: idx = pd.date_range("2016-10-30", freq='H', periods=4*24, tz='Europe/Helsinki')

In [20]: pd.Series(index=idx).resample('D').count()
Out[20]: 
2016-10-30 00:00:00+03:00    0
2016-10-31 00:00:00+02:00    0
2016-11-01 00:00:00+02:00    0
2016-11-02 00:00:00+02:00    0
Freq: D, dtype: int64

@mroeschke
Copy link
Member Author

Ah yes, you're correct @jorisvandenbossche. I had misspoke; df.resample('D') did respect calendar day.

Basically the crux of the issue is that Day ('D') is a subclass of Tick. Ticks act like Timedeltas and their arithmetic respects absolute time, but it's special cased in some operations like date_range and resample to respect calendar time:

In [1]: ts = pd.Timestamp('2016-10-30 00:00:00', tz='Europe/Helsinki')

In [2]: pd.date_range(start=ts, freq='D', periods=3)
Out[2]:
DatetimeIndex(['2016-10-30 00:00:00+03:00', '2016-10-31 00:00:00+02:00',
               '2016-11-01 00:00:00+02:00'],
              dtype='datetime64[ns, Europe/Helsinki]', freq='D')

In [3]: ts + pd.tseries.frequencies.to_offset('D')
Out[3]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')

So I think this is confusing, inconsistent to users, and the codebase has special caseslike if isinstance(offset, Day): #break Tick arithmetic.

It's pretty difficult to convert 'D' to just respect calendar day across the board because 'D' it's fully ingrained with timedeltas (absolute time) as well. Therefore, operations like timedelta_range(..., freq='D') would break if 'D' was converted to a calendar day i.e adding a calendar day to and absolute day doesn't make much sense.

Overall I think it's easier and cleaner to make 'D' respect absolute time across the board and create a new CalendarDay, 'CD', offset

@jorisvandenbossche jorisvandenbossche added the Blocker Blocking issue or pull request for an upcoming release label Sep 7, 2018
@jorisvandenbossche
Copy link
Member

So I think this is confusing, inconsistent to users

I don't debate the inconsistency, and that it would be nice to clean this up.

But there might be other ways to solve that.
In which cases does 'D' actually behave as 24H tick? Only when explicitly using an offset object like in your example? (or are there other cases? it would be good to better document this in this issue)
I think the cases where it is used as calendar day (resample, date_range) are more common, so we could also think about changing it the other way around?

timedelta_range(..., freq='D') would break if 'D' was converted to a calendar day

Yep, but we could say that 'D' is here interpreted as 24H, as timedelta's cannot hold a non-fixed frequency anyway


The current PR breaks (all in case of tz-aware data of course):

  • the result of date_range
  • the result of resample('D') (both the index and the values): the result are no longer daily values
  • plotting of a result of any of the above (since the frequency is not regular any more, it uses a different code path)

IMO that impact is too big to be justifiable, but that's of course open for discussion. But I think we can certainly not do such a change without proper deprecations.

@mroeschke
Copy link
Member Author

mroeschke commented Sep 8, 2018

Cases where 'D' behave as 24H tick:

  • Timestamp/DatetimeIndex Arithmetic
In [2]: ts = pd.Timestamp('2016-10-30 00:00:00', tz='Europe/Helsinki')

In [6]: ts + pd.tseries.frequencies.to_offset('D')
Out[6]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')

In [7]: pd.DatetimeIndex([ts]) + pd.tseries.frequencies.to_offset('D')
Out[7]: DatetimeIndex(['2016-10-30 23:00:00+02:00'], dtype='datetime64[ns, Europe/Helsinki]', freq=None)
  • Shift
In [3]: dti = pd.date_range(start=ts, freq='D', periods=3)

In [4]: dti
Out[4]:
DatetimeIndex(['2016-10-30 00:00:00+03:00', '2016-10-31 00:00:00+02:00',
               '2016-11-01 00:00:00+02:00'],
              dtype='datetime64[ns, Europe/Helsinki]', freq='D')

In [5]: dti.shift(1, 'D')
Out[5]:
DatetimeIndex(['2016-10-30 23:00:00+02:00', '2016-10-31 23:00:00+02:00',
               '2016-11-01 23:00:00+02:00'],
              dtype='datetime64[ns, Europe/Helsinki]', freq='D')

And the only reason 'D' behaves like 1 calendar day in date_range (and therefore resample since date_range is used to generate bins) is because of this special casing which gives me the impression that Day is being forced to operate in a way it wasn't intended:

if hasattr(freq, 'delta') and freq != offsets.Day():

Also if 'D' were to follow a calendar day, I am not a big fan of it meaning 23-25 hours for datetimes with timezones and 24 hours for timedeltas. I would prefer if D was consistent across the board.

To ease the transition for date_range, the default frequency could be 'CD' instead of 'D' to still default to calendar day frequency. Sure, there's no easy transition for resample('D') for datetimetz data.

@jorisvandenbossche
Copy link
Member

because of this special casing which gives me the impression that Day is being forced to operate in a way it wasn't intended:

Or you could also say: the behavior as calendar day for date_range/resample was very much intentional because we explicitly special cased for it .. :)


The shift is a pitty .. but still, we could also opt to make that consistent with date_range/resample, instead of the other way around.

I personally do never use pd.tseries.frequencies.to_offset explicitly, or hardly ever shift, but I do use date_range and resample a lot. I would think many users have a similar pattern, but of course different use cases give different usage patterns, so hard to say.
But given my usage pattern, for me every time I use the string 'D', I would need to change it to 'CD', and I actually would almost never want 'D'. So then it seems a strange choice to opt for changing the common use cases of 'D', instead of the rare ones. But again, I don't know how typical my usage pattern is.

@mroeschke
Copy link
Member Author

the behavior as calendar day for date_range/resample was very much intentional because we explicitly special cased for it .. :)

Sure it was intentional, but it ultimately conflated the original meaning of 'D'. And I don't think a special case should force 'D' to change its original meaning.

My biggest -1 is the potential for 'D' to have calendar day semantics with timestamps and absolute day semantics with timedeltas. ('CD' cannot be currently added to timedeltas). Even if this would be documented, I am not a fan of the mixed semantics and feels fundamentally wrong.

For all intents and purposes, 'CD' will be equivalent to 'D' except for a timezone aware timeseries with a DST transition. I am not sure what subset of users this represents (I'd be very curious to know as well). If we release a v0.24rc, we can gauge if the community feels strongly about this API change.

I am open for date_range to default to 'CD' for continuity from prior behavior.

Any thoughts @jbrockmendel?

@jorisvandenbossche
Copy link
Member

My biggest -1 is the potential for 'D' to have calendar day semantics with timestamps and absolute day semantics with timedeltas.

Timedelta's have by definition no timezone, and for tz-naive data 'D' effectively means calendar day.
But of course when combining tz-aware timestamps with timedelta's, you indeed get 24H behaviour, and not calendar day. I agree that is not ideal.

timezone aware timeseries with a DST transition. I am not sure what subset of users this represents

I might not be the majority of our users, but certainly a significant subset. But for me the question is rather: of those users (the ones that use timezones), how many use explicit arithmetic with offsets or use shift, or how many use resample and date_range. And I think here the majority will be in the latter group, which is affected by this change.

If we release a v0.24rc, we can gauge if the community feels strongly about this API change.

That shouldn't wb the goal of an rc IMO. When we release an rc, it should be what we think is best, and to catch situations or breadth of impact we have not anticipated ourselves. Here I think we know quite well that some prominent use cases will break.


For me, the main issue I have: when having tz aware data, resample('D') will almost never be what users expect or want, I think. And needing to remember to do 'CD' instead of 'D' when having tz aware data, would be a big gotcha that we are adding.


To be clear, I think we should further discuss this (I am not necessarily tied to having 'D' be calendar day), but I object strongly to what is now in master. If we choose to go that route, it should at least be with a deprecation cycle.

@jorisvandenbossche
Copy link
Member

cc @pandas-dev/pandas-core I think this is worth a broader discussion than only Matt and me :)

Very short (probably too short and possibly biased) summary:

Problem:

The string 'D' means different things in different places:

  • to_offset('D') == Day() is a 24H tick offset, pd.to_timedelta(1, 'D') is a 24H timedelta. So both are strict 24H when using them in arithmetic, also when dealing with tz-aware data with a DST transition.
  • date_range(.., freq='D') and resample('D') special case 'D' to mean calendar day and not fixed 24H (to not get a shift in the hour when passing a DST transition)
  • DatetimeIndex(..).freq actually gives a Day() offset, even when the diff between all values is not all exactly 1 day.

PR merged in master

#22288 solved this inconsistency by changing 'D' to always mean 24H, and add CalendarDay() ('CD') to cover the calendar day use case.

The benefit is that this removed the inconsistency in interpretation of 'D' and the differences in arithmetic operations vs date_range/resample.

But this has the following consequences (all assuming you have tz-aware data, and that you have data covering at least one DST transition):

  • This breaks the output of date_range(..., freq='D')
  • This breaks the output of resample('D')
  • Both will not return a 'regular' frequency anymore, which also affects eg the plotting behaviour.

So the questions are: is solving this inconsistency worth such a break (and I think my opinion is clear here :)), or is there a possible deprecation path, or are there alternatives to solve the inconsistency?

@mroeschke
Copy link
Member Author

Not that bias at all @jorisvandenbossche :) thanks for the summary!

@TomAugspurger
Copy link
Contributor

I am open for date_range to default to 'CD' for continuity from prior behavior.

Does that go for .resample as well? Or do you think people would typically prefer the absolute version for that?

@mroeschke
Copy link
Member Author

resample does not have a default rule code; a user would need to specify 'D' or 'CD'

@jorisvandenbossche
Copy link
Member

do you think people would typically prefer the absolute version for that?

Apart from that there is no default we can change for resample, I personally can't think of a use case where you would ever want the absolute version (and even if you want that, I think '24H' would be much more explicit)

@TomAugspurger
Copy link
Contributor

Ah, of course.

One (probably bad option) is to special case the string "D" just for resample, and warn that it's treated as "CD". If people really want an absolute 24H, then they can use either "24H" or pd.tseries.offsets.Day(). Does that make any sense?

@jorisvandenbossche
Copy link
Member

One (probably bad option) is to special case the string "D" just for resample, and warn that it's treated as "CD". If people really want an absolute 24H, then they can use either "24H" or pd.tseries.offsets.Day().

That is basically the current situation, no?
(not to say that it is not an option, it could turn out to be the least worst of the possible solutions)

@mroeschke
Copy link
Member Author

Hmm that would be interesting depreciation cycle of df.resample('D'). Just so I understand, df.resample('D') would still act like df.resample('CD') but give a depreciation warning that 'D' will change behavior for resample in a later release? (if we detect tz-aware data of course)

@TomAugspurger
Copy link
Contributor

Right, so it would still be a breaking change for people doing df.resample(pd.tseries.offsets.Day()), but I don't know how common that is. I suspect it's less common than .resample("D") though.

The benefit of treating "D" as "CD", but immediately changing .offsets.Day() is that we don't have to introduce a keyword just for the deprecation.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 14, 2018

I am not sure Tom's latest suggestion was to deprecate ("warn" can of course point to that), or actually keep it as an alias.

If we keep the idea of the merged PR, that's the minimum we should do: deprecate first before actually changing behaviour. I don't think that would be hard: simply keep date_range and resample as they are, but raise a deprecation warning for 'D' in case there will be a difference in the future.

But for me, the question is still if we want that. For example, IMO it would be quite annoying that you need to remember to do resample('CD') instead of resample('D') just because you have tz-aware data.

@jorisvandenbossche
Copy link
Member

Right, so it would still be a breaking change for people doing df.resample(pd.tseries.offsets.Day()), but I don't know how common that is

Indeed less common, but we could still raise a similar deprecation warning for that case as well, no?
Or the reason you would not raise a deprecation warning in this case is because that would be hard to avoid for users who actually want to use Day() ? They can still use '24H' in the meantime, so I would personally be thorough with deprecating.


In my subjective assessment*, for the majority of our users, 'D' has always meant "calendar day", and only for a minor part it actually meant absolute time (the users that used offsets.Day() directly).
So my proposal would be to actually formalize what most people used: make 'D' calendar day, and deprecate the absolute behaviour of offset.Day().
This will probably be a more complicated deprecation, but one that will impact less users IMO (details need to be worked out of course, didn't think it fully through).

* any way we could check this? Do we know big downstream projects that use heavily tz-aware time series data?

@mroeschke
Copy link
Member Author

mroeschke commented Sep 14, 2018

FWIW @jorisvandenbossche I was in favor of 'D' meaning "calendar day" for all operations #20633. As I started refactoring, my biggest hurdle was 'D' is also heavily used within our Timedelta ops, and having Day act like a DateOffset(relativetimedelta) instead of a Tick (timedelta) broke a lot of tests and would take a lot of monkeypatching for 'D' to mean "calendar day" for timestamps and "absolute time" for timedeltas.

I didn't think it was worth the headache or the mincing of conventions since 'D' was built to act like a timedelta. So after that investigation it made me realize "we've been incorrectly using and documenting 'D' this entire time".

@jorisvandenbossche
Copy link
Member

Can you clarify what exactly broke for Timedelta usage? You have the creation where you can use such a string like pd.to_timedelta(.., unit='D') or pd.timedelta_range(..., freq='D'). Those we could still special case to keep them working (what of course would introduce a new inconsistency), or deprecate if we want.
But what were the other things that broke? I don't directly see how for a timedelta there would be a lot of change (but I don't use this often in combination with tz aware data), exactly because a 'day' in timedelta sense is always the same as 24H, since timedelta has no notion of timezones (just like 'D' and 'CD' are the same for tz-naive datetimes). Combining tz-aware datetimes with timedeltas would keep their absolute behaviour, as it was before.

@jbrockmendel
Copy link
Member

Sorry for being AWOL.

If people really want an absolute 24H, then they can use either "24H"

If we were to start from scratch, this is definitely the path I would want to go with.

There is a set of users/use cases that have tz-aware data, want CD-like behavior, and currently don't have it. For that subset, adding the "CD" option is a strict improvement, and nobody else has to be affected. This strikes me as a reasonable second-best.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 14, 2018

@jbrockmendel Can you clarify what you mean with adding the CD option?
Because you say "nobody else has to be affected" on adding 'CD' offset. Do you mean with that the merged PR (I don't see how that would not affect a lot of users, see my summary above), or adding 'CD' without changing meaning of 'D' (but then I don't see the need for 'CD') ?

@jbrockmendel
Copy link
Member

Can you clarify what you mean with adding the CD option?

It's possible I failed to follow the conversation closely enough. With that in mind:

For users with tz-naive data, D and CD are equivalent, so these users should be unaffected.

For users with tz-aware data who want the <=0.23.0 behavior, keeping "D" equivalent to "24H" should leave them unaffected (or has the meaning of "D" changed?)

It is the remaining subset of users/cases for whom "CD" is relevant.

@jorisvandenbossche
Copy link
Member

For users with tz-aware data who want the <=0.23.0 behavior, keeping "D" equivalent to "24H" should leave them unaffected (or has the meaning of "D" changed?)

So that is the whole point of the discussion: changing "D" to be equivalent to "24H" (in date_range, resample) breaks code, so this will not leave users unaffected. See my summary comment here: #22274 (comment)

@mroeschke
Copy link
Member Author

Can you clarify what exactly broke for Timedelta usage?

Unfortunately I don't have the branch anymore, but the gist of the change involved Day subclassing DateOffset instead of Tick (because DateOffset have relative arithmetic ops paths and Ticks have absolute arithmetic ops paths with Series, Indexes, Scalars, etc.), so I'd get errors similar to the following:

# Day as "calendar day" was implemented so that 'D'
# essentially equivalent as pd.DateOffset(days=1) 

In [4]: pd.Timedelta('1day') + pd.DateOffset(days=1)
TypeError: Cannot convert input [relativedelta(days=+2)] of type <class 'dateutil.relativedelta.relativedelta'> to Timestamp

In [6]: pd.timedelta_range('1day', freq=pd.DateOffset(days=1), periods=3)
ValueError: <Day> is a non-fixed frequency

In [7]: s = pd.Series(range(3), index=pd.timedelta_range('1day', periods=3))

In [8]: s.resample('D').mean()
ValueError: <Day> is a non-fixed frequency

So two popular timedelta operations with 'D', timedelta_range and resampling, broke. This would require a calendar day 'D' to take a special path for timedeltas, but as mentioned before I am -1 on 'D' behaving like absolute time for timedeltas and relative time for datetimes.

@jschendel
Copy link
Member

I don't have any issues with the functionality being discussed, but just have a slight quibble with the usage of 'CD' as the string alias.

Prior to the introduction of 'CD', my understanding was that a prefix of 'C' was reserved for custom frequencies, much like how a prefix of 'B' basically means the "business" equivalent frequency. Perhaps that's just an unintended pattern that I happened to pick up on though.

To add to this, we have the string alias 'CD' meaning "calendar day" but the CDay offset meaning "custom business day", which just seems unnecessarily confusing:

In [2]: pd.offsets.CDay()
Out[2]: <CustomBusinessDay>

I don't think that custom frequencies are used all that often (could be wrong), so I'm not objecting to breaking the 'C' prefix pattern, but just wanted to make sure we're cognizant that we're doing so. It's also possible that there are some additional implications that I could be overlooking.

I did a quick check and found one instance within the code that exploits a prefix of 'C' meaning custom, but it doesn't look like it really breaks anything of substance. I didn't do an extensive search, so it might be worthwhile to go through the code to see if I missed something elsewhere, but I imagine anything important would have been flagged by the tests.

@jorisvandenbossche
Copy link
Member

the gist of the change involved Day subclassing DateOffset instead of Tick (because DateOffset have relative arithmetic ops paths and Ticks have absolute arithmetic ops paths with Series, Indexes, Scalars, etc.), so I'd get errors similar to the following

@mroeschke that there were some breakages was to be expected I think? In the merged PR, there were also breakages, as you needed to change existing tests to get them passing.

So two popular timedelta operations with 'D', timedelta_range and resampling, broke

Yes, and in master two popular tz-aware operations with 'D' (the same ones) broke. I don't know which one of the two is more popular :-)
Anyhow, what I try to say is: both approaches have similar breaking consequences. So IMO neither of the possible breakages is necessarily a reason to prefer the other approach, and more importantly, with both approaches we need to think about how to properly deprecate / keep workarounds or aliases alive to minimize breakages.

I am -1 on 'D' behaving like absolute time for timedeltas and relative time for datetimes.

I currently don't see an 'ideal' solution, so in that light, personally I find special casing 'D' as absolute for timedelta reasonable. 'D' basically is absolute time in practice for most cases, only for tz-aware data it might deviate. And since timedelta's are always tz-naive, allowing 'D' for timedeltas for convenience, is not that "strange" to me.

People are used to use 'D', and people will continue to use 'D' for tz-naive datetime data. Having to remember to switch to 'CD' for tz-aware data is for me the biggest problem I have with the behaviour merged PR. Plus the fact that 'D', the one people are used to use, basically becomes useless for tz-aware data.

@mroeschke
Copy link
Member Author

mroeschke commented Sep 20, 2018

I agree that either change will incur breaking changes and that we'll need to adapt some depreciation behavior.

Besides the potential mixed relative/absolute semantics that 'D' could represent (that I am not a fan of), I am also concerned about the potential bugs and gotchas that it might introduce as there might be other places where datatime and timedelta transformations are tightly coupled.

Just curious, if 'CD' was kept, would you be okay with a depreciation cycle of resample('D') where 'D' was just mapped to 'CD'?

Also I think we should also continue this discussion at the dev chat next week so we can have a more contiguous discussion with everyone else.

@TomAugspurger
Copy link
Contributor

Quick summary from the meeting:

  1. We want consistency in the meaning of the string 'D' between contexts (resample / date_range & timedelta / offsets)
  2. We suspect that resample('D') / date_range(freq='D') are more commonly used, and so don't want to change that behavior
  3. @mroeschke has heroically volunteered to see if Timedelta("<n>D") (anywhere else in offsets?) can be deprecated in favor of some other string (e.g. '24H', '2TD')

@mroeschke
Copy link
Member Author

Closing in favor of further discussion in #22864 (especially about deprecation procedure)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Blocker Blocking issue or pull request for an upcoming release Frequency DateOffsets
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants