-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: Add Calendar Day Frequency ("CD") #22274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Only partially related, but I've also thought it would be nice to have frequencies for other DateOffset cases, e.g. a calendar year like the below.
So, maybe a whole set of |
@mroeschke can you give an example of that? For example the below has a DST transition, but it keeps calendar days:
|
Ah yes, you're correct @jorisvandenbossche. I had misspoke; Basically the crux of the issue is that
So I think this is confusing, inconsistent to users, and the codebase has special caseslike It's pretty difficult to convert Overall I think it's easier and cleaner to make |
I don't debate the inconsistency, and that it would be nice to clean this up. But there might be other ways to solve that.
Yep, but we could say that 'D' is here interpreted as 24H, as timedelta's cannot hold a non-fixed frequency anyway The current PR breaks (all in case of tz-aware data of course):
IMO that impact is too big to be justifiable, but that's of course open for discussion. But I think we can certainly not do such a change without proper deprecations. |
Cases where
And the only reason pandas/pandas/core/indexes/datetimes.py Line 530 in 0409521
Also if To ease the transition for |
Or you could also say: the behavior as calendar day for date_range/resample was very much intentional because we explicitly special cased for it .. :) The I personally do never use |
Sure it was intentional, but it ultimately conflated the original meaning of My biggest -1 is the potential for For all intents and purposes, I am open for Any thoughts @jbrockmendel? |
Timedelta's have by definition no timezone, and for tz-naive data 'D' effectively means calendar day.
I might not be the majority of our users, but certainly a significant subset. But for me the question is rather: of those users (the ones that use timezones), how many use explicit arithmetic with offsets or use shift, or how many use resample and date_range. And I think here the majority will be in the latter group, which is affected by this change.
That shouldn't wb the goal of an rc IMO. When we release an rc, it should be what we think is best, and to catch situations or breadth of impact we have not anticipated ourselves. Here I think we know quite well that some prominent use cases will break. For me, the main issue I have: when having tz aware data, To be clear, I think we should further discuss this (I am not necessarily tied to having 'D' be calendar day), but I object strongly to what is now in master. If we choose to go that route, it should at least be with a deprecation cycle. |
cc @pandas-dev/pandas-core I think this is worth a broader discussion than only Matt and me :) Very short (probably too short and possibly biased) summary: Problem: The string 'D' means different things in different places:
PR merged in master #22288 solved this inconsistency by changing 'D' to always mean 24H, and add The benefit is that this removed the inconsistency in interpretation of 'D' and the differences in arithmetic operations vs date_range/resample. But this has the following consequences (all assuming you have tz-aware data, and that you have data covering at least one DST transition):
So the questions are: is solving this inconsistency worth such a break (and I think my opinion is clear here :)), or is there a possible deprecation path, or are there alternatives to solve the inconsistency? |
Not that bias at all @jorisvandenbossche :) thanks for the summary! |
Does that go for |
|
Apart from that there is no default we can change for resample, I personally can't think of a use case where you would ever want the absolute version (and even if you want that, I think '24H' would be much more explicit) |
Ah, of course. One (probably bad option) is to special case the string "D" just for resample, and warn that it's treated as |
That is basically the current situation, no? |
Hmm that would be interesting depreciation cycle of |
Right, so it would still be a breaking change for people doing The benefit of treating "D" as "CD", but immediately changing |
I am not sure Tom's latest suggestion was to deprecate ("warn" can of course point to that), or actually keep it as an alias. If we keep the idea of the merged PR, that's the minimum we should do: deprecate first before actually changing behaviour. I don't think that would be hard: simply keep But for me, the question is still if we want that. For example, IMO it would be quite annoying that you need to remember to do |
Indeed less common, but we could still raise a similar deprecation warning for that case as well, no? In my subjective assessment*, for the majority of our users, 'D' has always meant "calendar day", and only for a minor part it actually meant absolute time (the users that used offsets.Day() directly). * any way we could check this? Do we know big downstream projects that use heavily tz-aware time series data? |
FWIW @jorisvandenbossche I was in favor of I didn't think it was worth the headache or the mincing of conventions since |
Can you clarify what exactly broke for Timedelta usage? You have the creation where you can use such a string like |
Sorry for being AWOL.
If we were to start from scratch, this is definitely the path I would want to go with. There is a set of users/use cases that have tz-aware data, want CD-like behavior, and currently don't have it. For that subset, adding the "CD" option is a strict improvement, and nobody else has to be affected. This strikes me as a reasonable second-best. |
@jbrockmendel Can you clarify what you mean with adding the CD option? |
It's possible I failed to follow the conversation closely enough. With that in mind: For users with tz-naive data, D and CD are equivalent, so these users should be unaffected. For users with tz-aware data who want the <=0.23.0 behavior, keeping "D" equivalent to "24H" should leave them unaffected (or has the meaning of "D" changed?) It is the remaining subset of users/cases for whom "CD" is relevant. |
So that is the whole point of the discussion: changing "D" to be equivalent to "24H" (in date_range, resample) breaks code, so this will not leave users unaffected. See my summary comment here: #22274 (comment) |
Unfortunately I don't have the branch anymore, but the gist of the change involved
So two popular timedelta operations with |
I don't have any issues with the functionality being discussed, but just have a slight quibble with the usage of Prior to the introduction of To add to this, we have the string alias In [2]: pd.offsets.CDay()
Out[2]: <CustomBusinessDay> I don't think that custom frequencies are used all that often (could be wrong), so I'm not objecting to breaking the 'C' prefix pattern, but just wanted to make sure we're cognizant that we're doing so. It's also possible that there are some additional implications that I could be overlooking. I did a quick check and found one instance within the code that exploits a prefix of 'C' meaning custom, but it doesn't look like it really breaks anything of substance. I didn't do an extensive search, so it might be worthwhile to go through the code to see if I missed something elsewhere, but I imagine anything important would have been flagged by the tests. |
@mroeschke that there were some breakages was to be expected I think? In the merged PR, there were also breakages, as you needed to change existing tests to get them passing.
Yes, and in master two popular tz-aware operations with 'D' (the same ones) broke. I don't know which one of the two is more popular :-)
I currently don't see an 'ideal' solution, so in that light, personally I find special casing 'D' as absolute for timedelta reasonable. 'D' basically is absolute time in practice for most cases, only for tz-aware data it might deviate. And since timedelta's are always tz-naive, allowing 'D' for timedeltas for convenience, is not that "strange" to me. People are used to use 'D', and people will continue to use 'D' for tz-naive datetime data. Having to remember to switch to 'CD' for tz-aware data is for me the biggest problem I have with the behaviour merged PR. Plus the fact that 'D', the one people are used to use, basically becomes useless for tz-aware data. |
I agree that either change will incur breaking changes and that we'll need to adapt some depreciation behavior. Besides the potential mixed relative/absolute semantics that 'D' could represent (that I am not a fan of), I am also concerned about the potential bugs and gotchas that it might introduce as there might be other places where datatime and timedelta transformations are tightly coupled. Just curious, if 'CD' was kept, would you be okay with a depreciation cycle of Also I think we should also continue this discussion at the dev chat next week so we can have a more contiguous discussion with everyone else. |
Quick summary from the meeting:
|
Closing in favor of further discussion in #22864 (especially about deprecation procedure) |
This is a proposal to replace #20633.
offsets.Day
('D'
) is documented to represent calendar day; however,Day
arithmetic and usage of the'D'
offset alias (e.g.df.resample('D')
,timedelta_range(..., freq='D')
, etc.) currently respects absolute time (i.e.Day
acts likeTimedelta(days=1)
). This usage is ingrained pretty deeply in a lot of methods and operations and is fairly difficult to walk back to respect the notion of calendar day as stated in the docs.Instead, I propose to keep
Day
as is (a timedelta-like frequency likeHour
('H'
)) and add a new frequencyCalendarDay
,'CD'
. It would act very similarly toDateOffset(day=1)
but act like a frequency. Thoughts?The text was updated successfully, but these errors were encountered: