Skip to content

API: offsets.Day should be one calendar day instead of 24 hours #20633

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks
mroeschke opened this issue Apr 8, 2018 · 12 comments
Closed
3 tasks

API: offsets.Day should be one calendar day instead of 24 hours #20633

mroeschke opened this issue Apr 8, 2018 · 12 comments
Labels
API Design Frequency DateOffsets Master Tracker High level tracker for similar issues Timezones Timezone data dtype

Comments

@mroeschke
Copy link
Member

mroeschke commented Apr 8, 2018

xref issues

As I understand the offset classes, they are supposed to respect transitions in "wall time" instead of absolute time. Currently the Day offset is currently defined to be 24 hours instead of 1 calendar day which is problematic to respect "wall time" wrt DST transitions.

In [25]: foo
Out[25]: Timestamp('2016-10-30 00:00:00+0300', tz='Europe/Helsinki') # DST change on this day

In [26]: foo + pd.tseries.offsets.DateOffset(weeks=1)
Out[26]: Timestamp('2016-11-06 00:00:00+0200', tz='Europe/Helsinki')

In [27]: foo + pd.tseries.offsets.Week()
Out[27]: Timestamp('2016-11-06 00:00:00+0200', tz='Europe/Helsinki') # respects calendar transition

In [28]: foo + pd.Timedelta(weeks=1)
Out[28]: Timestamp('2016-11-05 23:00:00+0200', tz='Europe/Helsinki')

In [29]: foo + pd.tseries.offsets.DateOffset(days=1)
Out[29]: Timestamp('2016-10-31 00:00:00+0200', tz='Europe/Helsinki')

In [30]: foo + pd.tseries.offsets.Day()
Out[30]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki') # does not respects calendar transition

In [31]: foo + pd.Timedelta(days=1)
Out[31]: Timestamp('2016-10-30 23:00:00+0200', tz='Europe/Helsinki')

From prior issues, #20596, #16980, #8774, it seems like many are confused that 1 day = 24 hours when the DST transition comes up and defining 1 day = 1 calendar day would be more intuitive and consistent with other offsets. @

@mroeschke mroeschke changed the title API: Change pd.offsets.Day to be one calendar day instead of 24 hours API: Change offsets.Day to be one calendar day instead of 24 hours Apr 8, 2018
@mroeschke mroeschke changed the title API: Change offsets.Day to be one calendar day instead of 24 hours API: offsets.Day should be one calendar day instead of 24 hours Apr 8, 2018
@jbrockmendel
Copy link
Member

+1. As it is, Ticks are all redundant with Timedeltas.

@jreback jreback added API Design Frequency DateOffsets Timezones Timezone data dtype Master Tracker High level tracker for similar issues labels Apr 9, 2018
@jreback jreback added this to the Next Major Release milestone Apr 9, 2018
@jreback
Copy link
Contributor

jreback commented Apr 9, 2018

@mroeschke happy for you to have a look at the implications of changing this. i consolidated the other issues to this one.

@mroeschke
Copy link
Member Author

So currently Day (and smaller freq resolutions) are subclassed from Tick (which subclasses from DateOffset) instead of directly subclassing from DateOffset like other offsets. I think the easiest path for gaining (intuitive) consistency between offsets is to have Day directly from DateOffset.

It appears the main reason why Tick was created was to have a separate path for ops compared to DateOffsets (which creates this confusing behavior), but is there another reason (historical or other feature) why Day should still subclass from Tick?

@jbrockmendel
Copy link
Member

I can't speak to the historical reasons for the design, but I know that the Tick classes have much more performant implementations of a number of methods.

The easiest way to change the behavior to that suggested by the OP is to add the line _adjust_dst = True to the Day class.

@mroeschke
Copy link
Member Author

Thanks for the tip @jbrockmendel; good enough reason to keep Tick around. I'll start exploring that avenue then.

@mroeschke
Copy link
Member Author

mroeschke commented Apr 27, 2018

Regarding date_range(start, end, freq=...) freq parameter respecting calendar time, Should downsampling also respect calendar day across dst? For example:

In [12]: df
Out[12]:
                           0
2016-03-09 09:33:20-06:00  1
2016-03-15 11:33:20-05:00  2

# On my branch, respecting wall time
In [13]: df.resample('12h', closed='right', label='right').last().ffill()
Out[13]:
                             0
2016-03-09 12:00:00-06:00  1.0
2016-03-10 00:00:00-06:00  1.0
2016-03-10 12:00:00-06:00  1.0
2016-03-11 00:00:00-06:00  1.0
2016-03-11 12:00:00-06:00  1.0
2016-03-12 00:00:00-06:00  1.0
2016-03-12 12:00:00-06:00  1.0
2016-03-13 00:00:00-06:00  1.0
2016-03-13 12:00:00-05:00  1.0
2016-03-14 00:00:00-05:00  1.0
2016-03-14 12:00:00-05:00  1.0
2016-03-15 00:00:00-05:00  1.0
2016-03-15 12:00:00-05:00  2.0

# Currently in master, respecting absolute time
In [14]: expected
Out[14]:
                             0
2016-03-09 12:00:00-06:00  1.0
2016-03-10 00:00:00-06:00  1.0
2016-03-10 12:00:00-06:00  1.0
2016-03-11 00:00:00-06:00  1.0
2016-03-11 12:00:00-06:00  1.0
2016-03-12 00:00:00-06:00  1.0
2016-03-12 12:00:00-06:00  1.0
2016-03-13 00:00:00-06:00  1.0
2016-03-13 13:00:00-05:00  1.0
2016-03-14 01:00:00-05:00  1.0
2016-03-14 13:00:00-05:00  1.0
2016-03-15 01:00:00-05:00  1.0
2016-03-15 13:00:00-05:00  2.0

resample uses some date_range code and therefore will get impacted. IMO this seems like a pretty undesirable change, but I wanted to get some feedback.

@jbrockmendel
Copy link
Member

@mroeschke would it make sense in this context to distinguish between "12h" and ".5d" as the argument to df.resample?

@mroeschke
Copy link
Member Author

mroeschke commented Apr 28, 2018

It would be nice to distinguish between 12 hours of absolute time and 12 hours ("0,5 days") of wall time, it seems the Pandas resampling ship has (rightly) sailed toward binning with absolute time (or would take a lot of effort to support resampling by wall time). For example:

In [8]: idx = pd.date_range(pd.Timestamp('2016-10-30 01:00:00+0300', tz='Europe/Helsinki'), pd.Timestamp('2016-10-30 04:00:00+0200', tz='Europe/Helsinki'), freq='H')

In [9]: df = pd.DataFrame([1, 1, 1, 2, 2], index=idx)

In [10]: df
Out[10]:
                           0
2016-10-30 01:00:00+03:00  1
2016-10-30 02:00:00+03:00  1
2016-10-30 03:00:00+03:00  1
2016-10-30 03:00:00+02:00  2
2016-10-30 04:00:00+02:00  2

# Currently resampling (I think the last element might be a bug?)
In [11]: df.resample('3H').max()
Out[11]:
                             0
2016-10-30 00:00:00+03:00  1.0
2016-10-30 03:00:00+03:00  2.0
2016-10-30 05:00:00+02:00  NaN

# Theoretical resampling by "wall time"
# include the value associated with 03:00:00+02:00 in the first bin
In [12]: df.resample('3H').max()
Out[12]:
                             0
2016-10-30 00:00:00+03:00  2.0
2016-10-30 03:00:00+03:00  2.0
2016-10-30 05:00:00+02:00  NaN

Plus, I wouldn't know what the correct bin label would be when resampling with wall time, (03:00:00+03:00 or 03:00:00+02:00 for Out[12]?)

@mroeschke
Copy link
Member Author

To summarize thoughts:

  1. I think Timestamp, DatetimeIndex, and related ops with Ticks can be feasibly converted to wall time arithmetic.

  2. Having date_range's freq param refer to calendar offsets is trickier since it impacts resample, unless constructing the DatetimeIndex after resample is done in a more custom manner.

@jbrockmendel
Copy link
Member

@mroeschke did you ever try this with setting _adjust_dst = True in the Day class?

@mroeschke
Copy link
Member Author

I remember trying that local once but it didn't solve all cases i.e. I think either Timestamp or DatetimeIndex or both had specific Dateoffset/Tick ops logic that doesn't consider _adjust_dst .

@mroeschke
Copy link
Member Author

mroeschke commented Sep 9, 2018

Closed by #22288

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Frequency DateOffsets Master Tracker High level tracker for similar issues Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants