Skip to content

API: Tick arithmetic respects calendar time #22195

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from

Conversation

mroeschke
Copy link
Member

This PR changes Tick arithmetic to respect calendar time akin to Datoffset for Timestamp and DatetimIndex (and hence Series). There may be some embedded Tick work-around code (I am thinking in DatetimeIndex._generate and resample that may be refactorable after this change.


except NotImplementedError:
warnings.warn("Non-vectorized DateOffset being applied to Series "
"or DatetimeIndex", PerformanceWarning)
result = self.astype('O') + offset

return type(self)(result, freq='infer')
return type(self)(result, freq='infer', tz=self.tz)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was changed because if the code goes down the NotImplementedError path, the tz would be dropped.

DatetimeIndex
"""
# TODO: Add a vectorized DatetimeIndex.dst() method
ambiguous = np.array([bool(ts.dst()) if ts is not tslibs.NaT else False
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could delay this ambiguous calculation until an AmbiguousTimeError is raised?

@jbrockmendel
Copy link
Member

Definitely +1 on doing this for Day, but it is less clear what this means for the other Tick subclasses. What's the motivation?

@mroeschke
Copy link
Member Author

For the other Ticks, it's mostly to keep them consistent with the definition of Day e.g. a user should expect 24 Hours to behave similarly as 1 Day. Otherwise the other Tick subclasses would be redundant with Timedeltas as well.

@jbrockmendel
Copy link
Member

For the other Ticks, it's mostly to keep them consistent with the definition of Day e.g. a user should expect 24 Hours to behave similarly as 1 Day. Otherwise the other Tick subclasses would be redundant with Timedeltas as well.

I can see some usefulness here, but what about use cases where the frequency should be Timedelta-like? e.g. if I call pd.date_range('2016-01-02 03:04', periods=1000, freq='H', tz='US/Pacific'), I'd be surprised if the resulting index skipped an hour at a DST change.

This is also going to affect timedeltas.delta_to_nanoseconds, I'll have to look at this more closely.

nanos = delta_to_nanoseconds(other)
result = Timestamp(self.value + nanos,
tz=self.tzinfo, freq=self.freq)
return result

elif hasattr(other, 'delta'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to import tslibs.offsets._Tick here and make this an isinstance check (thought that might cause a circular import, not sure off the top)

@gfyoung gfyoung added API Design Frequency DateOffsets Timezones Timezone data dtype labels Aug 4, 2018
@mroeschke
Copy link
Member Author

@jbrockmendel

I think the core confusion for users is the inherent ambiguity when specifying a string frequency like 'D' on whether the strings mean calendar time or absolute time.

Pandas advertises them as (Date)offsets, but there's an implicit (and not-clearly-documented) consistency throughout the API that string offsets passed to other methods like resample or date_range respect absolute time which is consistent with pd.tseries.frequencies.to_offset('D') returning a Day Tick that also respects absolute time. (This is probably why we have a lot of kludgey code dancing around Day)

That being said, I would love for all the frequency strings to map to calendar-respecting Ticks
/DateOffsets, but that would API breaking for df.resample('D') with timezones.

For the sake of consistency, I am okay with the API break, but interested to get your initial thoughts
Probably should open this discussion up to other too.

@codecov
Copy link

codecov bot commented Aug 5, 2018

Codecov Report

Merging #22195 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22195      +/-   ##
==========================================
+ Coverage   92.06%   92.06%   +<.01%     
==========================================
  Files         169      169              
  Lines       50694    50697       +3     
==========================================
+ Hits        46671    46674       +3     
  Misses       4023     4023
Flag Coverage Δ
#multiple 90.47% <100%> (ø) ⬆️
#single 42.32% <12.5%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/tseries/offsets.py 97.16% <100%> (ø) ⬆️
pandas/core/arrays/datetimes.py 95.49% <100%> (-0.02%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 776fed3...d64ce4f. Read the comment docs.

@mroeschke
Copy link
Member Author

@jbrockmendel After thinking about it; I am convinced to only change Day to respect calendar arithmetic and leaving the rest of the ticks alone.

To do this I was thinking of having Day subclass DateOffset instead of Tick because the arithmetic paths are already determined by checking for DateOffset vs Tick. Any thoughts/reservations?

@jreback
Copy link
Contributor

jreback commented Aug 9, 2018

@mroeschke ok with your suggestion, it might make some code simpler (though likely to break some tests I think)

@jbrockmendel
Copy link
Member

I think we're on the same page. I think we should make Day (and with it "D") correspond to calendar time like this PR does.

We also need to retain a way to specify timedelta-like frequencies, which is why I think we should keep Hour, Minute, ..., Nano as they are. Then a user can specify an absolute-day as "24H". Conversely they can specify "every day at noon and midnight" as ".5D".

There is also an issue (not specific to this PR, but closely related) of how to handle ambiguous/nonexistent times. If I start at ts = pd.Timestamp("2018-11-03 01:00:00", tz='US/Pacific') and add a calendar-Day offset, do we want Timestamp('2018-11-04 01:00:00-0700', tz='US/Pacific') or Timestamp('2018-11-04 01:00:00-0800', tz='US/Pacific')?

Conversely, If I start at ts = pd.Timestamp("2019-03-09 02:00:00", tz='US/Pacific') and add a calendar-Day, do we raise NonExistentTimeError or allow some fudge-factor?

@mroeschke
Copy link
Member Author

Thanks for the input @jreback and @jbrockmendel:

In regards to the Timestamp + Day = Ambiguous Datetime issue, I am partial toward raising an AmbiguousTimeError for "fall back" DST behavior and NonExistentTimeError for "spring forward" behavior for now. In the future we can consider adding a special kwarg to Day.apply to handle these ambiguous times.

@mroeschke
Copy link
Member Author

Closing as this will require a different implimentation

@mroeschke mroeschke closed this Aug 11, 2018
@mroeschke mroeschke deleted the day_offset branch September 9, 2018 05:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Frequency DateOffsets Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: offsets.Day should be one calendar day instead of 24 hours
4 participants