-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: fix origin epoch when freq is Day and harmonize epoch between timezones #34474
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
9de9dcd
to
9480311
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm @mroeschke if you’d have a quick look
9480311
to
0635651
Compare
0635651
to
c42e788
Compare
c42e788
to
c78e29b
Compare
if isinstance(freq, Tick): | ||
index_tz = first.tz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this actually be close to where you set the origin (e.g. in the TimeGrouper)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have access to the variables first
or end
to infer the index timezone in TimeGrouper.__init__
so I don't think I can do what you suggest. 😕
We used to infer the index_tz
exactly on the same place before #31809 as you can see here on the tag v.1.0.4
:
pandas/pandas/core/resample.py
Line 1650 in 1ce1c3c
tz = first.tz |
In my opinion that make sense since the API of groupby
(shared with the resample
API) can accept any kind of grouper, so the bins (that needs the information of first
and end
) can only be calculated when we apply the grouper against a Series or a DataFrame.
import pandas as pd
import numpy as np
start, end = "2000-08-02 23:30:00+0500", "2000-12-02 00:30:00+0500"
rng = pd.date_range(start, end, freq="7min")
ts_1 = pd.Series(np.random.randn(len(rng)), index=rng)
ts_2 = pd.Series(np.random.randn(len(rng)), index=rng)
# => The example line that create this impossibility:
grouper = pd.Grouper(freq="D", origin="epoch")
result_1 = ts_1.groupby(grouper).mean()
result_2 = ts_2.groupby(grouper).mean()
thanks @hasB4K very nice |
Follow-up of #31809.
The purpose of this PR is to fix the current behavior on master:
Outputs on master:
Expected Outputs (with this PR):
I thought of two possible solutions while fixing that:
pd.Timestamp(0, tz=index_tz)
pd.Timestamp("1970-01-01", tz=index_tz)
To have similar results between timezones, I thought the solution 2 was a better choice. The solution 2 could also help us to set the behavior
origin=epoch
as the default one in a future version since the results would be quite similar withorigin=start_day
(that is not quite the case with a DST timezone, I can provide further details if needed on why this is the case).(The solution 1 is correct in theory but is giving results that could be hard to explain to the end user.)
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff