BUG: Timestamp UTC offset incorrect for dateutil tz in ambiguous DST time #31338
Comments
cc @pganssle
other options
dateutil could also expose a setting to signal if
Just in case it's useful. The reason that correction in
I don't think you've quite diagnosed the problem correctly. It is true that pandas apparently uses non-standard datetime arithmetic semantics, which could cause problems, but in this case the issue seems to be that
xref #25057 for
@pganssle

```
IN:
EPOCH = datetime.datetime.utcfromtimestamp(0)
t1 = pd.Timestamp(1382833800000000000, tz='dateutil/Europe/London')
t1.value
OUT:
1382833800000000000

IN:
t1.replace(tzinfo=None).value
OUT:
1382837400000000000
```

So we can't tell the difference between the first and the second time an ambiguous time occurs after dropping the tzinfo information. My hypothesis that
take |
@pganssle

@AlexKirko No, dateutil has a backport that returns a `datetime` subclass with a `fold` attribute if the normal `datetime` object doesn't have a `fold` attribute. In this case it shouldn't matter much, because `Timestamp` is itself a `datetime` subclass, and so the main thing that matters is that it has a meaningful `fold` attribute.
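For reference, a small sketch of how that fold handling can be exercised; `dateutil.tz.enfold` is the backport helper I believe this refers to, and the printed offsets are what PEP 495 semantics should give for this transition, so treat the exact behaviour as an assumption rather than part of the report:

```python
from datetime import datetime
from dateutil import tz

LONDON = tz.gettz("Europe/London")

# 2013-10-27 01:30 is ambiguous in Europe/London; fold selects which
# occurrence is meant (0 = first, 1 = second).
ambiguous = datetime(2013, 10, 27, 1, 30, tzinfo=LONDON)

first = tz.enfold(ambiguous, fold=0)
second = tz.enfold(ambiguous, fold=1)

print(first.utcoffset())   # expected: 1:00:00 (BST, before the switch)
print(second.utcoffset())  # expected: 0:00:00 (GMT, after the switch)
```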
Thanks, that clears things up. Started looking into the issue. Caught a cold, but I still hope to have something by the end of the week.
Well, I do have an MVP, but there is still stuff to be done, and I can't seem to find the source of the performance impact that the changes introduce.
Code Sample, a copy-pastable example if possible

If we make a `Timestamp` in an ambiguous DST period while specifying via the offset (or by supplying `Timestamp.value` directly) that the time is before the DST switch, the representation then shows that the time is after the DST switch. This is backed up by calling `Timestamp.tz.utcoffset(Timestamp)`.
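A minimal sketch of the construction being described, using the epoch-nanosecond value quoted in the comments above (the exact sample in the report may differ):

```python
import pandas as pd

# 1382833800000000000 ns since the epoch is 2013-10-27 00:30 UTC, i.e. the
# first occurrence of the ambiguous wall time 01:30 in Europe/London
# (BST, UTC+01:00), half an hour before the clocks go back.
t = pd.Timestamp(1382833800000000000, tz='dateutil/Europe/London')

print(t)                  # per the report, rendered with the post-switch
                          # offset (+00:00) instead of the expected +01:00
print(t.tz.utcoffset(t))  # offset reported by the underlying dateutil tz
```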
Problem description
The reason for this bug looks to be buried deep in the interaction of `pandas` and `dateutil`. So this is what I've been able to dig up. When we try to determine whether we are in DST or not, we rely on `timezone.utcoffset` of the underlying timezone package, so what gets executed is `dateutil`'s implementation. `dateutil` is expecting an ordinary `datetime.timedelta` object here, so what it does is call `_find_last_transition` to get the index of the last DST transition before `dt`. This is done by computing `timedelta.total_seconds` since epoch time.
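A rough paraphrase of what that lookup amounts to on the `dateutil` side (simplified, not the actual `dateutil` source; the transition list handling and edge cases are condensed):

```python
import bisect
from datetime import datetime

EPOCH = datetime(1970, 1, 1)

def find_last_transition(dt, transition_timestamps):
    """Index of the last DST transition at or before dt (paraphrased).

    `transition_timestamps` stands in for the timezone's sorted list of
    transition times, expressed as naive seconds since the epoch.
    """
    # The tzinfo is dropped and the elapsed time since the epoch is measured
    # with whatever timedelta type the subtraction produces.
    naive_seconds = (dt.replace(tzinfo=None) - EPOCH).total_seconds()
    idx = bisect.bisect_right(transition_timestamps, naive_seconds)
    return idx if idx >= 1 else None
```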
Our `pandas.Timedelta.total_seconds` is smart, and returns different `total_seconds` for the times before and after the DST switch, since we basically return `Timedelta.value`, which is the same as `Timestamp.value` when counting since epoch time (because of how `_Timestamp.__sub__` in `c_timestamp.pyx` is implemented). This is what we do (it doesn't care about `dt.replace(tzinfo=None)`), whereas `datetime.timedelta` loses DST awareness after `dt.replace(tzinfo=None)`.
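For context, a sketch of the two `total_seconds` behaviours being contrasted (a paraphrase of the idea, not the actual library sources):

```python
from datetime import timedelta

import pandas as pd

def stdlib_total_seconds(td: timedelta) -> float:
    # datetime.timedelta.total_seconds is pure duration arithmetic over
    # days/seconds/microseconds; no timezone information survives.
    return td.days * 86400 + td.seconds + td.microseconds / 1e6

def pandas_total_seconds(td: pd.Timedelta) -> float:
    # pandas.Timedelta.total_seconds is, roughly, the nanosecond payload
    # (Timedelta.value) divided down to seconds.
    return td.value / 1e9
```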
`_resolve_ambiguous_time` corrects for ambiguous times, since `datetime.timedelta.total_seconds` after `dt.replace(tzinfo=None)` isn't DST-aware. It checks whether we are in an ambiguous period and whether this is the first time this wall time has occurred: this is what `self._fold` is for. `fold` is 0 for the first occurrence and 1 for the second. If it's the first occurrence, `dateutil` shifts the relevant transition index back by 1, since it assumes that `total_seconds` always returns the number of seconds calculated using the second occurrence.
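A rough paraphrase of that correction (simplified; not the actual `dateutil` source, and the `is_ambiguous` call and index handling are condensed):

```python
def resolve_ambiguous_time(tzinfo, dt, idx):
    """Adjust the transition index found for dt (paraphrased, simplified)."""
    if idx is None or idx == 0:
        return idx
    # fold is 0 for the first occurrence of an ambiguous wall time,
    # 1 for the second; datetimes without the attribute count as fold=0.
    fold = getattr(dt, 'fold', 0)
    # The index was computed from naive seconds since the epoch, which is
    # taken to mean the second occurrence, so for fold == 0 step back one
    # transition and use the pre-switch offset instead.
    idx_offset = int(not fold and tzinfo.is_ambiguous(dt, idx))
    return idx - idx_offset
```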
I'd like to discuss how we are going to approach this. From what I see, there isn't much we can do on our end.
Making `total_seconds` non-DST-aware by default is bad, because that would be making our implementation less precise unless the user passes a parameter. Scratch that: the problem isn't so much the `total_seconds` implementation as it is the `Timestamp.__sub__` implementation, which preserves `value` when we subtract epoch time.
Another approach is to go to `dateutil` with this and implement a check there to avoid running the correction if they are dealing with a `pandas.Timedelta`. Might be tricky to do without introducing a dependency on pandas, though.

First came across this while solving #24329 in #30995
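As an illustration of what such a check could look like without taking on a pandas dependency, here is a hypothetical duck-typed guard (nothing that exists in `dateutil` today; the names and the skipped-correction usage are assumptions):

```python
def _looks_like_pandas_timedelta(td) -> bool:
    """Hypothetical guard: detect a pandas Timedelta without importing pandas."""
    cls = type(td)
    return cls.__name__ == "Timedelta" and cls.__module__.startswith("pandas")

# Sketch of how the ambiguity correction could then be skipped:
# if _looks_like_pandas_timedelta(delta):
#     return idx  # the delta is already DST-aware, no fold correction needed
```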
Expected Output
Output of `pd.show_versions()`
INSTALLED VERSIONS
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : ru_RU.UTF-8
LOCALE : None.None
pandas : 0.26.0.dev0+1947.gca3bfcc54.dirty
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200106
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.2.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.47.0