Skip to content

Timestamp constructor parses ISO 8601 incorrectly near DST boundaries #8225

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ischwabacher opened this issue Sep 9, 2014 · 21 comments
Closed
Labels
Timezones Timezone data dtype

Comments

@ischwabacher
Copy link
Contributor

Original post, edited to correct ISO 8601 formats. The originally titled bug does not exist. There is, however, some unexpected behavior:

It looks like short-form ISO 8601 parsing uses the right sign after all:

In [1]: import pandas as pd

In [2]: t = pd.Timestamp('2013-11-03 1:30:00', tz='America/Havana')

In [3]: t.strftime('%Y-%m-%d %H:%M:%S %Z %z')
Out[3]: '2013-11-03 01:30:00 CST -0500'

In [4]: t.strftime('%Y%m%dT%H%M%S%z')   # ISO 8601 short format
Out[4]: '20131103T013000-0500'

In [5]: pd.Timestamp(t.strftime('%Y%m%dT%H%M%S%z'))
Out[5]: Timestamp('2013-11-03 01:30:00-0500', tz='tzoffset(None, -18000)')

In [6]: t == _
Out[6]: True   # Not a bug after all!

In [7]: t.strftime('%Y-%m-%dT%H:%M:%S%z')   # ISO 8601 long format
Out[7]: '2013-11-03T01:30:00-0500'

In [8]: pd.Timestamp(t.strftime('%Y-%m-%dT%H:%M:%S%z'))
Out[8]: Timestamp('2013-11-03 01:30:00-0500', tz='pytz.FixedOffset(-300)')   # why pytz?

In [9]: t == _
Out[9]: True

In [10]: t.strftime('%Y%m%dT%H%M%SZ%z')   # Not a real ISO 8601 format (note the 'Z')
Out[10]: '20131103T013000Z-0500'

In [11]: pd.Timestamp(t.strftime('%Y%m%dT%H%M%SZ%z'))
Out[11]: Timestamp('2013-11-03 01:30:00+0500', tz='tzoffset(None, 18000)')
# This is the behavior that inspired the bug report.  I don't know if this is a valid parse
# or not, but it sure is unexpected.

It would be awfully embarrassing if I filed a bug report because I couldn't read ISO 8601...

However, in the process of investigating this issue, I encountered the following, which is definitely a bug:

In [12]: ts = ['2013-11-%s3 %s1:30:00' % (x, y) for x in ['', '0'] for y in ['', '0']]

In [13]: ts
Out[13]: 
['2013-11-3 1:30:00',
 '2013-11-3 01:30:00',
 '2013-11-03 1:30:00',
 '2013-11-03 01:30:00']   # only this one is ISO 8601

In [14]: tzs = ['America/%s' % s for s in ['Chicago', 'New_York', 'Havana']]

In [15]: [[pd.Timestamp(t, tz=tz) for t in ts] for tz in tzs]
Out[15]: 
[[Timestamp('2013-11-03 01:30:00-0600', tz='America/Chicago'),
  Timestamp('2013-11-03 01:30:00-0600', tz='America/Chicago'),
  Timestamp('2013-11-03 01:30:00-0600', tz='America/Chicago'),
  Timestamp('2013-11-03 01:30:00-0500', tz='America/Chicago')],   # DST
 [Timestamp('2013-11-03 01:30:00-0500', tz='America/New_York'),
  Timestamp('2013-11-03 01:30:00-0500', tz='America/New_York'),
  Timestamp('2013-11-03 01:30:00-0500', tz='America/New_York'),
  Timestamp('2013-11-03 01:30:00-0400', tz='America/New_York')],   # DST
 [Timestamp('2013-11-03 01:30:00-0500', tz='America/Havana'),
  Timestamp('2013-11-03 01:30:00-0500', tz='America/Havana'),
  Timestamp('2013-11-03 01:30:00-0500', tz='America/Havana'),
  Timestamp('2013-11-03 00:30:00-0500', tz='America/Havana')]]   # Just plain wrong

I'm not sure what's going on here.

@jreback
Copy link
Contributor

jreback commented Sep 9, 2014

cc @sinhrks

cc @rockg

@jreback jreback added the Timezones Timezone data dtype label Sep 9, 2014
@ischwabacher
Copy link
Contributor Author

Looking at this more closely, something is very wrong:

In [1]: import pandas as pd

In [2]: pd.Timestamp('2013-11-03 01:30:00', tz='America/Havana')
Out[2]: Timestamp('2013-11-03 00:30:00-0500', tz='America/Havana')  # BAD!

In [3]: pd.Timestamp('2013-11-03 1:30:00', tz='America/Havana')
Out[3]: Timestamp('2013-11-03 01:30:00-0500', tz='America/Havana')

@jreback jreback added this to the 0.15.0 milestone Sep 9, 2014
@rockg
Copy link
Contributor

rockg commented Sep 10, 2014

I can't reproduce this. I'm not on the bleeding edge, but not too many commits far behind.

Timestamp('2013-11-03 1:30:00', tz='America/Havana')
Out[1]: Timestamp('2013-11-03 01:30:00-0500', tz='America/Havana')
Timestamp('2013-11-03 01:30:00', tz='America/Havana')
Out[1]: Timestamp('2013-11-03 01:30:00-0500', tz='America/Havana')

Also, I think the string format up top is wrong. If you add spaces and a missing % you get it back correctly.

>>> pd.Timestamp(t.strftime('%Y%m%d %H%M%S %Z %z'))
Timestamp('2013-11-03 01:30:00-0500')

@rockg
Copy link
Contributor

rockg commented Sep 10, 2014

I see now that the issue is the short format even after adding the %. Let me take a look.

@rockg
Copy link
Contributor

rockg commented Sep 10, 2014

So this works without the Z. Aren't Z and %z contradictory in that %z implies an offset but Z doesn't imply one?

>>> pd.Timestamp(t.strftime('%Y%m%dT%H%M%S%z'))
Timestamp('2013-11-03 01:30:00-0500')

@ischwabacher
Copy link
Contributor Author

You're right about the string format; I misread ISO 8601.

The second issue persists for me, and I think it has to do with tzlocal; am I correct in assuming that you are not in the US/Central time zone? Pick a DST boundary in your local time zone and try the equivalent test there.

(EDIT: America/Havana uses "C[DS]T" for its abbreviations. So if your system time zone is US/Central, it's a good canary for telling when the machine is trying to parse the time zone abbreviation. Which it should never do, for exactly this reason.)

@rockg
Copy link
Contributor

rockg commented Sep 10, 2014

US/Eastern is my tz and that works fine as well and the same with tzlocal (I assume you were taking this from dateutil).

@ischwabacher ischwabacher changed the title Timestamp constructor parses ISO 8601 short form tz offset with wrong sign Timestamp constructor parses ISO 8601 incorrectly near DST boundaries Sep 10, 2014
@ischwabacher
Copy link
Contributor Author

Huh. I'm observing this issue on a build from source fetched about 36 hours ago.

I edited the OP. If you try the test in the second box, what do you observe?

@rockg
Copy link
Contributor

rockg commented Sep 10, 2014

Well this is fun. My versions are a few weeks old. I can check later tonight whether the behavior persists when I update.

[[Timestamp(t, tz=tz) for t in ts] for tz in tzs]
Out[18]: 
[[Timestamp('2013-11-03 01:30:00-0600', tz='America/Chicago'),
  Timestamp('2013-11-03 01:30:00-0600', tz='America/Chicago'),
  Timestamp('2013-11-03 01:30:00-0600', tz='America/Chicago'),
  Timestamp('2013-11-03 01:30:00-0600', tz='America/Chicago')],
 [Timestamp('2013-11-03 01:30:00-0500', tz='America/New_York'),
  Timestamp('2013-11-03 01:30:00-0500', tz='America/New_York'),
  Timestamp('2013-11-03 01:30:00-0500', tz='America/New_York'),
  Timestamp('2013-11-03 01:30:00-0500', tz='America/New_York')],
 [Timestamp('2013-11-03 01:30:00-0500', tz='America/Havana'),
  Timestamp('2013-11-03 01:30:00-0500', tz='America/Havana'),
  Timestamp('2013-11-03 01:30:00-0500', tz='America/Havana'),
  Timestamp('2013-11-03 01:30:00-0500', tz='America/Havana')]]

Further, I agree that an error should be thrown when you pass in a non-ISO8601 format. Clearly it thought it could parse it when in fact it shouldn't be able to.

@rockg
Copy link
Contributor

rockg commented Sep 11, 2014

Okay, I can reproduce when I update my code. So something was introduced between 14.1 and the latest commit to cause a different behavior. What is the right behavior here? I agree that it should be consistent but I'm not sure what the right behavior is. I suppose a consistent offset across the cases would be the right thing here as there is nothing to differentiate those times.

[Timestamp('2013-11-03 01:30:00-0600', tz='America/Chicago'), 
 Timestamp('2013-11-03 01:30:00-0600', tz='America/Chicago'), 
 Timestamp('2013-11-03 01:30:00-0600', tz='America/Chicago'), 
Timestamp('2013-11-03 01:30:00-0500', tz='America/Chicago')]
[Timestamp('2013-11-03 01:30:00-0500', tz='America/New_York'), 
 Timestamp('2013-11-03 01:30:00-0500', tz='America/New_York'), 
 Timestamp('2013-11-03 01:30:00-0500', tz='America/New_York'), 
 Timestamp('2013-11-03 01:30:00-0400', tz='America/New_York')]
[Timestamp('2013-11-03 01:30:00-0500', tz='America/Havana'), 
 Timestamp('2013-11-03 01:30:00-0500', tz='America/Havana'), 
 Timestamp('2013-11-03 01:30:00-0500', tz='America/Havana'), 
 Timestamp('2013-11-03 00:30:00-0500', tz='America/Havana')]

@ischwabacher
Copy link
Contributor Author

I would support an ambiguous kwarg to the Timestamp constructor, à la #7963, with the same default. I think the docs should gently encourage the format of Timestamp.__repr__ as a best practice, though.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2014

@ischwabacher @rockg either you guys have a chance to look at this?

@ischwabacher
Copy link
Contributor Author

I tried a little while ago and was stymied by not being able to get cygdb to run. I'll give it another shot tonight.

@rockg
Copy link
Contributor

rockg commented Sep 23, 2014

@ischwabacher It may be worth running a git-bisect as something changed between 14.1 where the issue doesn't exist and master where it does. That would help narrow down the problem area.

@ischwabacher
Copy link
Contributor Author

.< That's so much less awful than what I did.

@ischwabacher
Copy link
Contributor Author

TIL why you squash.

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 29, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@adamgreenhall
Copy link
Contributor

this is still a problem, seems to alter hours between 2 and 6 on spring forward and hours between 2 and 5 on fall back:

In [1]: import pandas as pd
In [2]: print pd.__version__
0.16.0
In [3]: for hr in range(8):
    print hr
    for date in ['2015-01-01', '2015-03-08', '2015-11-01']:
        print pd.Timestamp('{} 0{}:00:00'.format(date, hr), tz='US/Eastern')

0
2015-01-01 00:00:00-05:00
2015-03-08 00:00:00-05:00
2015-11-01 00:00:00-04:00
1
2015-01-01 01:00:00-05:00
2015-03-08 01:00:00-05:00
2015-11-01 01:00:00-04:00
2
2015-01-01 02:00:00-05:00
2015-03-08 03:00:00-04:00
2015-11-01 01:00:00-05:00
3
2015-01-01 03:00:00-05:00
2015-03-08 04:00:00-04:00
2015-11-01 02:00:00-05:00
4
2015-01-01 04:00:00-05:00
2015-03-08 05:00:00-04:00
2015-11-01 03:00:00-05:00
5
2015-01-01 05:00:00-05:00
2015-03-08 06:00:00-04:00
2015-11-01 04:00:00-05:00
6
2015-01-01 06:00:00-05:00
2015-03-08 07:00:00-04:00
2015-11-01 06:00:00-05:00
7
2015-01-01 07:00:00-05:00
2015-03-08 07:00:00-04:00
2015-11-01 07:00:00-05:00

@ischwabacher
Copy link
Contributor Author

Oh, I totally forgot about this. I did bisect it and found the commit with the bug, but I never found the bug itself. At the very least I will post the SHA-1 when I have a moment.

@ischwabacher
Copy link
Contributor Author

...and I forgot to do this. The problem is in 6732306, presumably somewhere in parse_iso_8601_datetime, which is a horrible function. Why on earth would you handle "now" and "today" in your ISO8601 parsing code? Whyyy?

@jreback
Copy link
Contributor

jreback commented Oct 30, 2015

closing as is dupe of #11481 (simpler discussion :)

@jreback
Copy link
Contributor

jreback commented Oct 30, 2015

@ischwabacher yeh, the parsing should actually be before, though this DOES make it much faster I suspect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Timezones Timezone data dtype
Projects
None yet
Development

No branches or pull requests

4 participants