ENH: Parse %z and %Z directive in format for to_datetime #19979

mroeschke · 2018-03-03T04:49:33Z

closes ENH: bad directive in to_datetime format - this uses std. strptime zone offset #13486
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

The implementation is nearly identical to https://github.com/python/cpython/blob/master/Lib/_strptime.py and currently works as datetime.strptime would:

In [3]: f = '%Y-%m-%d %H:%M:%S %Z %z'

# datetime will parse the case below even though UTC with +0100 should be bogus
In [4]: d = ['2010-01-01 12:00:00 UTC +0100'] * 10

In [5]: pd._libs.tslibs.strptime.array_strptime(np.array(d, dtype='object'), f)
Out[5]:
(array(['2010-01-01T12:00:00.000000000', '2010-01-01T12:00:00.000000000',
        '2010-01-01T12:00:00.000000000', '2010-01-01T12:00:00.000000000',
        '2010-01-01T12:00:00.000000000', '2010-01-01T12:00:00.000000000',
        '2010-01-01T12:00:00.000000000', '2010-01-01T12:00:00.000000000',
        '2010-01-01T12:00:00.000000000', '2010-01-01T12:00:00.000000000'],
       dtype='datetime64[ns]'),
 array([datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC'),
        datetime.timezone(datetime.timedelta(0, 3600), 'UTC')],
       dtype=object))

In [29]: datetime.strptime('2010-01-01 12:00:00 UTC +0100', '%Y-%m-%d %H:%M:%S %Z %z')
Out[29]: datetime.datetime(2010, 1, 1, 12, 0, tzinfo=datetime.timezone(datetime.timedelta(0, 3600), 'UTC'))

Currently, an offset (%z) needs to be passed in order for the tzname (%Z) to be used.
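The standard library shows the same dependency between the two directives. The sketch below uses only `datetime.strptime` (it does not touch the pandas code path in this PR): with `%Z` alone the name is matched but the result stays naive, while adding `%z` yields a `datetime.timezone` that carries both the offset and the name.

```python
from datetime import datetime, timedelta

# %Z on its own: the name is matched, but no tzinfo is attached.
naive = datetime.strptime('2010-01-01 12:00:00 UTC', '%Y-%m-%d %H:%M:%S %Z')

# %Z together with %z: the offset drives the tzinfo, the name is carried along.
aware = datetime.strptime('2010-01-01 12:00:00 UTC +0100',
                          '%Y-%m-%d %H:%M:%S %Z %z')

print(naive.tzinfo)                       # None
print(aware.utcoffset(), aware.tzname())  # 1:00:00 UTC
```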

I'd like to get feedback on what this function should return after parsing %z or %Z. It may be difficult to return a normal DatetimeIndex/Series/array given the following edge cases:

User passes strings with different tz offsets ([date] +0100, [date] -0600, [date] +1530)
User passes strings with different tz names ([date] UTC, [date] EST, [date] CET)
User passes strings with incompatible tz name and offset (see example above)
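To make the first edge case concrete, here is a small sketch using the standard library only (not the pandas implementation). Each string parses fine on its own, but the results carry three different UTC offsets, so no single fixed-offset timezone describes the whole collection.

```python
from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S %z'
strings = [
    '2010-01-01 12:00:00 +0100',
    '2010-01-01 12:00:00 -0600',
    '2010-01-01 12:00:00 +1530',
]

# Every element gets its own tzinfo built from its own offset.
parsed = [datetime.strptime(s, fmt) for s in strings]
distinct_offsets = {dt.utcoffset() for dt in parsed}

print(len(distinct_offsets))  # 3
```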

I suppose the most agnostic thing to do is to return an array of Timestamps?

In [27]: [pd.Timestamp(val, tzinfo=tz) for val, tz in zip(*pd._libs.tslibs.strptime.array_strptime(np.array(d, dtype='object'), f))]
Out[27]:
[Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC'),
 Timestamp('2010-01-01 13:00:00+0100', tz='UTC')]

jreback

need tests

jreback · 2018-03-05T11:44:47Z

pandas/core/tools/datetimes.py

@@ -344,7 +344,7 @@ def _convert_listlike(arg, box, format, name=None, tz=tz):
                if result is None:
                    try:
                        result = array_strptime(arg, format, exact=exact,
-                                                errors=errors)
+                                                errors=errors)[0]


why was this changed?

jreback · 2018-03-05T11:45:29Z

pandas/_libs/tslibs/strptime.pyx

+                        z = z[:3] + z[4:]
+                        if len(z) > 5:
+                            if z[5] != ':':
+                                msg = "Unconsistent use of : in {0}"


Inconsistent

codecov · 2018-03-09T06:49:26Z

Codecov Report

Merging #19979 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #19979      +/-   ##
==========================================
+ Coverage   91.84%   91.84%   +<.01%     
==========================================
  Files         153      153              
  Lines       49506    49516      +10     
==========================================
+ Hits        45467    45477      +10     
  Misses       4039     4039

Flag	Coverage Δ
#multiple	`90.24% <100%> (ø)`	⬆️
#single	`41.87% <23.07%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/tools/datetimes.py	`84.98% <100%> (+0.54%)`	⬆️
pandas/core/arrays/categorical.py	`95.67% <0%> (-0.01%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 36c1f6b...757458d. Read the comment docs.

mroeschke · 2018-03-09T07:04:48Z

Summary of the logic thus far (open to feedback):

array_strptime will now return 3 separate arrays: the datetime64 values, the tznames, and the tzoffsets.

If only the tzname directive (%Z) is passed, dispatch to DatetimeIndex's tz arg, or to Timestamp's tz arg if multiple different tznames are parsed.
If only the tzoffset directive (%z) is passed, create a pytz.FixedOffset and dispatch to DatetimeIndex's tz arg, or to Timestamp's tz arg if multiple different tzoffsets are parsed.
If both the tzname and tzoffset directives are passed, create a datetime.timezone object (similar to the cpython implementation) and just dispatch to Timestamp. One question here is whether we should invalidate bogus timezones (e.g. UTC +0600, PST -1500, etc.)
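As a rough illustration of the third case and the open question about bogus timezones, the sketch below builds a `datetime.timezone` from a parsed name and offset and optionally rejects mismatches. The helper `combine`, its `validate` flag, and the `KNOWN_OFFSETS` table are hypothetical names invented for this example; they are not part of this PR or of pandas.

```python
from datetime import timedelta, timezone

# Hypothetical lookup, for illustration only. A real check would need a
# proper timezone database rather than a hard-coded table.
KNOWN_OFFSETS = {'UTC': timedelta(0), 'GMT': timedelta(0)}


def combine(tzname, gmtoff_seconds, validate=False):
    """Build a tzinfo from a parsed name and offset (hypothetical helper)."""
    offset = timedelta(seconds=gmtoff_seconds)
    if validate and tzname in KNOWN_OFFSETS and KNOWN_OFFSETS[tzname] != offset:
        raise ValueError(
            "offset {} does not match timezone name {!r}".format(offset, tzname))
    return timezone(offset, tzname)


# Without validation this mirrors datetime.strptime: 'UTC' with +0100 is kept.
tz = combine('UTC', 3600)
print(tz.utcoffset(None), tz.tzname(None))  # 1:00:00 UTC
```

With `validate=True` the same call raises `ValueError`, which is one possible answer to the question above; whether that strictness is desirable is a design choice left open here.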

mroeschke · 2018-03-09T07:13:27Z

I should be able to transfer this logic into cython on the next iteration.

jreback · 2018-03-13T10:19:05Z

pandas/_libs/tslibs/strptime.pyx

+                if z == 'Z':
+                    gmtoff = 0
+                else:
+                    if z[3] == ':':


need a check on the len upfront here

jreback · 2018-03-13T10:19:25Z

pandas/_libs/tslibs/strptime.pyx

+                    seconds = int(z[5:7] or 0)
+                    gmtoff = (hours * 60 * 60) + (minutes * 60) + seconds
+                    gmtoff_remainder = z[8:]
+                    # Pad to always return microseconds.


jreback · 2018-03-13T10:20:37Z

pandas/_libs/tslibs/strptime.pyx

@@ -281,6 +290,32 @@ def array_strptime(ndarray[object] values, object fmt,
                        else:
                            tz = value
                            break
+            elif parse_code == 19:


can you move this whole parse to a function and just call it here (and return the values as a tuple)

jreback · 2018-03-13T10:21:25Z

pandas/core/tools/datetimes.py

@@ -343,8 +346,74 @@ def _convert_listlike(arg, box, format, name=None, tz=tz):
                # fallback
                if result is None:
                    try:
-                        result = array_strptime(arg, format, exact=exact,
-                                                errors=errors)
+                        parsing_tzname = '%Z' in format


woa, what do you need all this for???

jreback · 2018-03-13T10:21:41Z

pandas/tests/indexes/datetimes/test_tools.py

@@ -183,6 +183,63 @@ def test_to_datetime_format_weeks(self, cache):
        for s, format, dt in data:
            assert to_datetime(s, format=format, cache=cache) == dt

+    @pytest.mark.skipif(not PY3,
+                        reason="datetime.timezone not supported in PY2")
+    def test_to_datetime_parse_timezone(self):


import at the top

jreback · 2018-03-13T10:22:54Z

pandas/tests/indexes/datetimes/test_tools.py

+        tm.assert_numpy_array_equal(result, expected)
+
+        # %z and %Z parsing
+        dates = ['2010-01-01 12:00:00 UTC +0100'] * 2


need more checking for invalid input (partially formed, such as +0, +1foo, UTCbar)

jreback · 2018-03-13T10:23:15Z

pandas/tests/indexes/datetimes/test_tools.py

+        tm.assert_index_equal(result, expected)
+
+        result = pd.to_datetime(dates, format=fmt, box=False)
+        expected = np.array(expected_dates, dtype=object)


use assert_index_equal always

mroeschke · 2018-03-15T20:22:00Z

@jreback I was able to simplify a lot of my logic (I underestimated how Index would get cast to DatetimeIndex when passed Timestamps with the same tz).

I created separate functions to parse the timezone directive and then box the result.

jreback · 2018-03-16T10:20:18Z

pandas/_libs/tslibs/strptime.pyx

@@ -632,3 +655,48 @@ cdef _calc_julian_from_U_or_W(int year, int week_of_year,
    else:
        days_to_week = week_0_length + (7 * (week_of_year - 1))
        return 1 + days_to_week + day_of_week
+
+cdef _parse_timezone_directive(object z):


can be de-privatized (no leading _); these modules are all private.

jreback · 2018-03-16T10:20:50Z

pandas/_libs/tslibs/strptime.pyx

+
+    if z == 'Z':
+        gmtoff = 0
+        gmtoff_fraction = 0


would just directly return for this case (0, 0)

then don't need the else

jreback · 2018-03-16T10:21:25Z

pandas/_libs/tslibs/strptime.pyx

+        gmtoff = 0
+        gmtoff_fraction = 0
+    else:
+        if z[3] == ':':


you might need to wrap this entire block in a try/except if the string is not long enough (or check lengths for each sub-section) and then raise the appropriate error
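One way to get the up-front validation requested here is to match the whole offset string against a pattern before slicing it. The sketch below is a simplified stand-in, not the code in this PR: `parse_offset` is a hypothetical name, and it only handles `Z`, `±HHMM`, and `±HH:MM` (no seconds or fractional seconds).

```python
import re
from datetime import timedelta, timezone

# Sign, two-digit hours, optional colon, two-digit minutes; nothing else.
_OFFSET_RE = re.compile(r'(?P<sign>[+-])(?P<hours>\d{2}):?(?P<minutes>\d{2})')


def parse_offset(z):
    """Return a tzinfo for a %z-style string (hypothetical helper)."""
    if z == 'Z':
        # Early return for the UTC designator, so no else branch is needed.
        return timezone.utc
    match = _OFFSET_RE.fullmatch(z)
    if match is None:
        raise ValueError("Could not parse timezone offset {!r}".format(z))
    offset = timedelta(hours=int(match.group('hours')),
                       minutes=int(match.group('minutes')))
    if match.group('sign') == '-':
        offset = -offset
    return timezone(offset)


print(parse_offset('+01:00').utcoffset(None))  # 1:00:00
```

Because the pattern must match the full string, short or malformed inputs fail with a `ValueError` instead of an `IndexError` from slicing.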

jreback · 2018-03-16T10:22:05Z

pandas/core/tools/datetimes.py

+            ts = tslib.Timestamp(res)
+            ts = ts.tz_localize(tzoffset)
+            tz_results.append(ts)
+        tz_results = np.array(tz_results)


what do you need all of this for, this is jumping thru a lot of hoops here

This elif branch is:

Creating a pytz.FixedOffset from the parsed offset

Creating a naive Timestamp, then localizing it to the pytz.FixedOffset (can't do it directly like Timestamp(res, tz=pytz.FixedOffset(...)) because of my realization from DOC: Clarify passing epoch timestamp to Timestamp with timezone. #20257)

jreback · 2018-03-16T10:22:22Z

pandas/tests/indexes/datetimes/test_tools.py

@@ -25,6 +25,12 @@
 from pandas import (isna, to_datetime, Timestamp, Series, DataFrame,
                    Index, DatetimeIndex, NaT, date_range, compat)

+if PY3:
+    from datetime import timezone


I think we have a fixture for this

We have a fixture for timezone.utc, but I am testing parsing a custom timezone.

jreback · 2018-03-16T10:22:33Z

pandas/tests/indexes/datetimes/test_tools.py

+    def test_to_datetime_parse_tzname_or_tzoffset(self, box, const,
+                                                  assert_equal, fmt,
+                                                  dates, expected_dates):
+        # %z or %Z parsing


add the issue number

jreback · 2018-03-16T10:22:42Z

pandas/tests/indexes/datetimes/test_tools.py

+                                                   assert_equal, dates,
+                                                   expected_dates):
+        # %z and %Z parsing
+        fmt = '%Y-%m-%d %H:%M:%S %Z %z'


return timedeltas as list return timedeltas in a numpy array some flake fixes Extend logic of parsing timezones address comment misspelling Add additional tests address timezone localization

mroeschke · 2018-05-23T04:43:03Z

I've changed a couple of things after your review @jreback

Either %z or %Z can be parsed (not together)
%z will return a pytz.FixedOffset

In [3]: pd.to_datetime(['2010-01-01 12:00:00 +0100'], format='%Y-%m-%d %H:%M:%S %z')
Out[3]: DatetimeIndex(['2010-01-01 12:00:00+01:00'], dtype='datetime64[ns, pytz.FixedOffset(60)]', freq=None)

%Z can parse any timezone specified by pytz.all_timezones

In [2]: pd.to_datetime(['2010-01-01 12:00:00 US/Pacific'], format='%Y-%m-%d %H:%M:%S %Z')
Out[2]: DatetimeIndex(['2010-01-01 12:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)

jreback · 2018-05-23T10:59:09Z

pandas/_libs/tslibs/strptime.pyx

        dict found_key
        bint is_raise = errors=='raise'
        bint is_ignore = errors=='ignore'
        bint is_coerce = errors=='coerce'
        int ordinal
+        dict _parse_code_table = {'y': 0,


you could make this a module level variable

jreback · 2018-05-23T11:00:46Z

pandas/core/tools/datetimes.py

+                            raise ValueError("Cannot pass a tz argument when "
+                                             "parsing strings with timezone "
+                                             "information.")
+                        result, timezones = array_strptime(


I would much rather do the error handling in the _return_parsed_timezone_results. This block is just very complicated and hard to grok

jreback · 2018-05-24T01:09:08Z

lgtm @mroeschke

merge on green.

jorisvandenbossche · 2018-05-24T14:17:35Z

New features are for 0.24.0, @mroeschke can you move the whatsnew?

jorisvandenbossche

Nice addition!
Added some comments

jorisvandenbossche · 2018-05-24T14:38:32Z

pandas/tests/indexes/datetimes/test_tools.py

+         [pd.Timestamp('2010-01-01 12:00:00', tz='UTC'),
+          pd.Timestamp('2010-01-01 12:00:00', tz='GMT'),
+          pd.Timestamp('2010-01-01 12:00:00', tz='US/Pacific')]],
+        ['%Y-%m-%d %H:%M:%S %z',


Can you add one of them, e.g. this one, without the space before the tz?

jorisvandenbossche · 2018-05-24T14:40:35Z

pandas/tests/indexes/datetimes/test_tools.py

+        ['%Y-%m-%d %H:%M:%S %z',
+         ['2010-01-01 12:00:00 Z', '2010-01-01 12:00:00 Z'],
+         [pd.Timestamp('2010-01-01 12:00:00',
+                       tzinfo=pytz.FixedOffset(0)),


Should this be UTC or a fixed offset of 0 ?

pytz coerces a fixed offset of 0 to UTC

In [2]: pytz.FixedOffset(0)
Out[2]: <UTC>

But making it explicit here that %z should return pytz.FixedOffset(0)

So the actual DatetimeIndex you get here has UTC timezone? OK, that's good! (but maybe add a small comment since I would not expect that)

jorisvandenbossche · 2018-05-24T14:44:56Z

pandas/tests/indexes/datetimes/test_tools.py

+          pd.Timestamp('2010-01-01 12:00:00',
+                       tzinfo=pytz.FixedOffset(-60))]],
+        ['%Y-%m-%d %H:%M:%S %z',
+         ['2010-01-01 12:00:00 Z', '2010-01-01 12:00:00 Z'],


Should this also work with %Z?
It seems that with datetime.datetime.strptime it does not work with either

The regex I pulled from https://github.com/python/cpython/blob/master/Lib/_strptime.py has an option for 'Z' with %z:

https://github.com/python/cpython/blob/483000e164ec68717d335767b223ae31b4b720cf/Lib/_strptime.py#L204

But %Z only makes timezones found in the system local time available, i.e. no 'Z' option.

https://github.com/python/cpython/blob/483000e164ec68717d335767b223ae31b4b720cf/Lib/_strptime.py#L210-L212

OK (that's probably a newer addition to python), then it makes sense to follow upstream python to be consistent
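The upstream behaviour discussed here can be checked with the standard library alone. This sketch assumes Python 3.7 or newer, where `%z` accepts a literal `Z`; the helper `parses` is a hypothetical name used only for this example. Whether `%Z` accepts `Z` depends on the timezone names known on the machine, so that result is printed without a stated expectation.

```python
from datetime import datetime, timedelta


def parses(value, fmt):
    """Return True if datetime.strptime accepts value under fmt."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


# 'Z' under %z is treated as a zero offset (assumes Python 3.7+).
z_as_offset = datetime.strptime('2010-01-01 12:00:00 Z',
                                '%Y-%m-%d %H:%M:%S %z')
print(z_as_offset.utcoffset())  # 0:00:00

# %Z only knows UTC, GMT, and the local zone names, so this depends on
# the machine's settings.
print(parses('2010-01-01 12:00:00 Z', '%Y-%m-%d %H:%M:%S %Z'))
```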

jorisvandenbossche · 2018-05-24T14:46:38Z

pandas/tests/indexes/datetimes/test_tools.py

+            pd.to_datetime(dates, format=fmt, box=box, utc=True)
+
+    @pytest.mark.parametrize('offset', [
+        '+0', '-1foo', 'UTCbar', ':10', '+01:000:01'])


Can you add an empty string here as well?

jreback · 2018-05-29T00:29:16Z

thanks @mroeschke nice patch!

betcha didn't think it would be this long when you first put it up! hahah. tests and code look great!

mroeschke · 2018-05-29T05:00:19Z

ha no problem, thanks!

jreback requested changes Mar 5, 2018

mroeschke force-pushed the strftime_timezone branch from 9f273c4 to 7592ed8 Compare March 9, 2018 06:49

mroeschke force-pushed the strftime_timezone branch 2 times, most recently from a6c61d8 to 94641dc Compare March 13, 2018 04:00

jreback added Enhancement Datetime Datetime data dtype labels Mar 13, 2018

jreback requested changes Mar 13, 2018

mroeschke force-pushed the strftime_timezone branch from 94641dc to 9f38bda Compare March 15, 2018 20:13

jreback requested changes Mar 16, 2018

noemielteto and others added 7 commits March 18, 2018 16:40

DOC: update the Index.isin docstring (pandas-dev#20249)

4a43815

ENH: Parse %z directive in format for to_datetime

cb47c08

return timedeltas as list return timedeltas in a numpy array some flake fixes Extend logic of parsing timezones address comment misspelling Add additional tests address timezone localization

move parsing to a sub function, add additional test

f299aec

Address comments

259ec8f

timezone compat

77af4db

add empty line for strptime.pyx

54c2491

add issue number and try except

0e2a0cd

mroeschke force-pushed the strftime_timezone branch from 39d1ba4 to 0e2a0cd Compare March 19, 2018 01:34

TomAugspurger force-pushed the master branch from ee45e05 to 7273ea0 Compare March 19, 2018 12:59

mroeschke added 3 commits March 28, 2018 21:45

Merge remote-tracking branch 'upstream/master' into strftime_timezone

d31e141

Merge remote-tracking branch 'upstream/master' into strftime_timezone

7bdbdf4

add whatsnew

3e3d5c6

mroeschke changed the title ~~[WIP] ENH: Parse %z and %Z directive in format for to_datetime~~ ENH: Parse %z and %Z directive in format for to_datetime Mar 30, 2018

mroeschke added 4 commits March 30, 2018 21:45

Merge remote-tracking branch 'upstream/master' into strftime_timezone

c16ef8c

remove weird pd file

6f0b7f0

Merge remote-tracking branch 'upstream/master' into strftime_timezone

0525823

Remove blank line

4c22808

mroeschke added 3 commits May 21, 2018 22:59

Add additional unbalanced colon test

9a2ea19

Merge remote-tracking branch 'upstream/master' into strftime_timezone

924859e

allow parsing of any pytz

a1599a0

jreback reviewed May 23, 2018

jreback requested changes May 23, 2018

mroeschke added 4 commits May 23, 2018 09:08

Merge remote-tracking branch 'upstream/master' into strftime_timezone

6c80c2e

move error handling

abccc3e

Lint

473a0f4

Small cleanup

ab0a692

jreback added this to the 0.23.1 milestone May 24, 2018

jreback approved these changes May 24, 2018

Lint

56fc683

jorisvandenbossche modified the milestones: 0.23.1, 0.24.0 May 24, 2018

jorisvandenbossche reviewed May 24, 2018

mroeschke added 6 commits May 24, 2018 22:10

Merge remote-tracking branch 'upstream/master' into strftime_timezone

85bd45e

Add additional test and move whatsnew to v0.24

eb2a661

Merge remote-tracking branch 'upstream/master' into strftime_timezone

5500ca8

Merge remote-tracking branch 'upstream/master' into strftime_timezone

0e0d0fd

Merge remote-tracking branch 'upstream/master' into strftime_timezone

34f638c

Add comment that FixedOffset(0) is UTC

757458d

jreback merged commit 7b1f9bf into pandas-dev:master May 29, 2018

mroeschke deleted the strftime_timezone branch May 29, 2018 04:50

david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018

ENH: Parse %z and %Z directive in format for to_datetime (pandas-dev#…

49ae4ea

…19979)

josham mentioned this pull request Jan 30, 2019

Timestamp.strptime %z not supported #25016

Closed
