-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Parse %z and %Z directive in format for to_datetime #19979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need tests
pandas/core/tools/datetimes.py
Outdated
@@ -344,7 +344,7 @@ def _convert_listlike(arg, box, format, name=None, tz=tz): | |||
if result is None: | |||
try: | |||
result = array_strptime(arg, format, exact=exact, | |||
errors=errors) | |||
errors=errors)[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why was this changed?
pandas/_libs/tslibs/strptime.pyx
Outdated
z = z[:3] + z[4:] | ||
if len(z) > 5: | ||
if z[5] != ':': | ||
msg = "Unconsistent use of : in {0}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent
9f273c4
to
7592ed8
Compare
Codecov Report
@@ Coverage Diff @@
## master #19979 +/- ##
==========================================
+ Coverage 91.84% 91.84% +<.01%
==========================================
Files 153 153
Lines 49506 49516 +10
==========================================
+ Hits 45467 45477 +10
Misses 4039 4039
Continue to review full report at Codecov.
|
Summary of the logic thus far and open to feedback:
|
I should be able to transfer this logic into cython on the next iteration. |
a6c61d8
to
94641dc
Compare
pandas/_libs/tslibs/strptime.pyx
Outdated
if z == 'Z': | ||
gmtoff = 0 | ||
else: | ||
if z[3] == ':': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a check on the len upfront here
pandas/_libs/tslibs/strptime.pyx
Outdated
seconds = int(z[5:7] or 0) | ||
gmtoff = (hours * 60 * 60) + (minutes * 60) + seconds | ||
gmtoff_remainder = z[8:] | ||
# Pad to always return microseconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line
pandas/_libs/tslibs/strptime.pyx
Outdated
@@ -281,6 +290,32 @@ def array_strptime(ndarray[object] values, object fmt, | |||
else: | |||
tz = value | |||
break | |||
elif parse_code == 19: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move this whole parse to a function and just all it here (and return the values as a tuple)
pandas/core/tools/datetimes.py
Outdated
@@ -343,8 +346,74 @@ def _convert_listlike(arg, box, format, name=None, tz=tz): | |||
# fallback | |||
if result is None: | |||
try: | |||
result = array_strptime(arg, format, exact=exact, | |||
errors=errors) | |||
parsing_tzname = '%Z' in format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woa, what do you need all this for???
@@ -183,6 +183,63 @@ def test_to_datetime_format_weeks(self, cache): | |||
for s, format, dt in data: | |||
assert to_datetime(s, format=format, cache=cache) == dt | |||
|
|||
@pytest.mark.skipif(not PY3, | |||
reason="datetime.timezone not supported in PY2") | |||
def test_to_datetime_parse_timezone(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import at the top
tm.assert_numpy_array_equal(result, expected) | ||
|
||
# %z and %Z parsing | ||
dates = ['2010-01-01 12:00:00 UTC +0100'] * 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need more checking for invalid (partially formed such as +0, +1foo, UTCbar
tm.assert_index_equal(result, expected) | ||
|
||
result = pd.to_datetime(dates, format=fmt, box=False) | ||
expected = np.array(expected_dates, dtype=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use assert_index_equal always
94641dc
to
9f38bda
Compare
@jreback I was able to simplify a lot of my logic ( I underestimated how I created separate functions to parse the timezone directive and then box the result. |
pandas/_libs/tslibs/strptime.pyx
Outdated
@@ -632,3 +655,48 @@ cdef _calc_julian_from_U_or_W(int year, int week_of_year, | |||
else: | |||
days_to_week = week_0_length + (7 * (week_of_year - 1)) | |||
return 1 + days_to_week + day_of_week | |||
|
|||
cdef _parse_timezone_directive(object z): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be de-privatized (no leading _); these modules are all private.
pandas/_libs/tslibs/strptime.pyx
Outdated
|
||
if z == 'Z': | ||
gmtoff = 0 | ||
gmtoff_fraction = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would just directly return for this case (0, 0)
then don't need the else
pandas/_libs/tslibs/strptime.pyx
Outdated
gmtoff = 0 | ||
gmtoff_fraction = 0 | ||
else: | ||
if z[3] == ':': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might need to wrap this entire block in a try/except if the string is not long enough (or check lengths for each sub-section) and the raise the appropriate error
pandas/core/tools/datetimes.py
Outdated
ts = tslib.Timestamp(res) | ||
ts = ts.tz_localize(tzoffset) | ||
tz_results.append(ts) | ||
tz_results = np.array(tz_results) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you need all of this for, this is jumping thru a lot of hoops here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This elif
branch is:
- Creating a
pytz.FixedOffset
from the parsed offset - Creating a naive Timestamp, then localizing it to the
pytz.FixedOffset
(can't do it directly likeTimezone(res, tz=pytz.FixedOffset(...))
because of my realization from DOC: Clarify passing epoch timestamp to Timestamp with timezone. #20257)
@@ -25,6 +25,12 @@ | |||
from pandas import (isna, to_datetime, Timestamp, Series, DataFrame, | |||
Index, DatetimeIndex, NaT, date_range, compat) | |||
|
|||
if PY3: | |||
from datetime import timezone |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have a fixture for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a fixture for timezone.utc
, but I am testing parsing a custom timezone.
def test_to_datetime_parse_tzname_or_tzoffset(self, box, const, | ||
assert_equal, fmt, | ||
dates, expected_dates): | ||
# %z or %Z parsing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number
assert_equal, dates, | ||
expected_dates): | ||
# %z and %Z parsing | ||
fmt = '%Y-%m-%d %H:%M:%S %Z %z' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
return timedeltas as list return timedeltas in a numpy array some flake fixes Extend logic of parsing timezones address comment misspelling Add additional tests address timezone localization
39d1ba4
to
0e2a0cd
Compare
I've changed a couple of things after your review @jreback
|
pandas/_libs/tslibs/strptime.pyx
Outdated
dict found_key | ||
bint is_raise = errors=='raise' | ||
bint is_ignore = errors=='ignore' | ||
bint is_coerce = errors=='coerce' | ||
int ordinal | ||
dict _parse_code_table = {'y': 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could make this a module level variable
pandas/core/tools/datetimes.py
Outdated
raise ValueError("Cannot pass a tz argument when " | ||
"parsing strings with timezone " | ||
"information.") | ||
result, timezones = array_strptime( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would much rather do the error handling in the _return_parsed_timezone_results. This block is just very complicated and hard to grok
lgtm @mroeschke merge on green. |
New features are for 0.24.0, @mroeschke can you move the whatsnew? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice addition!
Added some comments
[pd.Timestamp('2010-01-01 12:00:00', tz='UTC'), | ||
pd.Timestamp('2010-01-01 12:00:00', tz='GMT'), | ||
pd.Timestamp('2010-01-01 12:00:00', tz='US/Pacific')]], | ||
['%Y-%m-%d %H:%M:%S %z', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you one of them, eg this one, without the space before the tz?
['%Y-%m-%d %H:%M:%S %z', | ||
['2010-01-01 12:00:00 Z', '2010-01-01 12:00:00 Z'], | ||
[pd.Timestamp('2010-01-01 12:00:00', | ||
tzinfo=pytz.FixedOffset(0)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be UTC or a fixed offset of 0 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pytz coerces a fixed offset of 0 to UTC
In [2]: pytz.FixedOffset(0)
Out[2]: <UTC>
But making it explicit here that %z should return pytz.FixedOffset(0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the actual DatetimeIndex you get here has UTC timezone? OK, that's good! (but maybe add a small comment since I would not expect that)
pd.Timestamp('2010-01-01 12:00:00', | ||
tzinfo=pytz.FixedOffset(-60))]], | ||
['%Y-%m-%d %H:%M:%S %z', | ||
['2010-01-01 12:00:00 Z', '2010-01-01 12:00:00 Z'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this also work with %Z
?
It seems that with datetime.datetime.strptime
it does not work with either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex I pulled from https://github.com/python/cpython/blob/master/Lib/_strptime.py has an option for 'Z' with %z
:
But %Z
only makes timezones found in the system local time available, i.e. no 'Z' option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK (that's probably a newer addition to python), then it makes sense to follow upstream python to be consistent
pd.to_datetime(dates, format=fmt, box=box, utc=True) | ||
|
||
@pytest.mark.parametrize('offset', [ | ||
'+0', '-1foo', 'UTCbar', ':10', '+01:000:01']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an empty string here as well?
thanks @mroeschke nice patch! betcha didn't think it would be this long when you first put it up! hahah. tests and code look great! |
ha no problem, thanks! |
git diff upstream/master -u -- "*.py" | flake8 --diff
The implimentiontion is near identical to https://github.com/python/cpython/blob/master/Lib/_strptime.py an currently works as
datetime.strptime
would:Currently, an offset needs to get passed (%z) in order for the tzname used be used (%Z).
I'd like to get feedback of what this function should return having parsed %z or %Z. It may be difficult to return a normal DatimeIndex/Series/array given the following edge cases:
[date] +0100, [date]. -0600, [date] +1530
)[date] UTC, [date]. EST, [date] CET
)I suppose the most agnostic thing to is to return an array of Timestamps?