Skip to content

ENH: to_datetime support iso week year (16607) #25541

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 18, 2019

Conversation

RjLi13
Copy link
Contributor

@RjLi13 RjLi13 commented Mar 5, 2019

I messed up the rebase of the old PR #24844 and started this again on a clean state. Reviewers please review on this PR!

From old PR

I found the issue stagnant, but since the fix was already in place, I manually took the code @rosygupta made and applied it onto latest master. Rebase wasn't an option that I found would work since the file in question has been split into multiple. Let me know what else needs to be updated.

@RjLi13
Copy link
Contributor Author

RjLi13 commented Mar 5, 2019

Just some things from the previous PR. Up to date with master. Moved to 0.25 with up to date description. Added docstrings to the function calc_julian.

Things needed to be addressed

  • Locale tests not passing on Linux on Python 3.x environments with a locale override
    • zh_CN.UTF-8
    • it_IT.UTF-8
    • I need some help figuring out where the issue lies. It seems to be triggering the wrong error message @gfyoung
  • @jreback on splitting the test_tools testcase? I didn't quite understand the conversation.
  • Other comments made by @jreback

@gfyoung @jreback @WillAyd

@codecov
Copy link

codecov bot commented Mar 5, 2019

Codecov Report

Merging #25541 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25541      +/-   ##
==========================================
- Coverage   91.75%   91.75%   -0.01%     
==========================================
  Files         173      173              
  Lines       52960    52960              
==========================================
- Hits        48596    48595       -1     
- Misses       4364     4365       +1
Flag Coverage Δ
#multiple 90.33% <ø> (ø) ⬆️
#single 41.71% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/util/testing.py 87.57% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 076b5a8...2758257. Read the comment docs.

@codecov
Copy link

codecov bot commented Mar 5, 2019

Codecov Report

Merging #25541 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25541      +/-   ##
==========================================
- Coverage   91.25%   91.25%   -0.01%     
==========================================
  Files         172      172              
  Lines       52977    52977              
==========================================
- Hits        48343    48342       -1     
- Misses       4634     4635       +1
Flag Coverage Δ
#multiple 89.82% <ø> (ø) ⬆️
#single 41.74% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/tools/datetimes.py 84.63% <ø> (ø) ⬆️
pandas/util/testing.py 88.98% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65c0441...eedf0e6. Read the comment docs.

@gfyoung
Copy link
Member

gfyoung commented Mar 5, 2019

@RjLi13 : With the branch rebased, are you now by any chance able to reproduce any of the test failures locally using the override methods that we discussed in the old PR?

@gfyoung gfyoung added Compat pandas objects compatability with Numpy or Python functions Datetime Datetime data dtype labels Mar 6, 2019
@RjLi13
Copy link
Contributor Author

RjLi13 commented Mar 6, 2019

@gfyoung Yes I get the same error as what the CI checks, using the zh_CN.UTF-8 locale override.

@gfyoung
Copy link
Member

gfyoung commented Mar 6, 2019

Okay, that's good. At least you can debug locally this time around.

@RjLi13
Copy link
Contributor Author

RjLi13 commented Mar 8, 2019

The code was not passing the CI checks for the locale builds because the regex could not match the correct string with the pattern generated from the passed format, and it threw a different error ending the code before it got to the correct error. See this code here:

found = format_regex.match(val)
if not found:
if is_coerce:
iresult[i] = NPY_NAT
continue
raise ValueError("time data %r does not match "
"format %r (match)" % (values[i], fmt))

My tests for failures were not expecting to have that error thrown (see test_tools.py changes)

As for why the regex could not match the pattern, my guess is that the locale could not detect English, given the string to search for was '1999 Monday' and the regex pattern for 'zh_CN.utf8' was re.compile('(?P<G>\\d\\d\\d\\d)\\s+(?P<A>星期一|星期二|星期三|星期四|星期五|星期六|星期日)', re.IGNORECASE).

@RjLi13
Copy link
Contributor Author

RjLi13 commented Mar 8, 2019

As for the fix, I'm not quite sure what to do. The code works correctly, independent of the locale, because the format wanted for iso_time is %G-%V-%u, which given this snippet

base.__init__({
# The " \d" part of the regex is to make %c from ANSI C work
'd': r"(?P<d>3[0-1]|[1-2]\d|0[1-9]|[1-9]| [1-9])",
'f': r"(?P<f>[0-9]{1,9})",
'G': r"(?P<G>\d\d\d\d)",
'H': r"(?P<H>2[0-3]|[0-1]\d|\d)",
'I': r"(?P<I>1[0-2]|0[1-9]|[1-9])",
'j': (r"(?P<j>36[0-6]|3[0-5]\d|[1-2]\d\d|0[1-9]\d|00[1-9]|"
r"[1-9]\d|0[1-9]|[1-9])"),
'm': r"(?P<m>1[0-2]|0[1-9]|[1-9])",
'M': r"(?P<M>[0-5]\d|\d)",
'S': r"(?P<S>6[0-1]|[0-5]\d|\d)",
'u': r"(?P<u>[1-7])",
'U': r"(?P<U>5[0-3]|[0-4]\d|\d)",
'V': r"(?P<V>5[0-3]|0[1-9]|[1-4]\d|\d)",
'w': r"(?P<w>[0-6])",
# W is set below by using 'U'
'y': r"(?P<y>\d\d)",
# XXX: Does 'Y' need to worry about having less or more than
# 4 digits?
'Y': r"(?P<Y>\d\d\d\d)",
'z': r"(?P<z>[+-]\d\d:?[0-5]\d(:?[0-5]\d(\.\d{1,6})?)?|Z)",
'A': self.__seqToRE(self.locale_time.f_weekday, 'A'),
'a': self.__seqToRE(self.locale_time.a_weekday, 'a'),
'B': self.__seqToRE(self.locale_time.f_month[1:], 'B'),
'b': self.__seqToRE(self.locale_time.a_month[1:], 'b'),
'p': self.__seqToRE(self.locale_time.am_pm, 'p'),
'Z': self.__seqToRE(pytz.all_timezones, 'Z'),
'%': '%'})

shows the regex is static.

The iso code references the strptime function written for cpython here quite liberally.

My solution was essentially to check for the locales currently being overridden, and not run the ValueError tests for them. Alternative solutions include getting rid of checking the message matches for the ValueError, since both throw ValueErrors, or deleting the tests that use a format dependent on locale. I'm not sure how to proceed further.

@gfyoung @jreback @WillAyd

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loooks pretty good. ping when updated and green.

ISO weeks start on Mondays, with week 01 being the week containing 4 Jan.
ISO week days range from 1 (Monday) to 7 (Sunday).

:param iso_year: the year taken from format %G
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

def test_ValueError_iso_week_year(self, msg, s, _format):
# See GH#16607
if locale.getlocale() != ('zh_CN', 'UTF-8') and \
locale.getlocale() != ('it_IT', 'UTF-8'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont use \ on the break, use parens instead.

name this test: test_error_iso_week_year

@jreback jreback added this to the 0.25.0 milestone Mar 10, 2019
@RjLi13
Copy link
Contributor Author

RjLi13 commented Mar 10, 2019

@jreback I addressed comments and rebased my branch to latest.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RjLi13 lgtm. just some small comments. I think we need to put a link to the stftime docs in pd.to_datetime doc-string (like we do in doc/source/user_guide/timeseries.rst). can do here or in a followup.

# However, as discussed on PR#25541, overriding the locale
# causes a different error to be thrown due to the format being
# locale specific, but the test data is in english.
# Thus, a hack I did was to only run this test if locale was not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove from here down (the comments) as it is enough by here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RjLi13 : Did you address this comment here?

Copy link
Contributor Author

@RjLi13 RjLi13 Mar 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung I believe I did. The comment for the function was much longer but I took @jreback 's feedback to mean everything below the line where he commented and deleted it. I don't have the commit log to show unfortunately that I wrote a few extra sentences to this comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the sentence starting with: "Thus, a hack I did was..." necessary? Comments regarding first-person events are generally omitted, or should be reworded to not use first-person.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung I can refactor to not use first person. If you have something in mind already let me know

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you could refactor to not use first person, that probably would be sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@RjLi13
Copy link
Contributor Author

RjLi13 commented Mar 10, 2019

RR (Review Ready)

@RjLi13
Copy link
Contributor Author

RjLi13 commented Mar 16, 2019

@jreback @gfyoung @WillAyd could you continue your reviews before I rebase to latest master?

@RjLi13
Copy link
Contributor Author

RjLi13 commented Mar 17, 2019

@jreback @gfyoung @WillAyd RR

@jreback jreback merged commit e8d951d into pandas-dev:master Mar 18, 2019
@jreback
Copy link
Contributor

jreback commented Mar 18, 2019

thanks @RjLi13 really nice patch! keep em coming!

@RjLi13 RjLi13 deleted the iso_fix branch March 19, 2019 04:03
sighingnow added a commit to sighingnow/pandas that referenced this pull request Mar 20, 2019
* origin/master:
  DOC: clean bug fix section in whatsnew (pandas-dev#25792)
  DOC: Fixed PeriodArray api ref (pandas-dev#25526)
  Move locale code out of tm, into _config (pandas-dev#25757)
  Unpin pycodestyle (pandas-dev#25789)
  Add test for rdivmod on EA array (GH23287) (pandas-dev#24047)
  ENH: Support datetime.timezone objects (pandas-dev#25065)
  Cython language level 3 (pandas-dev#24538)
  API: concat on sparse values (pandas-dev#25719)
  TST: assert_produces_warning works with filterwarnings (pandas-dev#25721)
  make core.config self-contained (pandas-dev#25613)
  CLN: replace %s syntax with .format in pandas.io.parsers (pandas-dev#24721)
  TST: Check pytables<3.5.1 when skipping (pandas-dev#25773)
  DOC: Fix typo in docstring of DataFrame.memory_usage  (pandas-dev#25770)
  Replace dicts with OrderedDicts in groupby aggregation functions (pandas-dev#25693)
  TST: Fixturize tests/frame/test_missing.py (pandas-dev#25640)
  DOC: Improve the docsting of Series.iteritems (pandas-dev#24879)
  DOC: Fix function name. (pandas-dev#25751)
  Implementing iso_week_year support for to_datetime (pandas-dev#25541)
  DOC: clarify corr behaviour when using a callable (pandas-dev#25732)
  remove unnecessary check_output (pandas-dev#25755)

# Conflicts:
#	doc/source/whatsnew/v0.25.0.rst
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_datetime should support ISO week year
3 participants