-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
used regular expression in format_is_iso()
#50468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
used regular expression in format_is_iso()
#50468
Conversation
pandas/_libs/tslibs/parsing.pyx
Outdated
iso_regex = \ | ||
r"^\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}:\d{2}(\.\d+)?(([+-]\d{2}:\d{2})|Z)?)?)?)?$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for working on this - to keep it readable, how about making this a verbose with comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcoGorelli added some comments for explaining the regular expression and how it's evaluating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the link I provided, it explains how to write a regular expression with the verbose flag
Did you try running this locally?
Oh sorry didn't read the lin, let me fix it |
38bd5f0
to
77a8f02
Compare
pandas/_libs/tslibs/parsing.pyx
Outdated
r""" | ||
^ # start of string | ||
\d{4} # match a 4-digit year | ||
(-\d{2})? # optionally match a 2-digit month |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
date_sep
isn't limited to -
import re
def format_is_iso(f: str) -> bool:
"""
Does format match the iso8601 set that can be handled by the C parser?
Generally of form YYYY-MM-DDTHH:MM:SS - date separator can be different
but must be consistent. Leading 0s in dates and times are optional.
"""
iso_regex = re.compile(
r"""
^ # start of string
\d{4} # match a 4-digit year
(-\d{2})? # optionally match a 2-digit month
(-\d{2})? # optionally match a 2-digit day
(T\d{2}:\d{2}:\d{2} # match time in the format "THH:MM:SS"
(\.\d+)? # optionally match a decimal and fractional seconds
([+-]\d{2}:\d{2}|Z)?)? # optionally match a timezone in the format "+HH:MM" or "Z"
$ # end of string
""",
re.VERBOSE,
)
excluded_formats = ["%Y%m%d", "%Y%m", "%Y"]
if re.match(iso_regex, f) and f not in excluded_formats:
return True
return False
res = format_is_iso('2022-12-38T17:26:49Z')
print(res) |
2dfa309
to
f30c0c2
Compare
([ -/\\.]\d{2}|\d{2})? # optionally match a 2-digit month
([ -/\\.]\d{2}|\d{2})? # optionally match a 2-digit day
([ T]\d{2}:\d{2}:\d{2} # match time in the format "THH:MM:SS" Added some separators regex to separate the dates & time |
getting there, but it's not quite right,it's failing lots of tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests are still failing
at a minimum, you need to get
pytest pandas/tests/tslibs/test_parsing.py
to pass - please make sure that this passes locally for you
Sorry about the inconvenience I didn't knew we can test this on local system |
@MarcoGorelli ran the test on the non-regular-expression version they are also failing. What should I do? any suggestions? |
did you compile the C extensions? https://pandas.pydata.org/docs/dev/development/contributing_environment.html#step-3-build-and-install-pandas |
Hey, @MarcoGorelli didn't understand why it's @pytest.mark.parametrize(
"fmt,expected",
[
("%Y %m %d %H:%M:%S", True),
("%Y/%m/%d %H:%M:%S", True),
(r"%Y\%m\%d %H:%M:%S", True),
("%Y-%m-%d %H:%M:%S", True),
("%Y.%m.%d %H:%M:%S", True),
("%Y%m%d %H:%M:%S", True),
("%Y-%m-%dT%H:%M:%S", True),
("%Y-%m-%dT%H:%M:%S%z", True),
("%Y-%m-%dT%H:%M:%S%Z", False),
("%Y-%m-%dT%H:%M:%S.%f", True),
("%Y-%m-%dT%H:%M:%S.%f%z", True),
("%Y-%m-%dT%H:%M:%S.%f%Z", False),
("%Y%m%d", False),
("%Y%m", False),
("%Y", False),
("%Y-%m-%d", True),
("%Y-%m", True),
],
)
def test_is_iso_format(fmt, expected):
# see gh-41047
result = parsing.format_is_iso(fmt)
assert result == expected Wrote the code with the dates in the format of the above and checked the result with the excepted output, the result output is matching with the expected out import re
def format_is_iso(f: str) -> bool:
"""
Does format match the iso8601 set that can be handled by the C parser?
Generally of form YYYY-MM-DDTHH:MM:SS - date separator can be different
but must be consistent. Leading 0s in dates and times are optional.
"""
iso_regex = re.compile(
r"""
^ # start of string
\d{4} # match a 4-digit year
([ -/\\.]\d{2}|\d{2})? # optionally match a 2-digit month
([ -/\\.]\d{2}|\d{2})? # optionally match a 2-digit day
([ T]\d{2}:\d{2}:\d{2} # match time in the format "THH:MM:SS"
(\.\d+)? # match a decimal and fractional seconds
([+-]\d{2}:\d{2}|Z)?)? # match timezone in the format "+HH:MM" or "Z"
$ # end of string
""",
re.VERBOSE,
)
excluded_formats = [
r"^\d{4}\d{2}\d{2}$", # %Y%m%d
r"^\d{4}\d{2}$", # %Y%m
r"^\d{4}$", # %Y
]
if any(re.match(pattern, f) for pattern in excluded_formats):
return False
return bool(re.match(iso_regex, f))
dates = [
"2022 12 29 12:34:56",
"2022/12/29 12:34:56",
"2022\\12\\29 12:34:56",
"2022-12-29 12:34:56",
"2022.12.29 12:34:56",
"20221229 12:34:56",
"2022-12-29T12:34:56",
"2022-12-29T12:34:56+01:00",
"2022-12-29T12:34:56CET",
"2022-12-29T12:34:56.123456",
"2022-12-29T12:34:56.123456+01:00",
"2022-12-29T12:34:56.123456CET",
"20221229",
"202212",
"2022",
"2022-12-29",
"2022-12"
]
for date in dates:
res = format_is_iso(date)
print(date, res) But at the time of |
does the test pass on the |
The test got passed on the main branch with the |
then you need to fix up your regular expression 😄 |
this might help https://stackoverflow.com/a/74953592/4451315 |
Hey the upstream got the function remove, don't know what exactly happened |
it moved to a different file, if you do a search (e.g. with |
Nice! If that works, it looks good to me! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this regex works, it looks good to me, thanks for sticking with it
Would be good to time it too
pandas/_libs/tslibs/strptime.pyx
Outdated
^ # start of string | ||
%Y # Year | ||
(?:([-/ \\.]?)%m # month with or without separators | ||
(?:\1%d # day with or without separators |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps let's update the comment to "day with same separator as for year-month"
?
pandas/_libs/tslibs/strptime.pyx
Outdated
excluded_formats = [ | ||
r"^%Y%m$", | ||
] | ||
if any(re.match(pattern, f) for pattern in excluded_formats): | ||
return False | ||
return bool(re.match(iso_regex, f)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need a regex for the excluded formats - would
return re.match(iso_regex, f) is not None and f not in excluded_formats
work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea worked fine without the regex excluded_format
pandas/_libs/tslibs/strptime.pyx
Outdated
from pandas._libs.tslibs.np_datetime import OutOfBoundsDatetime | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you revert these unrelated changes please?
hey, by the way, |
it works for me, so my only guess is that it has to do with C extensions again - if you post your error (here or on slack) I'll take a look |
no worries - I suspect it's this #47305
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, seems to be working!
Couple of minor comments - then I'll time this and we'll see if we can get it in
Timings look fine: upstream/main: In [2]: %%timeit
...: format_is_iso('%Y-%m-%d %H:%M%S%z')
...:
...:
2.64 µs ± 27.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each) here: In [1]: from pandas._libs.tslibs.strptime import format_is_iso
In [2]: %%timeit
...: format_is_iso('%Y-%m-%d %H:%M%S%z')
...:
...:
1.39 µs ± 67.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each) (note: I had to change I prefer it written like this anyway |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me - @WillAyd seeing as you'd suggested this, any thoughts?
lgtm. Nice performance boost. Anything else left to do or want to take out of draft status? |
@ShashwatAgrawal20 if this is ready, could you mark the PR as ready for review please? https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/changing-the-stage-of-a-pull-request#marking-a-pull-request-as-ready-for-review let's also add a whatsnew note, something like
|
|
That's right! This is better than the original btw, because this would fail on Leaving open a bit in case there's objections, then will merge |
wouldn't have been possible without your guidance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @ShashwatAgrawal20
doc/source/whatsnew/v2.0.0.rst
file if fixing a bug or adding a new feature.Added regular expressions in the
format_is_iso()
function