Skip to content

used regular expression in format_is_iso() #50468

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

ShashwatAgrawal20
Copy link
Contributor

@ShashwatAgrawal20 ShashwatAgrawal20 commented Dec 28, 2022

Added regular expressions in the format_is_iso() function

def format_is_iso(f: str) -> bint:
    """
    Does format match the iso8601 set that can be handled by the C parser?
    Generally of form YYYY-MM-DDTHH:MM:SS - date separator can be different
    but must be consistent.  Leading 0s in dates and times are optional.
    """
    iso_regex = \
        r"^\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}:\d{2}(\.\d+)?(([+-]\d{2}:\d{2})|Z)?)?)?)?$"
    excluded_formats = ["%Y%m%d", "%Y%m", "%Y"]
    if re.match(iso_regex, f) and f not in excluded_formats:
        return True
    return False

Comment on lines 827 to 828
iso_regex = \
r"^\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}:\d{2}(\.\d+)?(([+-]\d{2}:\d{2})|Z)?)?)?)?$"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for working on this - to keep it readable, how about making this a verbose with comments?

see https://stackoverflow.com/a/13852096/4451315

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli added some comments for explaining the regular expression and how it's evaluating

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the link I provided, it explains how to write a regular expression with the verbose flag

Did you try running this locally?

@ShashwatAgrawal20
Copy link
Contributor Author

Oh sorry didn't read the lin, let me fix it

r"""
^ # start of string
\d{4} # match a 4-digit year
(-\d{2})? # optionally match a 2-digit month
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

date_sep isn't limited to -

@ShashwatAgrawal20
Copy link
Contributor Author

import re

def format_is_iso(f: str) -> bool:
    """
    Does format match the iso8601 set that can be handled by the C parser?
    Generally of form YYYY-MM-DDTHH:MM:SS - date separator can be different
    but must be consistent.  Leading 0s in dates and times are optional.
    """
    iso_regex = re.compile(
        r"""
        ^                         # start of string
        \d{4}                     # match a 4-digit year
        (-\d{2})?                 # optionally match a 2-digit month
        (-\d{2})?                 # optionally match a 2-digit day
        (T\d{2}:\d{2}:\d{2}       # match time in the format "THH:MM:SS"
        (\.\d+)?                  # optionally match a decimal and fractional seconds
        ([+-]\d{2}:\d{2}|Z)?)?    # optionally match a timezone in the format "+HH:MM" or "Z"
        $                         # end of string
        """,
        re.VERBOSE,
    )
    excluded_formats = ["%Y%m%d", "%Y%m", "%Y"]
    if re.match(iso_regex, f) and f not in excluded_formats:
        return True
    return False


res = format_is_iso('2022-12-38T17:26:49Z')
print(res)

@ShashwatAgrawal20
Copy link
Contributor Author

        ([ -/\\.]\d{2}|\d{2})?  # optionally match a 2-digit month
        ([ -/\\.]\d{2}|\d{2})?  # optionally match a 2-digit day
        ([ T]\d{2}:\d{2}:\d{2}  # match time in the format "THH:MM:SS"

Added some separators regex to separate the dates & time

@MarcoGorelli
Copy link
Member

getting there, but it's not quite right,it's failing lots of tests

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests are still failing

at a minimum, you need to get

pytest pandas/tests/tslibs/test_parsing.py

to pass - please make sure that this passes locally for you

@ShashwatAgrawal20
Copy link
Contributor Author

tests are still failing

at a minimum, you need to get

pytest pandas/tests/tslibs/test_parsing.py

to pass - please make sure that this passes locally for you

Sorry about the inconvenience I didn't knew we can test this on local system

@ShashwatAgrawal20
Copy link
Contributor Author

@MarcoGorelli ran the test on the non-regular-expression version they are also failing. What should I do? any suggestions?
image

@MarcoGorelli
Copy link
Member

@ShashwatAgrawal20
Copy link
Contributor Author

@MarcoGorelli ran the test on the non-regular-expression version they are also failing. What should I do? any suggestions? image

Hey, @MarcoGorelli didn't understand why it's assert False == True in all. It should compare the expected value with the result value.
As in the file pandas/tests/tslibs/test_parsing.py

@pytest.mark.parametrize(
    "fmt,expected",
    [
        ("%Y %m %d %H:%M:%S", True),
        ("%Y/%m/%d %H:%M:%S", True),
        (r"%Y\%m\%d %H:%M:%S", True),
        ("%Y-%m-%d %H:%M:%S", True),
        ("%Y.%m.%d %H:%M:%S", True),
        ("%Y%m%d %H:%M:%S", True),
        ("%Y-%m-%dT%H:%M:%S", True),
        ("%Y-%m-%dT%H:%M:%S%z", True),
        ("%Y-%m-%dT%H:%M:%S%Z", False),
        ("%Y-%m-%dT%H:%M:%S.%f", True),
        ("%Y-%m-%dT%H:%M:%S.%f%z", True),
        ("%Y-%m-%dT%H:%M:%S.%f%Z", False),
        ("%Y%m%d", False),
        ("%Y%m", False),
        ("%Y", False),
        ("%Y-%m-%d", True),
        ("%Y-%m", True),
    ],
)
def test_is_iso_format(fmt, expected):
    # see gh-41047
    result = parsing.format_is_iso(fmt)
    assert result == expected

Wrote the code with the dates in the format of the above and checked the result with the excepted output, the result output is matching with the expected out

import re


def format_is_iso(f: str) -> bool:
    """
    Does format match the iso8601 set that can be handled by the C parser?
    Generally of form YYYY-MM-DDTHH:MM:SS - date separator can be different
    but must be consistent.  Leading 0s in dates and times are optional.
    """
    iso_regex = re.compile(
        r"""
        ^                       # start of string
        \d{4}                   # match a 4-digit year
        ([ -/\\.]\d{2}|\d{2})?  # optionally match a 2-digit month
        ([ -/\\.]\d{2}|\d{2})?  # optionally match a 2-digit day
        ([ T]\d{2}:\d{2}:\d{2}  # match time in the format "THH:MM:SS"
        (\.\d+)?                # match a decimal and fractional seconds
        ([+-]\d{2}:\d{2}|Z)?)?  # match timezone in the format "+HH:MM" or "Z"
        $                       # end of string
        """,
        re.VERBOSE,
    )

    excluded_formats = [
        r"^\d{4}\d{2}\d{2}$",  # %Y%m%d
        r"^\d{4}\d{2}$",       # %Y%m
        r"^\d{4}$",            # %Y
    ]

    if any(re.match(pattern, f) for pattern in excluded_formats):
        return False

    return bool(re.match(iso_regex, f))


dates = [
    "2022 12 29 12:34:56",
    "2022/12/29 12:34:56",
    "2022\\12\\29 12:34:56",
    "2022-12-29 12:34:56",
    "2022.12.29 12:34:56",
    "20221229 12:34:56",
    "2022-12-29T12:34:56",
    "2022-12-29T12:34:56+01:00",
    "2022-12-29T12:34:56CET",
    "2022-12-29T12:34:56.123456",
    "2022-12-29T12:34:56.123456+01:00",
    "2022-12-29T12:34:56.123456CET",
    "20221229",
    "202212",
    "2022",
    "2022-12-29",
    "2022-12"
]

for date in dates:
    res = format_is_iso(date)
    print(date, res)

image

But at the time of pytest it fails again

@MarcoGorelli
Copy link
Member

does the test pass on the main branch? if not, then I guess that your pandas build isn't quite correct - check the installation instructions I sent earlier

@ShashwatAgrawal20
Copy link
Contributor Author

does the test pass on the main branch? if not, then I guess that your pandas build isn't quite correct - check the installation instructions I sent earlier

The test got passed on the main branch with the non-regular-expression.

@MarcoGorelli
Copy link
Member

then you need to fix up your regular expression 😄

@MarcoGorelli
Copy link
Member

this might help https://stackoverflow.com/a/74953592/4451315

@ShashwatAgrawal20
Copy link
Contributor Author

ShashwatAgrawal20 commented Dec 30, 2022

Hey the upstream got the function remove, don't know what exactly happened

@MarcoGorelli
Copy link
Member

it moved to a different file, if you do a search (e.g. with git grep) you'll find it

@MarcoGorelli
Copy link
Member

Nice! If that works, it looks good to me!

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this regex works, it looks good to me, thanks for sticking with it

Would be good to time it too

^ # start of string
%Y # Year
(?:([-/ \\.]?)%m # month with or without separators
(?:\1%d # day with or without separators
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps let's update the comment to "day with same separator as for year-month"?

Comment on lines 82 to 87
excluded_formats = [
r"^%Y%m$",
]
if any(re.match(pattern, f) for pattern in excluded_formats):
return False
return bool(re.match(iso_regex, f))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a regex for the excluded formats - would

return re.match(iso_regex, f) is not None and f not in excluded_formats

work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea worked fine without the regex excluded_format

from pandas._libs.tslibs.np_datetime import OutOfBoundsDatetime

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you revert these unrelated changes please?

@ShashwatAgrawal20
Copy link
Contributor Author

Nice! If that works, it looks good to me!

hey, by the way, pytest pandas/tests/tslibs/test_parsing.py having some import error, is that on my device or it is actually an issue?

@MarcoGorelli
Copy link
Member

it works for me, so my only guess is that it has to do with C extensions again - if you post your error (here or on slack) I'll take a look

@ShashwatAgrawal20
Copy link
Contributor Author

it works for me, so my only guess is that it has to do with C extensions again - if you post your error (here or on slack) I'll take a look

Sorry for the inconvenience still a noob in the open source world, but learned a lot by your guidance

image

@MarcoGorelli
Copy link
Member

no worries - I suspect it's this #47305
I'd suggest cleaning and rebuilding:

git clean -fxfd pandas
python setup.py build_ext -j 4

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, seems to be working!

Couple of minor comments - then I'll time this and we'll see if we can get it in

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jan 1, 2023

Timings look fine:

upstream/main:

In [2]: %%timeit
   ...: format_is_iso('%Y-%m-%d %H:%M%S%z')
   ...: 
   ...: 
2.64 µs ± 27.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

here:

In [1]:  from pandas._libs.tslibs.strptime import format_is_iso

In [2]: %%timeit
   ...: format_is_iso('%Y-%m-%d %H:%M%S%z')
   ...: 
   ...: 
1.39 µs ± 67.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

(note: I had to change cdef to cpdef in both cases to be able to run this test)

I prefer it written like this anyway

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - @WillAyd seeing as you'd suggested this, any thoughts?

@WillAyd
Copy link
Member

WillAyd commented Jan 1, 2023

lgtm. Nice performance boost. Anything else left to do or want to take out of draft status?

@MarcoGorelli
Copy link
Member

@ShashwatAgrawal20 if this is ready, could you mark the PR as ready for review please? https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/changing-the-stage-of-a-pull-request#marking-a-pull-request-as-ready-for-review

let's also add a whatsnew note, something like

Performance improvement in :func:`to_datetime` when format is given or can be inferred (:issue:`50465`)

@ShashwatAgrawal20 ShashwatAgrawal20 marked this pull request as ready for review January 2, 2023 11:16
@ShashwatAgrawal20
Copy link
Contributor Author

@ShashwatAgrawal20 if this is ready, could you mark the PR as ready for review please? https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/changing-the-stage-of-a-pull-request#marking-a-pull-request-as-ready-for-review

let's also add a whatsnew note, something like

Performance improvement in :func:`to_datetime` when format is given or can be inferred (:issue:`50465`)

doc/source/whatsnew/v2.0.0.rst here ?

@MarcoGorelli MarcoGorelli added this to the 2.0 milestone Jan 2, 2023
@MarcoGorelli MarcoGorelli added Datetime Datetime data dtype Performance Memory or execution speed performance labels Jan 2, 2023
@MarcoGorelli
Copy link
Member

That's right! This is better than the original btw, because this would fail on'%Y-%m-%d %H:', whereas the original would have accepted it

Leaving open a bit in case there's objections, then will merge

@ShashwatAgrawal20
Copy link
Contributor Author

That's right! This is better than the original btw, because this would fail on'%Y-%m-%d %H:', whereas the original would have accepted it

Leaving open a bit in case there's objections, then will merge

wouldn't have been possible without your guidance.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli MarcoGorelli merged commit 99859e4 into pandas-dev:main Jan 2, 2023
@ShashwatAgrawal20 ShashwatAgrawal20 deleted the regular-expression-50465 branch May 11, 2024 10:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF use regular expression in format_is_iso?
3 participants