Skip to content

CI: avoid guess_datetime_format failure on 29th of Feburary #57674

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Mar 5, 2024

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Feb 29, 2024

# same default used by dateutil
default = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
default = datetime(1970, 1, 1)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the guessed format is checked explicitly against the input:

try:
array_strptime(np.asarray([dt_str], dtype=object), guessed_format)
except ValueError:
# Doesn't parse, so this can't be the correct format.
return None
# rebuild string, capturing any inferred padding
dt_str = "".join(tokens)
if parsed_datetime.strftime(guessed_format) == dt_str:
_maybe_warn_about_dayfirst(guessed_format, dayfirst)
return guessed_format
else:
return None

so this should be safe

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth adding a comment about why this default was chosen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, thanks

@MarcoGorelli MarcoGorelli marked this pull request as ready for review February 29, 2024 09:57
@MarcoGorelli MarcoGorelli added the CI Continuous Integration label Feb 29, 2024
@MarcoGorelli MarcoGorelli added this to the 2.2.2 milestone Feb 29, 2024
@MarcoGorelli MarcoGorelli changed the title Fix ci CI: avoid guess_datetime_format failure on 29th of Feburary Feb 29, 2024
@jbrockmendel
Copy link
Member

Looks reasonable. I guess it would be a PITA to write a test?

@MarcoGorelli
Copy link
Member Author

Looks reasonable. I guess it would be a PITA to write a test?

given that the function calls datetime.now from Cython - yes, probably

# Use this instead of the dateutil default of
# `datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)`
# as that causes issues on the 29th of February.
default = datetime(1970, 1, 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fail to understand what's going on. Surely happy to get this merged if it fixes the problem and you understand what's going on.

The root problem seems to be that guess_datetime_format('2003') is None instead of %Y. That would be easy to test if we want. What I don't understand is why changing the day and the month in the parsed date is relevant if the only token we have is the year...

>>> dateutil.parser.parse('2003', dayfirst=True, default=datetime.datetime.now().replace(hour=0, minute=0, second=0, microsecond=0))
datetime.datetime(2003, 2, 28, 0, 0)
>>> dateutil.parser.parse('2003', dayfirst=True, default=datetime.datetime(1970, 1, 1))
datetime.datetime(2003, 1, 1, 0, 0)

I didn't debug it, I guess you did, but I can't see why in the implementation it should make a difference. As said, if you are confident that this is a good solution and the algorithm is relivable happy to get this merged. But seems like there is a bug in the algorithm below we are not preventing/hacking, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah seems like there's something in pandas' vendored code which makes dateutil_parse fail, even though non-vendored dateutil itself parses it. that's the problem with vendoring I guess...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@datapythonista - currently in main we take today's date, replace it with the bits that the user passed, and then check if it's a valid date. E.g. "2002" on 2/29 becomes "2002-02-29" which is not a valid date. From this we conclude we've somehow guessed wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks both. I see, I assumed dateutil_parse was an alias to the dateutil parser. I see the copied version has the comment lifted from dateutil to get resolution, maybe the dateutil parser works at second resolution and we work at microsecond or nanosecond, and that's why the function was copied to pandas?

In any case I think the fix is fine, a function to get the format from a partial date is surely not going to be perfect anyway.

@mroeschke
Copy link
Member

Looks good. Could you double check this typing issue? https://github.com/pandas-dev/pandas/actions/runs/8108693644/job/22162385490?pr=57674

@MarcoGorelli
Copy link
Member Author

thanks - that one should be addressed in #57689

@mroeschke mroeschke modified the milestones: 2.2.2, 3.0 Mar 5, 2024
@mroeschke mroeschke added the Testing pandas testing functions or related to the test suite label Mar 5, 2024
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Going to retarget for 3.0 since the next time this will fail is in 4 years by which 2.2.x will be unsupported :)

@mroeschke mroeschke merged commit 83112d7 into pandas-dev:main Mar 5, 2024
@mroeschke
Copy link
Member

Thanks @MarcoGorelli

pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: List of years (as string) raises UserWarning with to_datetime
5 participants