BUG: inconsistent handling of exact=False case in to_datetime parsing #50412

Closed
jorisvandenbossche opened this issue Dec 23, 2022 · 2 comments · Fixed by #50435
Labels
Bug Datetime Datetime data dtype

Comments

@jorisvandenbossche
Member

From #50411 (comment)

The documentation of to_datetime says:

exact : bool, default True
Control how format is used:

  • If True, require an exact format match.
  • If False, allow the format to match anywhere in the target string.

Based on that, I would expect anything in the string that falls outside the format match to be ignored. But currently this seems to depend on which format is used:

>>> pd.to_datetime(["19/may/11 01:00:00"], format="%d/%b/%y", exact=False)
DatetimeIndex(['2011-05-19'], dtype='datetime64[ns]', freq=None)

>>> pd.to_datetime(["2011-05-19 01:00:00"], format="%Y-%m-%d", exact=False)
DatetimeIndex(['2011-05-19 01:00:00'], dtype='datetime64[ns]', freq=None)

(happens both on current main and on 1.5)
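To make the documented expectation concrete, here is a minimal standard-library sketch of "allow the format to match anywhere in the target string". It is not pandas' implementation: `DIRECTIVE_PATTERNS`, `format_to_regex`, and `parse_inexact` are hypothetical names, only a handful of directives are supported, and `%b` assumes an English locale.

```python
import re
from datetime import datetime

# Hypothetical, deliberately small directive table (illustration only).
DIRECTIVE_PATTERNS = {
    "%Y": r"\d{4}",
    "%y": r"\d{2}",
    "%m": r"\d{2}",
    "%d": r"\d{2}",
    "%b": r"[A-Za-z]{3}",  # assumes English month abbreviations
}


def format_to_regex(fmt):
    """Translate a strptime-style format into a regex (supported subset only)."""
    parts = []
    i = 0
    while i < len(fmt):
        token = fmt[i:i + 2]
        if token in DIRECTIVE_PATTERNS:
            parts.append(DIRECTIVE_PATTERNS[token])
            i += 2
        else:
            # Literal character such as "/" or "-".
            parts.append(re.escape(fmt[i]))
            i += 1
    return re.compile("".join(parts))


def parse_inexact(value, fmt):
    """Find the first substring matching fmt and parse only that substring."""
    match = format_to_regex(fmt).search(value)
    if match is None:
        raise ValueError(f"time data {value!r} doesn't contain format {fmt!r}")
    # Everything outside the matched span is ignored.
    return datetime.strptime(match.group(0), fmt)


for value, fmt in [
    ("19/may/11 01:00:00", "%d/%b/%y"),
    ("2011-05-19 01:00:00", "%Y-%m-%d"),
    ("2011-05-19 AAA", "%Y-%m-%d"),
]:
    print(value, "->", parse_inexact(value, fmt))
```

Under this reading, every input yields midnight on 2011-05-19 regardless of which format is used, because the trailing text never reaches the parser.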

@jorisvandenbossche jorisvandenbossche added Bug Datetime Datetime data dtype labels Dec 23, 2022
@jorisvandenbossche
Member Author

On main, with the special-cased %Y%m%d format, the remainder is not taken into account:

>>> pd.to_datetime(["20110519 01:00:00"], format="%Y%m%d", exact=False)
DatetimeIndex(['2011-05-19'], dtype='datetime64[ns]', freq=None)

but with the changes of #50242, the remainder now actually is taken into account there as well (I suppose because the code path for this specific format was changed, cc @MarcoGorelli)

Another consequence of the ISO-like formats not actually ignoring the remainder of the string is that parsing also raises if that remainder is unparseable:

>>> pd.to_datetime(["2011-05-19 AAA"], format="%Y-%m-%d", exact=False)
...
ValueError: time data "2011-05-19 AAA" at position 0 doesn't match format "%Y-%m-%d"

So maybe it is just that the exact=False feature doesn't work / isn't implemented for ISO-like formats?
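Both symptoms (the trailing time being picked up, and an unparseable remainder raising) are what one would see from a parser that reads the whole string as ISO 8601 and never consults the format. The sketch below uses the standard library's `datetime.fromisoformat` purely as a stand-in for such a parser; `parse_whole_string_as_iso` is a hypothetical name, and this is an analogy, not a description of pandas' actual ISO code path.

```python
from datetime import datetime


def parse_whole_string_as_iso(value):
    """Stand-in for a fast path that ignores the format and reads the full string."""
    return datetime.fromisoformat(value)


# A valid trailing time is consumed rather than ignored.
print(parse_whole_string_as_iso("2011-05-19 01:00:00"))

# An unparseable remainder makes the whole parse fail.
try:
    parse_whole_string_as_iso("2011-05-19 AAA")
except ValueError as exc:
    print("raised:", exc)
```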

@MarcoGorelli
Member

nice catch, thanks! this is indeed a bug:

In [21]: pd.to_datetime(["2011-05-19 05:00:00"], format="%Y-%d-%m", exact=False)
Out[21]: DatetimeIndex(['2011-01-05'], dtype='datetime64[ns]', freq=None)

In [22]: pd.to_datetime(["2011-05-19 05:00:00"], format="%Y-%m-%d", exact=False)
Out[22]: DatetimeIndex(['2011-05-19 05:00:00'], dtype='datetime64[ns]', freq=None)

this needs fixing in the ISO path

2 participants