Skip to content

Datetime parsing (PDEP-4): should those changes apply to DatetimeIndex(strings) as well? #50894

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jorisvandenbossche opened this issue Jan 20, 2023 · 3 comments
Labels
Datetime Datetime data dtype

Comments

@jorisvandenbossche
Copy link
Member

Moving this from #50411 (comment) to its own issue.

PDEP-4 (http://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html) has made the string->datetime parsing stricter. For example, using the example from the PDEP on the latest main branch:

In [1]: pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00'])
...
ValueError: time data "13-01-2000 00:00:00" doesn't match format "%m-%d-%Y %H:%M:%S", at position 1

The above parses fine on pandas < 2, but parses both strings differently (month first vs day first), and so with latest pandas the above now raises an error.

But if I pass those strings to the DatetimeIndex(..) constructor, we still happily parse this inconsistently:

In [2]: pd.DatetimeIndex(['12-01-2000 00:00:00', '13-01-2000 00:00:00'])
Out[2]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)

I don't think this was discussed in the PDEP discussion, but if we are changing this, shouldn't we make the parsing in DatetimeIndex consistent as well?

(sidenote: long term might want to make parsing in DatetimeIndex even stricter, as brought up in #50411 (comment) and comments below, eg only accepting ISO strings, or not even accepting strings at all, since you can use to_datetime instead. But let's leave that discussion for a separate thread, since that's something that can be deprecated, and not necessary for 2.0)

cc @MarcoGorelli

@jorisvandenbossche jorisvandenbossche added the Datetime Datetime data dtype label Jan 20, 2023
@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Jan 20, 2023
@jorisvandenbossche
Copy link
Member Author

Another note: I am writing this issue naively from observing the user behaviour. I am not very up to date on how DatetimeIndex(..) string parsing is implemented, and to what extent it has separate code paths for the actual parsing compared to to_datetime. So I don't know how straightforward it would be to make both parse strings similarly (for example, are there other aspects where DatetimeIndex(..) might behave differently when passed strings?)

@jorisvandenbossche
Copy link
Member Author

@jbrockmendel brought up the good point that there are some inconsistencies in keywords / capabilities between DatetimeIndex() and to_datetime() that complicates this.

DatetimeIndex generally doesn't support all keywords of to_datetime (which is good, I think, as people should use to_datetime for advanced string parsing. Specifically, it doesn't have the formatkeyword), but it also does have some keywords that to_datetime does not have:

  • The tz keyword (to_datetime only has utc=True/False).
    For example, this allows pd.DatetimeIndex(["2022-01-01 09:00"], tz="Europe/Brussels") while with to_datetime you need to do pd.to_datetime(["2022-01-01 09:00:00"]).tz_localize("Europe/Brussels") (I seem to remember that in the past we have said that the latter is better for to_datetime because it is more explicit)
  • Related to tz, the ambiguous keyword (I assume this works similarly as the keyword in tz_localize)
  • freq keyword, so you can infer or set the frequency. With to_datetime you need to do that as a subsequent step.
  • dtype keyword, which allows to specify the resolution (this is new, so I wouldn't worry about this, and we need to add some way to to_datetime to achieve this as well)

This might make it harder to change DatetimeIndex (i.e. also make its string parsing strict), as it doesn't have the format keyword for the cases when the format can't be inferred, and you still want to make use of one of those keywords of DatetimeIndex.
Although I would argue that each of those keywords has a decent alternative with to_datetime. If you have a case where we start raising an error in 2.0, you will need to update your code anyway to accommodate this, so then you can also change it to use to_datetime if the fix you need is to specify the format.

@MarcoGorelli
Copy link
Member

removing the 2.0 milestone, but there's plenty of scope to take pdep4 further for 3.0 and beyond

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype
Projects
None yet
Development

No branches or pull requests

2 participants