Skip to content

ENH: make infer_datetime_format strict #48596

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
MarcoGorelli opened this issue Sep 17, 2022 · 4 comments
Closed

ENH: make infer_datetime_format strict #48596

MarcoGorelli opened this issue Sep 17, 2022 · 4 comments
Labels
Datetime Datetime data dtype Needs Discussion Requires discussion from core team before further action

Comments

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Sep 17, 2022

to_datetime has an argument infer_datetime_format which, if set to True, will guess the format from the first non-NaN row.

People (users, but also core devs, e.g. here and here), expect that the format inferred from the first row will be applied to the rest of the series. i.e. that the following two should behave the same:

pd.to_datetime(['01-31-2000', '20-01-2000'], infer_datetime_format=True)
pd.to_datetime(['01-31-2000', '20-01-2000'], format='%m-%d-%Y')

However, they don't: the latter raises, whilst the first one swaps format midway.

Although this is documented in the user guide, it's not what people expect.

Making this argument strict would align more to people's expectations, but also simplify the codebase, as it would get rid of special-casing such as

if not infer_datetime_format:
if errors == "raise":
raise
elif errors == "coerce":
result = np.empty(arg.shape, dtype="M8[ns]")
iresult = result.view("i8")
iresult.fill(iNaT)
else:
result = arg
else:
# Indicates to the caller to fallback to objects_to_datetime64ns
return None

TL;RD I'm suggesting that when using infer_datetime_format=True, the format detected from the first non-NaN value should be used to parse the rest of the Series, exactly as if the user had passed it to format=

This would be one step towards addressing #12585

@pandas-dev/pandas-core any thoughts here?


EDIT: I'm hoping that #48621 can supersede this

@MarcoGorelli MarcoGorelli added Datetime Datetime data dtype Needs Discussion Requires discussion from core team before further action labels Sep 17, 2022
@datapythonista
Copy link
Member

+1, but if I'm not missing anything, I think there are two different problems, being how misleading is the parameter name the main one. See this example:

>>> pandas.to_datetime(['01-31-2000', '31-01-2000'], infer_datetime_format=False, errors='raise')
DatetimeIndex(['2000-01-31', '2000-01-31'], dtype='datetime64[ns]', freq=None)

>>> pandas.to_datetime(['01-31-2000', '31-01-2000'], infer_datetime_format=True, errors='raise')
DatetimeIndex(['2000-01-31', '2000-01-31'], dtype='datetime64[ns]', freq=None)

In both cases it's inferring the datetime format, it's just that the second implementation is faster. I think if the parameter name was use_optimization, then there wouldn't be any problem with user expectations.

The second problem and very coupled to the optimized implementation is that there is no way to tell pandas to just use the first inferred format. I think the parameter name infer_datetime_format=True is still misleading to describe this. Something like force_same_format=True would make more sense to me. I think then it'd be very clear that if formats are different the function will raise (or return NaT if coerce is used...).

So, +1 on your change, but besides it, I think we should also rename the parameter.

@MarcoGorelli
Copy link
Member Author

Thanks for weighing in

The second problem and very coupled to the optimized implementation is that there is no way to tell pandas to just use the first inferred format

wouldn't the following patch accomplish that?

diff --git a/pandas/core/tools/datetimes.py b/pandas/core/tools/datetimes.py
index 78e12c96ce..77bb5a1bcf 100644
--- a/pandas/core/tools/datetimes.py
+++ b/pandas/core/tools/datetimes.py
@@ -485,18 +485,14 @@ def _array_strptime_with_fallback(
         # if fmt was inferred, try falling back
         # to array_to_datetime - terminate here
         # for specified formats
-        if not infer_datetime_format:
-            if errors == "raise":
-                raise
-            elif errors == "coerce":
-                result = np.empty(arg.shape, dtype="M8[ns]")
-                iresult = result.view("i8")
-                iresult.fill(iNaT)
-            else:
-                result = arg
+        if errors == "raise":
+            raise
+        elif errors == "coerce":
+            result = np.empty(arg.shape, dtype="M8[ns]")
+            iresult = result.view("i8")
+            iresult.fill(iNaT)
         else:
-            # Indicates to the caller to fallback to objects_to_datetime64ns
-            return None
+            result = arg
     else:
         if "%Z" in fmt or "%z" in fmt:
             return _return_parsed_timezone_results(result, timezones, tz, name)

e.g. with the snippet you posted:

In [1]: pandas.to_datetime(['01-31-2000', '31-01-2000'], infer_datetime_format=True, errors='raise')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [1], line 1
----> 1 pandas.to_datetime(['01-31-2000', '31-01-2000'], infer_datetime_format=True, errors='raise')

File ~/pandas-dev/pandas/core/tools/datetimes.py:1120, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1118         result = _convert_and_box_cache(argc, cache_array)
   1119     else:
-> 1120         result = convert_listlike(argc, format)
   1121 else:
   1122     result = convert_listlike(np.array([arg]), format)[0]

File ~/pandas-dev/pandas/core/tools/datetimes.py:431, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    428         format = None
    430 if format is not None:
--> 431     res = _to_datetime_with_format(
    432         arg, orig_arg, name, tz, format, exact, errors, infer_datetime_format
    433     )
    434     if res is not None:
    435         return res

File ~/pandas-dev/pandas/core/tools/datetimes.py:535, in _to_datetime_with_format(arg, orig_arg, name, tz, fmt, exact, errors, infer_datetime_format)
    532         return _box_as_indexlike(result, utc=utc, name=name)
    534 # fallback
--> 535 res = _array_strptime_with_fallback(
    536     arg, name, tz, fmt, exact, errors, infer_datetime_format
    537 )
    538 return res

File ~/pandas-dev/pandas/core/tools/datetimes.py:474, in _array_strptime_with_fallback(arg, name, tz, fmt, exact, errors, infer_datetime_format)
    471 utc = tz == "utc"
    473 try:
--> 474     result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors)
    475 except OutOfBoundsDatetime:
    476     if errors == "raise":

File ~/pandas-dev/pandas/_libs/tslibs/strptime.pyx:145, in pandas._libs.tslibs.strptime.array_strptime()
    143         iresult[i] = NPY_NAT
    144         continue
--> 145     raise ValueError(f"time data '{val}' does not match "
    146                      f"format '{fmt}' (match)")
    147 if len(val) != found.end():

ValueError: time data '31-01-2000' does not match format '%m-%d-%Y' (match)

I think people are already using this parameter expecting it to force the format to be the same row all rows, so not sure about renaming

@datapythonista
Copy link
Member

I need to check the code in more detail to see if the patch would work for all cases, but for the case I mentioned seems to do the job.

I'm +1 of making the change you propose (with the original parameter name). I do think that the parameter is misleading, and should better be renamed. And feels like adding a new parameter could make the transition easier: We don't change the behavior, but we can let the users know about the new parameter in the deprecation warning, in case forcing formats is what they were looking for. But in any case, that's a related but different discussion, so from my side feel free to move forward with the current parameter name, if you think it's better.

@MarcoGorelli
Copy link
Member Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

2 participants