-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: make infer_datetime_format strict #48596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
+1, but if I'm not missing anything, I think there are two different problems, being how misleading is the parameter name the main one. See this example: >>> pandas.to_datetime(['01-31-2000', '31-01-2000'], infer_datetime_format=False, errors='raise')
DatetimeIndex(['2000-01-31', '2000-01-31'], dtype='datetime64[ns]', freq=None)
>>> pandas.to_datetime(['01-31-2000', '31-01-2000'], infer_datetime_format=True, errors='raise')
DatetimeIndex(['2000-01-31', '2000-01-31'], dtype='datetime64[ns]', freq=None) In both cases it's inferring the datetime format, it's just that the second implementation is faster. I think if the parameter name was The second problem and very coupled to the optimized implementation is that there is no way to tell pandas to just use the first inferred format. I think the parameter name So, +1 on your change, but besides it, I think we should also rename the parameter. |
Thanks for weighing in
wouldn't the following patch accomplish that? diff --git a/pandas/core/tools/datetimes.py b/pandas/core/tools/datetimes.py
index 78e12c96ce..77bb5a1bcf 100644
--- a/pandas/core/tools/datetimes.py
+++ b/pandas/core/tools/datetimes.py
@@ -485,18 +485,14 @@ def _array_strptime_with_fallback(
# if fmt was inferred, try falling back
# to array_to_datetime - terminate here
# for specified formats
- if not infer_datetime_format:
- if errors == "raise":
- raise
- elif errors == "coerce":
- result = np.empty(arg.shape, dtype="M8[ns]")
- iresult = result.view("i8")
- iresult.fill(iNaT)
- else:
- result = arg
+ if errors == "raise":
+ raise
+ elif errors == "coerce":
+ result = np.empty(arg.shape, dtype="M8[ns]")
+ iresult = result.view("i8")
+ iresult.fill(iNaT)
else:
- # Indicates to the caller to fallback to objects_to_datetime64ns
- return None
+ result = arg
else:
if "%Z" in fmt or "%z" in fmt:
return _return_parsed_timezone_results(result, timezones, tz, name) e.g. with the snippet you posted:
I think people are already using this parameter expecting it to force the format to be the same row all rows, so not sure about renaming |
I need to check the code in more detail to see if the patch would work for all cases, but for the case I mentioned seems to do the job. I'm +1 of making the change you propose (with the original parameter name). I do think that the parameter is misleading, and should better be renamed. And feels like adding a new parameter could make the transition easier: We don't change the behavior, but we can let the users know about the new parameter in the deprecation warning, in case forcing formats is what they were looking for. But in any case, that's a related but different discussion, so from my side feel free to move forward with the current parameter name, if you think it's better. |
to_datetime has an argument
infer_datetime_format
which, if set toTrue
, will guess the format from the first non-NaN row.People (users, but also core devs, e.g. here and here), expect that the format inferred from the first row will be applied to the rest of the series. i.e. that the following two should behave the same:
However, they don't: the latter raises, whilst the first one swaps format midway.
Although this is documented in the user guide, it's not what people expect.
Making this argument strict would align more to people's expectations, but also simplify the codebase, as it would get rid of special-casing such as
pandas/pandas/core/tools/datetimes.py
Lines 488 to 499 in ac648ee
TL;RD I'm suggesting that when using
infer_datetime_format=True
, the format detected from the first non-NaN value should be used to parse the rest of the Series, exactly as if the user had passed it toformat=
This would be one step towards addressing #12585
@pandas-dev/pandas-core any thoughts here?
EDIT: I'm hoping that #48621 can supersede this
The text was updated successfully, but these errors were encountered: