BUG: Behavior with fallback between raise and coerce #46071 #47745

Closed
wants to merge 11 commits into from
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.5.0.rst
@@ -902,6 +902,7 @@ Datetimelike
- Bug in :meth:`DatetimeIndex.resolution` incorrectly returning "day" instead of "nanosecond" for nanosecond-resolution indexes (:issue:`46903`)
- Bug in :class:`Timestamp` with an integer or float value and ``unit="Y"`` or ``unit="M"`` giving slightly-wrong results (:issue:`47266`)
- Bug in :class:`.DatetimeArray` construction when passed another :class:`.DatetimeArray` and ``freq=None`` incorrectly inferring the freq from the given array (:issue:`47296`)
- Bug in :func:`to_datetime` where ``infer_datetime_format`` fallback would not run if ``errors="coerce"`` (:issue:`46071`)
- Bug in :func:`to_datetime` where ``OutOfBoundsDatetime`` would be thrown even if ``errors="coerce"`` if there were more than 50 rows (:issue:`45319`)
- Bug when adding a :class:`DateOffset` to a :class:`Series` would not add the ``nanoseconds`` field (:issue:`47856`)
-
8 changes: 7 additions & 1 deletion pandas/core/tools/datetimes.py
@@ -501,6 +501,9 @@ def _array_strptime_with_fallback(
    if "%Z" in fmt or "%z" in fmt:
        return _return_parsed_timezone_results(result, timezones, tz, name)

    if infer_datetime_format and np.isnan(result).any():
        return None

    return _box_as_indexlike(result, utc=utc, name=name)


@@ -798,7 +801,10 @@ def to_datetime(
If :const:`True` and no `format` is given, attempt to infer the format
of the datetime strings based on the first non-NaN element,
and if it can be inferred, switch to a faster method of parsing them.
In some cases this can increase the parsing speed by ~5-10x.
In some cases this can increase the parsing speed by ~5-10x. If subsequent
datetime strings do not follow the inferred format, parsing will fall
back to the slower method of determining the format for each
string individually.
origin : scalar, default 'unix'
Define the reference date. The numeric values would be parsed as number
of units (defined by `unit`) since this reference date.
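
The fallback described in the `infer_datetime_format` text above can be illustrated with a small standard-library sketch. This is not the pandas implementation: `CANDIDATE_FORMATS`, `guess_format`, `parse_one`, and `parse_all` are hypothetical names made up for the example, `None` stands in for `NaT`, and the format list is an assumption. It only shows the idea of trying one inferred format for the whole input and, if any string does not follow it, parsing each string individually.

```python
from datetime import datetime

# Hypothetical candidate formats (an assumption for this sketch);
# pandas infers formats in a different, more general way.
CANDIDATE_FORMATS = ("%m/%d/%Y", "%m %d %Y", "%Y-%m-%d")


def guess_format(text):
    """Return the first candidate format that parses `text`, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            continue
        return fmt
    return None


def parse_one(text, errors):
    """Slow path: determine the format for a single string."""
    fmt = guess_format(text)
    if fmt is not None:
        return datetime.strptime(text, fmt)
    if errors == "coerce":
        return None  # stand-in for NaT
    raise ValueError(f"Unknown string format: {text}")


def parse_all(values, errors="raise"):
    """Fast path with one inferred format, falling back per element."""
    fmt = guess_format(values[0]) if values else None
    if fmt is not None:
        try:
            return [datetime.strptime(v, fmt) for v in values]
        except ValueError:
            pass  # at least one string does not follow the inferred format
    return [parse_one(v, errors) for v in values]


# Mixed formats are both parsed, regardless of the `errors` setting.
mixed = parse_all(["6/30/2025", "1 27 2024"], errors="coerce")
# An unparseable string becomes None only when errors="coerce".
invalid = parse_all(["1/1/2000", "Hello"], errors="coerce")
```

The point of the sketch is that the fallback runs for both `errors="raise"` and `errors="coerce"`; only the handling of a string that no format can parse differs between the two.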
39 changes: 39 additions & 0 deletions pandas/tests/arrays/test_datetimes.py
@@ -3,11 +3,13 @@
"""
import operator

from dateutil.parser._parser import ParserError
import numpy as np
import pytest

from pandas._libs.tslibs import tz_compare
from pandas._libs.tslibs.dtypes import NpyDatetimeUnit
from pandas.errors import OutOfBoundsDatetime

from pandas.core.dtypes.dtypes import DatetimeTZDtype

@@ -639,3 +641,40 @@ def test_tz_localize_t2d(self):

        roundtrip = expected.tz_localize("US/Pacific")
        tm.assert_datetime_array_equal(roundtrip, dta)

    @pytest.mark.parametrize(
        "error",
        ["coerce", "raise"],
    )
    def test_coerce_fallback(self, error):
        # GH#46071
        s = pd.Series(["6/30/2025", "1 27 2024"])
        expected = pd.Series(
            [pd.Timestamp("2025-06-30 00:00:00"), pd.Timestamp("2024-01-27 00:00:00")]
        )

        result = pd.to_datetime(s, errors=error, infer_datetime_format=True)

        if error == "coerce":
            assert result[1] is not pd.NaT

        tm.assert_series_equal(expected, result)

        expected2 = pd.Series([pd.Timestamp("2000-01-01 00:00:00"), pd.NaT])

        es1 = pd.Series(["1/1/2000", "7/12/1200"])
        es2 = pd.Series(["1/1/2000", "Hello"])
Member

Maybe instead of "Hello" use "Invalid date" or something like that, so it's easier to understand what we're doing here.


        if error == "coerce":
Member

Thanks for the fixes @srotondo. This test is a bit complex and not very easy to read. I think we could make things more readable if we, for example, split this into different tests. Something like:

  • test_to_datetime_infer_datetime_format_with_different_formats
  • test_to_datetime_infer_datetime_format_with_nat
  • test_to_datetime_infer_datetime_format_with_out_of_bounds
  • test_to_datetime_infer_datetime_format_with_invalid_date

What do you think? Then each test would be quite easy to read. Or maybe by parametrizing the input and the two outputs (coerce and raise). This would avoid a bit of repeated code and be quite concise, but probably not so readable.

See what makes sense to you, but it would be good to make things as simple as possible, as the pandas codebase is already huge and too complex.
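
As an illustration only, one of the split tests suggested above might look roughly like the sketch below. It is not code from this PR: the function names are hypothetical, it uses "Invalid date" as the unparseable input, and it deliberately omits `infer_datetime_format` and checks only for `ValueError` (an assumption made so the sketch does not depend on version-specific error types or messages).

```python
import pandas as pd
import pytest


def test_to_datetime_with_invalid_date_coerce():
    # With errors="coerce", an unparseable string becomes NaT.
    ser = pd.Series(["1/1/2000", "Invalid date"])
    result = pd.to_datetime(ser, errors="coerce")
    assert result[0] == pd.Timestamp("2000-01-01")
    assert pd.isna(result[1])


def test_to_datetime_with_invalid_date_raise():
    # With errors="raise", the same input raises instead.
    ser = pd.Series(["1/1/2000", "Invalid date"])
    with pytest.raises(ValueError):
        pd.to_datetime(ser, errors="raise")
```

Each function covers a single scenario, so a failure points directly at either the coerce path or the raise path.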

Contributor Author

@datapythonista Thank you for your feedback on the tests. Because the cases are all closely related, I think it will be easier to understand as one test, but I simplified some of the test and added more comments to clarify parts of it.

            eres1 = pd.to_datetime(es1, errors=error, infer_datetime_format=True)
            eres2 = pd.to_datetime(es2, errors=error, infer_datetime_format=True)
            tm.assert_series_equal(expected2, eres1)
            tm.assert_series_equal(expected2, eres2)
        else:
            with pytest.raises(
                OutOfBoundsDatetime, match="Out of bounds nanosecond timestamp"
            ):
                pd.to_datetime(es1, errors=error, infer_datetime_format=True)

            with pytest.raises(ParserError, match="Unknown string format: Hello"):
                pd.to_datetime(es2, errors=error, infer_datetime_format=True)