Skip to content

BUG: Behavior with fallback between raise and coerce #46071 #47745

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from
Closed
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.5.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -902,6 +902,7 @@ Datetimelike
- Bug in :meth:`DatetimeIndex.resolution` incorrectly returning "day" instead of "nanosecond" for nanosecond-resolution indexes (:issue:`46903`)
- Bug in :class:`Timestamp` with an integer or float value and ``unit="Y"`` or ``unit="M"`` giving slightly-wrong results (:issue:`47266`)
- Bug in :class:`.DatetimeArray` construction when passed another :class:`.DatetimeArray` and ``freq=None`` incorrectly inferring the freq from the given array (:issue:`47296`)
- Bug in :func:`to_datetime` where ``infer_datetime_format`` fallback would not run if ``errors=coerce`` (:issue:`46071`)
- Bug in :func:`to_datetime` where ``OutOfBoundsDatetime`` would be thrown even if ``errors=coerce`` if there were more than 50 rows (:issue:`45319`)
- Bug when adding a :class:`DateOffset` to a :class:`Series` would not add the ``nanoseconds`` field (:issue:`47856`)
-
Expand Down
9 changes: 8 additions & 1 deletion pandas/core/tools/datetimes.py
Original file line number Diff line number Diff line change
Expand Up @@ -501,6 +501,10 @@ def _array_strptime_with_fallback(
if "%Z" in fmt or "%z" in fmt:
return _return_parsed_timezone_results(result, timezones, tz, name)

if infer_datetime_format and np.isnan(result).any():
# Indicates to the caller to fallback to objects_to_datetime64ns
return None

return _box_as_indexlike(result, utc=utc, name=name)


Expand Down Expand Up @@ -798,7 +802,10 @@ def to_datetime(
If :const:`True` and no `format` is given, attempt to infer the format
of the datetime strings based on the first non-NaN element,
and if it can be inferred, switch to a faster method of parsing them.
In some cases this can increase the parsing speed by ~5-10x.
In some cases this can increase the parsing speed by ~5-10x. If subsequent
datetime strings do not follow the inferred format, parsing will fall
back to the slower method of determining the format for each
string individually.
origin : scalar, default 'unix'
Define the reference date. The numeric values would be parsed as number
of units (defined by `unit`) since this reference date.
Expand Down
46 changes: 46 additions & 0 deletions pandas/tests/arrays/test_datetimes.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,13 @@
"""
import operator

from dateutil.parser._parser import ParserError
import numpy as np
import pytest

from pandas._libs.tslibs import tz_compare
from pandas._libs.tslibs.dtypes import NpyDatetimeUnit
from pandas.errors import OutOfBoundsDatetime

from pandas.core.dtypes.dtypes import DatetimeTZDtype

Expand Down Expand Up @@ -639,3 +641,47 @@ def test_tz_localize_t2d(self):

roundtrip = expected.tz_localize("US/Pacific")
tm.assert_datetime_array_equal(roundtrip, dta)

@pytest.mark.parametrize(
"error",
["coerce", "raise"],
)
def test_fallback_different_formats(self, error):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test duplicative of

@pytest.mark.parametrize(
"data",
[
["01/01/2011 00:00:00", "01-02-2011 00:00:00", "2011-01-03T00:00:00"],
["Jan/01/2011", "Feb/01/2011", "Mar/01/2011"],
],
)
def test_to_datetime_infer_datetime_format_inconsistent_format(self, cache, data):
ser = Series(np.array(data))
# When the format is inconsistent, infer_datetime_format should just
# fallback to the default parsing
tm.assert_series_equal(
to_datetime(ser, infer_datetime_format=False, cache=cache),
to_datetime(ser, infer_datetime_format=True, cache=cache),
)

?

If so, why not just add

    @pytest.mark.parametrize('errors', ['coerce', 'raise'])

to that test?

# GH#46071
# 2 valid dates with different formats
# Should parse with no errors
s = pd.Series(["6/30/2025", "1 27 2024"])
expected = pd.Series(
[pd.Timestamp("2025-06-30 00:00:00"), pd.Timestamp("2024-01-27 00:00:00")]
)
result = pd.to_datetime(s, errors=error, infer_datetime_format=True)
tm.assert_series_equal(expected, result)

@pytest.mark.parametrize(
"dateseries",
[
pd.Series(["1/1/2000", "7/12/1200"]),
pd.Series(["1/1/2000", "Invalid input"]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this duplicative of

@pytest.mark.parametrize("values", [["a"], ["00:01:99"], ["a", "b", "99:00:00"]])
@pytest.mark.parametrize("infer", [True, False])
@pytest.mark.parametrize("format", [None, "H%:M%:S%"])
def test_datetime_invalid_index(self, values, format, infer):
# GH24763
res = to_datetime(
values, errors="ignore", format=format, infer_datetime_format=infer
)
tm.assert_index_equal(res, Index(values))
res = to_datetime(
values, errors="coerce", format=format, infer_datetime_format=infer
)
tm.assert_index_equal(res, DatetimeIndex([NaT] * len(values)))
msg = (
"is a bad directive in format|"
f"Given date string {values[0]} not likely a datetime|"
"second must be in 0..59"
)
with pytest.raises(ValueError, match=msg):
to_datetime(
values, errors="raise", format=format, infer_datetime_format=infer
)

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's a lot of comments there, so I'm going to try to answer all these questions here in this response here. Please tell me if it's not clear which comment I'm talking about at any given point here.

I was asked about these last 4 comments before, so I'll link my previous response here so you can see it in case you missed it.

It might look like some of these tests are reimplementations, but there are some key differences in the cases that are being tested. Given the line number you linked in that test, I assume you are mainly referring to these 3 tests in that file: test_datetime_invalid_scalar, test_datetime_outofbounds_scalar, and test_datetime_invalid_index. While they do share a few similarities, there are some important key differences between these tests and my added tests.

To start with, test_datetime_invalid_scalar and test_datetime_outofbounds_scalar only deal in scalar values, unlike my tests, which deal with lists. The difference from test_datetime_invalid_index lies largely in the data being tested. The list used in test_datetime_invalid_index is entirely made up of values that will not parse, while the lists in my tests always start with a valid date. This is important because infer_datetime_format only becomes relevant on the value after the first valid date, since it would need a format already to infer from. Therefore this test doesn't really check the fallback code as my tests are meant to.

In terms of your comments about the code changes and the first test comment, I agree with those changes and I can do that for sure. Also, in terms of test location, I'm fully open to changing that, since the choice I made in location was largely arbitrary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining and for your attention to detail, that's very clear! Apologies for having missed that, I'd looked through in-line comments but had missed this one in the main thread

For test location - does pandas/tests/arrays/test_datetimes.py::TestToDatetimeInferFormat look good to you?

Copy link
Member

@MarcoGorelli MarcoGorelli Sep 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srotondo have you had a look at #48621?

First, because you've had a close look at this part of the code so I'd really value your input

Second, because if we did go ahead with that, then it'd make the work in this PR unnecessary

Copy link
Contributor Author

@srotondo srotondo Sep 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For test location - does pandas/tests/arrays/test_datetimes.py::TestToDatetimeInferFormat look good to you?

To clarify, is this a new section in test_datetimes.py that you are suggesting, because as far as I can tell, this doesn't actually exist yet. Also, if we're going to keep the tests in this file, I guess it doesn't make much sense to change that test you mentioned at the very start since it's in a different file.

Copy link
Member

@MarcoGorelli MarcoGorelli Sep 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to add new tests to the class which

@pytest.mark.parametrize(
"data",
[
["01/01/2011 00:00:00", "01-02-2011 00:00:00", "2011-01-03T00:00:00"],
["Jan/01/2011", "Feb/01/2011", "Mar/01/2011"],
],
)
def test_to_datetime_infer_datetime_format_inconsistent_format(self, cache, data):
ser = Series(np.array(data))
# When the format is inconsistent, infer_datetime_format should just
# fallback to the default parsing
tm.assert_series_equal(
to_datetime(ser, infer_datetime_format=False, cache=cache),
to_datetime(ser, infer_datetime_format=True, cache=cache),
)

belongs to (TestToDatetimeInferFormat)

],
)
def test_fallback_with_errors_coerce(self, dateseries):
# GH#46071
# Invalid inputs
# Parsing should fail for the second element
expected = pd.Series([pd.Timestamp("2000-01-01 00:00:00"), pd.NaT])
result = pd.to_datetime(dateseries, errors="coerce", infer_datetime_format=True)
tm.assert_series_equal(expected, result)

def test_fallback_with_errors_raise(self):
# GH#46071
# Invalid inputs
# Parsing should fail for the second element
dates1 = pd.Series(["1/1/2000", "7/12/1200"])
with pytest.raises(
OutOfBoundsDatetime, match="Out of bounds nanosecond timestamp"
):
pd.to_datetime(dates1, errors="raise", infer_datetime_format=True)

dates2 = pd.Series(["1/1/2000", "Invalid input"])
with pytest.raises(ParserError, match="Unknown string format: Invalid input"):
pd.to_datetime(dates2, errors="raise", infer_datetime_format=True)