Skip to content

BUG: out-of-bounds pydatetime with errors='coerce' and non-ISO format coerces all elements to NaT #50265

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 20, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -794,6 +794,7 @@ Datetimelike
- Bug in :func:`to_datetime` was raising ``ValueError`` when parsing empty string and non-ISO8601 format was passed. Now, empty strings will be parsed as :class:`NaT`, for compatibility with how is done for ISO8601 formats (:issue:`50251`)
- Bug in :class:`Timestamp` was showing ``UserWarning``, which was not actionable by users, when parsing non-ISO8601 delimited date strings (:issue:`50232`)
- Bug in :func:`to_datetime` was showing misleading ``ValueError`` when parsing dates with format containing ISO week directive and ISO weekday directive (:issue:`50308`)
- Bug in :func:`to_datetime` was not raising ``ValueError`` when invalid format was passed and ``errors`` was ``'ignore'`` or ``'coerce'`` (:issue:`50266`)
-

Timedelta
Expand Down
27 changes: 8 additions & 19 deletions pandas/_libs/tslibs/strptime.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@ from pandas._libs.tslibs.np_datetime cimport (
pydate_to_dt64,
pydatetime_to_dt64,
)
from pandas._libs.tslibs.np_datetime import OutOfBoundsDatetime
from pandas._libs.tslibs.timestamps cimport _Timestamp
from pandas._libs.util cimport (
is_datetime64_object,
Expand Down Expand Up @@ -188,6 +189,7 @@ def array_strptime(

for i in range(n):
val = values[i]
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the code inside this try/except just indenting whats currently there or are there other changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's basically just that, but I also removed some

                if is_coerce:
                    iresult[i] = NPY_NAT
                    continue

as that already happens in the outer try-except

If you hide whitespace it'll be easier to see - you can also go to https://github.com/pandas-dev/pandas/pull/50265/files?diff=unified&w=1 (notice the ?diff=unified&w=1 at the end) to see it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I'm not a big fan of try/excepts with a huge block of code inside, but I guess since the failures modes here should be made consistent with the broader to_datetime(errors=) keyword I suppose this is fine.

Question, sometime we are issues some ValueErrors internally in this function that will be caught by this try except e.g.

            if iso_year != -1:
                if iso_week == -1 or weekday == -1:
                    raise ValueError("ISO year directive '%G' must be used with "
                                     "the ISO week directive '%V' and a weekday "
                                     "directive '%A', '%a', '%w', or '%u'.")

Is this necessarily a "parsing" error that should be influenced by the errors keyword?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can rewrite it such that each check_dts_bounds is within its own try-except - maybe that's better actually, thanks

I'll get back to you on that ValueError

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, refactored

That's a really good point about the "ISO year directive" error, thanks, good catch! So, I think that one should always raise. At least, that seems to be the current intention, there's no if is_coerce check there like there is for some other errors, such as

except ValueError:
if is_coerce:
iresult[i] = NPY_NAT
continue
raise

So, I've parametrised test_error_iso_week_year too over errors to check that this one actually raises

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not actually opposed to the large try/except block if we expect ValueError and OutofBounds Errors to be affected by errors. If other excepts in this functions can be factored out (like the iso year directive one) and others then it would help ensure errors is consistently handled

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks - have factored out the iso year directive check in #50309, so I've gone back to having an outer try-except block (so again, "hide-whitespace" will come in handy when reviewing)

if isinstance(val, str):
if len(val) == 0 or val in nat_strings:
iresult[i] = NPY_NAT
Expand Down Expand Up @@ -234,24 +236,15 @@ def array_strptime(
if exact:
found = format_regex.match(val)
if not found:
if is_coerce:
iresult[i] = NPY_NAT
continue
raise ValueError(f"time data '{val}' does not match "
f"format '{fmt}' (match)")
if len(val) != found.end():
if is_coerce:
iresult[i] = NPY_NAT
continue
raise ValueError(f"unconverted data remains: {val[found.end():]}")

# search
else:
found = format_regex.search(val)
if not found:
if is_coerce:
iresult[i] = NPY_NAT
continue
raise ValueError(f"time data {repr(val)} does not match format "
f"{repr(fmt)} (search)")

Expand Down Expand Up @@ -373,7 +366,6 @@ def array_strptime(
# Cannot pre-calculate date() since can change in Julian
# calculation and thus could have different value for the day of the wk
# calculation.
try:
if julian == -1:
# Need to add 1 to result since first day of the year is 1, not
# 0.
Expand All @@ -387,11 +379,6 @@ def array_strptime(
year = datetime_result.year
month = datetime_result.month
day = datetime_result.day
except ValueError:
if is_coerce:
iresult[i] = NPY_NAT
continue
raise
if weekday == -1:
weekday = date(year, month, day).weekday()

Expand All @@ -405,15 +392,17 @@ def array_strptime(
dts.ps = ns * 1000

iresult[i] = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts)
try:
check_dts_bounds(&dts)
except ValueError:

result_timezone[i] = tz

except (ValueError, OutOfBoundsDatetime):
if is_coerce:
iresult[i] = NPY_NAT
continue
elif is_raise:
raise

result_timezone[i] = tz
return values, []

return result, result_timezone.base

Expand Down
24 changes: 1 addition & 23 deletions pandas/core/tools/datetimes.py
Original file line number Diff line number Diff line change
Expand Up @@ -483,29 +483,7 @@ def _array_strptime_with_fallback(
"""
Call array_strptime, with fallback behavior depending on 'errors'.
"""
try:
result, timezones = array_strptime(
arg, fmt, exact=exact, errors=errors, utc=utc
)
except OutOfBoundsDatetime:
if errors == "raise":
raise
if errors == "coerce":
result = np.empty(arg.shape, dtype="M8[ns]")
iresult = result.view("i8")
iresult.fill(iNaT)
else:
result = arg
except ValueError:
if errors == "raise":
raise
if errors == "coerce":
result = np.empty(arg.shape, dtype="M8[ns]")
iresult = result.view("i8")
iresult.fill(iNaT)
else:
result = arg
else:
result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors, utc=utc)
if any(tz is not None for tz in timezones):
return _return_parsed_timezone_results(result, timezones, utc, name)

Expand Down
50 changes: 41 additions & 9 deletions pandas/tests/tools/test_to_datetime.py
Original file line number Diff line number Diff line change
Expand Up @@ -744,8 +744,9 @@ def test_to_datetime_iso_week_year_format(self, s, _format, dt):
],
],
)
def test_error_iso_week_year(self, msg, s, _format):
# See GH#16607, #50308
@pytest.mark.parametrize("errors", ["raise", "coerce", "ignore"])
def test_error_iso_week_year(self, msg, s, _format, errors):
# See GH#16607, GH#50308
# This test checks for errors thrown when giving the wrong format
# However, as discussed on PR#25541, overriding the locale
# causes a different error to be thrown due to the format being
Expand All @@ -757,7 +758,7 @@ def test_error_iso_week_year(self, msg, s, _format):
"UTF-8",
):
with pytest.raises(ValueError, match=msg):
to_datetime(s, format=_format)
to_datetime(s, format=_format, errors=errors)

@pytest.mark.parametrize("tz", [None, "US/Central"])
def test_to_datetime_dtarr(self, tz):
Expand Down Expand Up @@ -1109,9 +1110,17 @@ def test_datetime_invalid_datatype(self, arg):
with pytest.raises(TypeError, match=msg):
to_datetime(arg)

@pytest.mark.parametrize("errors", ["coerce", "raise", "ignore"])
def test_invalid_format_raises(self, errors):
# https://github.com/pandas-dev/pandas/issues/50255
with pytest.raises(
ValueError, match="':' is a bad directive in format 'H%:M%:S%"
):
to_datetime(["00:00:00"], format="H%:M%:S%", errors=errors)

@pytest.mark.parametrize("value", ["a", "00:01:99"])
@pytest.mark.parametrize(
"format,warning", [(None, UserWarning), ("H%:M%:S%", None)]
"format,warning", [(None, UserWarning), ("%H:%M:%S", None)]
)
def test_datetime_invalid_scalar(self, value, format, warning):
Comment on lines -1114 to 1125
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test claims to be testing invalid scalars, but in reality it was erroring before the parsing even started because "H%:M%:S%" is an invalid format

So, I've changed the format to reflect what it looks like this test was meant to be testing, and have made a separate test above to check that invalid formats raise

# GH24763
Expand All @@ -1124,7 +1133,8 @@ def test_datetime_invalid_scalar(self, value, format, warning):
assert res is NaT

msg = (
"is a bad directive in format|"
"does not match format|"
"unconverted data remains:|"
"second must be in 0..59|"
f"Given date string {value} not likely a datetime"
)
Expand All @@ -1134,7 +1144,7 @@ def test_datetime_invalid_scalar(self, value, format, warning):

@pytest.mark.parametrize("value", ["3000/12/11 00:00:00"])
@pytest.mark.parametrize(
"format,warning", [(None, UserWarning), ("H%:M%:S%", None)]
"format,warning", [(None, UserWarning), ("%H:%M:%S", None)]
)
def test_datetime_outofbounds_scalar(self, value, format, warning):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above (this errors before even reaching the out-of-bounds scalar...)

# GH24763
Expand All @@ -1147,7 +1157,7 @@ def test_datetime_outofbounds_scalar(self, value, format, warning):
assert res is NaT

if format is not None:
msg = "is a bad directive in format|Out of bounds .* present at position 0"
msg = "does not match format|Out of bounds .* present at position 0"
with pytest.raises(ValueError, match=msg):
to_datetime(value, errors="raise", format=format)
else:
Expand All @@ -1159,7 +1169,7 @@ def test_datetime_outofbounds_scalar(self, value, format, warning):

@pytest.mark.parametrize("values", [["a"], ["00:01:99"], ["a", "b", "99:00:00"]])
@pytest.mark.parametrize(
"format,warning", [(None, UserWarning), ("H%:M%:S%", None)]
"format,warning", [(None, UserWarning), ("%H:%M:%S", None)]
)
def test_datetime_invalid_index(self, values, format, warning):
# GH24763
Expand All @@ -1172,7 +1182,8 @@ def test_datetime_invalid_index(self, values, format, warning):
tm.assert_index_equal(res, DatetimeIndex([NaT] * len(values)))

msg = (
"is a bad directive in format|"
"does not match format|"
"unconverted data remains:|"
f"Given date string {values[0]} not likely a datetime|"
"second must be in 0..59"
)
Expand Down Expand Up @@ -1312,6 +1323,27 @@ def test_to_datetime_coerce(self):
)
tm.assert_index_equal(result, expected)

@pytest.mark.parametrize(
"string_arg, format",
[("March 1, 2018", "%B %d, %Y"), ("2018-03-01", "%Y-%m-%d")],
)
@pytest.mark.parametrize(
"outofbounds",
[
datetime(9999, 1, 1),
date(9999, 1, 1),
np.datetime64("9999-01-01"),
"January 1, 9999",
"9999-01-01",
],
)
def test_to_datetime_coerce_oob(self, string_arg, format, outofbounds):
# https://github.com/pandas-dev/pandas/issues/50255
ts_strings = [string_arg, outofbounds]
result = to_datetime(ts_strings, errors="coerce", format=format)
expected = DatetimeIndex([datetime(2018, 3, 1), NaT])
tm.assert_index_equal(result, expected)

@pytest.mark.parametrize(
"errors, expected",
[
Expand Down