BUG: Behavior with fallback between raise and coerce #46071 #47745

srotondo · 2022-07-15T23:50:02Z

closes BUG: to_datetime does not raise when errors='raise'. #46071 (Replace xxxx with the Github issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

The problem reported by this issue is that success in parsing a date string depends on how errors are reported (coerce vs raise). How the error is reported should not affect whether or not a parsing error occurs. Initially it appears that experiment 1 should return an error (as experiment 3 does) because the second date string does not match the format established by the first string. However, there is a fallback mechanism that is supposed to try parsing again without the effect of the infer_datetime_format flag. So in fact, experiment 3 should succeed in parsing the date as experiment 1 does.

The problem is that the fallback mechanism only works when errors=raise. This bug fix applies the same logic when errors=coerce so that the fallback parsing will take place regardless of how errors are reported.

…#46071

datapythonista · 2022-07-21T04:50:28Z

Thanks for the PR @srotondo.

Do you mind reverting the unrelated changes (removing or adding blank lines) please? Also, we need to add a test that captures the problem, fails with the original code, and passes after your changes. The issue number, better have it as a comment in the test, not in the code.

Also, we need a release note with the change. You can have a look at any merged bug PRs as a reference.

…6071

srotondo · 2022-08-03T19:42:22Z

@datapythonista I've added the test and the release note to this code, is there anything else you would like me to do?

datapythonista

Sorry for the delay, and thanks for the changes @srotondo. Can you have a look at the comment please?

pandas/tests/arrays/test_datetimes.py

scott-r · 2022-08-10T19:48:43Z

The fallback behavior is apparent in the code but not mentioned in the documentation for infer_datetime_format.

infer_datetime_format: bool, default False
If True and no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.

I think it would be accurate to add one more sentence at the end:

If subsequent datetime strings do not follow the inferred format, parsing will fall back to the slower method of determining the format for each string individually.

That should be sufficient to remove the expectation that an error will be raised without committing to an exact description of the fallback parsing.

srotondo · 2022-08-16T04:45:38Z

@datapythonista I believe I have made the appropriate changes. Is there anything else?

datapythonista

Thanks @srotondo, this makes more sense. I suggest some changes to make code more readable, see what you can do, and I'll review again when it's easier to see every case being tested.

datapythonista · 2022-08-17T08:09:14Z

pandas/tests/arrays/test_datetimes.py

+        es1 = pd.Series(["1/1/2000", "7/12/1200"])
+        es2 = pd.Series(["1/1/2000", "Hello"])
+
+        if error == "coerce":


Thanks for the fixes @srotondo. This test is a bit complex and not very easy to read. I think we could make things more readable if we have if for example split this in different tests. Something like:

test_to_datetime_infer_datetime_format_with_different_formats

test_to_datetime_infer_datetime_format_with_nat

test_to_datetime_infer_datetime_format_with_out_of_bounds

test_to_datetime_infer_datetime_format_with_invalid_date

What do you think? Then each test would be quite easy to read. Or maybe by parametrizing the input and the two outputs (coerce and raise). This will avoid a bit of unrepeated code and be quite concise, but probably not so readable.

See what makes sense to you, but would be good to make things as simple as possible, as the pandas codebase is already huge and too complex.

@datapythonista thank you for your feedback on the tests. Because the cases are all closely related, I think it will be easier to understand as one test, but I did simplify some of the test and added more comments to clarify parts of the test.

datapythonista · 2022-08-17T08:09:45Z

pandas/tests/arrays/test_datetimes.py

+        expected2 = pd.Series([pd.Timestamp("2000-01-01 00:00:00"), pd.NaT])
+
+        es1 = pd.Series(["1/1/2000", "7/12/1200"])
+        es2 = pd.Series(["1/1/2000", "Hello"])


Maybe instead of Hello use Invalid date or something like that, so it's easier to understand what we're doing here.

…te-bug

srotondo · 2022-08-29T17:31:21Z

@datapythonista I believe that I have made all the changes you have requested, including simplifying the tests. Is there anything else you would like me to do?

datapythonista

Sorry, I continue to find this test too hard to read. And very hard to understand what's wrong when it fails. I need to insist on using good testing practices, and restructure in a way that make tests atomic and simple. Like not having if statements, not having more than one expected value (unless using parametrization)...

If you don't find a better, way, using the previously suggested tests is a good start.

srotondo · 2022-09-02T20:44:27Z

@datapythonista I've split up the test into multiple pieces like you asked.

datapythonista · 2022-09-06T14:15:55Z

Thanks @srotondo, this is more readable now.

But looks like you're somehow reimplementing the tests in here, no? Would be good to start by adding the exact case or test that your fix addresses, which is not immediately clear to me. Thanks!

srotondo · 2022-09-06T17:07:23Z

But looks like you're somehow reimplementing the tests in here, no? Would be good to start by adding the exact case or test that your fix addresses, which is not immediately clear to me. Thanks!

@datapythonista It might look like some of these tests are reimplementations, but there are some key differences in the cases that are being tested. Given the line number you linked in that test, I assume you are mainly referring to these 3 tests in that file: test_datetime_invalid_scalar, test_datetime_outofbounds_scalar, and test_datetime_invalid_index. While they do share a few similarities, there are some important key differences between these tests and my added tests.

To start with, test_datetime_invalid_scalar and test_datetime_outofbounds_scalar only deal in scalar values, unlike my tests, which deal with lists. The difference from test_datetime_invalid_index lies largely in the data being tested. The list used in test_datetime_invalid_index is entirely made up of values that will not parse, while the lists in my tests always start with a valid date. This is important because infer_datetime_format only becomes relevant on the value after the first valid date, since it would need a format already to infer from. Therefore this test doesn't really check the fallback code as my tests are meant to.

datapythonista · 2022-09-06T18:00:48Z

Thanks @srotondo that makes sense. Is there a reason why the tests you added are not next to the ones I mentioned? Seems like they should be next to each other. Or, not sure if it could make sense to add an extra assert in those lines to check the same cases when the invalid values are in a Series following a valid value.

And it's still not clear to me what's the exact case that you're fixing. Or are all the tests that you added failing without the change in the code function?

srotondo · 2022-09-06T20:52:59Z

@datapythonista The true failing case is that when 2 valid dates with different formats were passed in using the errors = "coerce" option and infer_datetime_format = True, to_datetime would return NaT for the second value, when it should correctly parse the second value. The second and third tests I wrote ensure that coerce still returns NaT when it should return NaT and that the cases where errors = "raise" still raises exceptions when it should, respectively.

The choice of file for this test is mostly arbitrary, and I can move it if you'd like. However, could you please tell me if there is anything else you would like me to change as well so that I can do all of it at the same time and we can get this bugfix merged.

MarcoGorelli

I think it would be accurate to add one more sentence at the end:

If subsequent datetime strings do not follow the inferred format, parsing will fall back to the slower method of determining the format for each string individually.

Hey - first, sorry to block this, and thanks for your work here

Just wanted to ask whether this is really what would be best - why not make infer_datetime_format strict:

if the format is inferred, and some row doesn't conform to it, it just errors or coerces, rather than falling back to the slower path

This would involve removing the special-casing here

pandas/pandas/core/tools/datetimes.py

Lines 488 to 499 in ac648ee

    
           if not infer_datetime_format: 
        
               if errors == "raise": 
        
                   raise 
        
               elif errors == "coerce": 
        
                   result = np.empty(arg.shape, dtype="M8[ns]") 
        
                   iresult = result.view("i8") 
        
                   iresult.fill(iNaT) 
        
               else: 
        
                   result = arg 
        
           else: 
        
               # Indicates to the caller to fallback to objects_to_datetime64ns 
        
               return None

and just have

        if errors == "raise":
            raise
        elif errors == "coerce":
            result = np.empty(arg.shape, dtype="M8[ns]")
            iresult = result.view("i8")
            iresult.fill(iNaT)
        else:
            result = arg

This would also solve the linked issue, and rather than adding a new if-then statement, it'd remove one

srotondo · 2022-09-16T20:09:38Z

Just wanted to ask whether this is really what would be best - why not make infer_datetime_format strict:

if the format is inferred, and some row doesn't conform to it, it just errors or coerces, rather than falling back to the slower path

@MarcoGorelli There exists large amounts of code with the express purpose of causing the operation to fall back if the formats aren't the same. The original documentation seems to strongly suggest that infer_datetime_format exists to cause parsing to speed up whenever possible, rather than causing otherwise valid dates to fail to parse. Also, given how many different ways a date string can be written, forcing all inputs to to_datetime to be the same format doesn't seem like the best course of action to me. I think that making this operation strict would go against the intended functionality.

MarcoGorelli · 2022-09-16T20:48:38Z

Currently, the docs don't specify what happens if the format can be inferred, but subsequent rows don't match that format

And users don't expect the format to change midway, especially when using this option, e.g. as in #46210

cc @mroeschke @jreback @jorisvandenbossche if you have comments

TL;DR: I'm suggesting that when using infer_datetime_format=True, the format detected from the first non-NaN value should be used to parse the rest of the Series, exactly as if the user had passed it to format=

Example
Concretely, I think the following two should behave the same:

pd.to_datetime(['01-31-2000', '20-01-2000'], infer_datetime_format=True, errors='raise')
pd.to_datetime(['01-31-2000', '20-01-2000'], format='%m-%d-%Y', errors='raise')

Currently, the latter raises (as expected) whereas the former swaps format midway

srotondo · 2022-09-16T21:45:06Z

@MarcoGorelli Even though it's not very well documented, looking at the code and following the pathway makes it very clear that the fallback is intentional. If you would like to change the behavior, I think that should be a separate PR. This code change deals with the fact that raise and coerce were behaving inconsistently with infer_datetime_format, which has been fixed. I think that removing the fallback behavior altogether is a much larger and separate issue.

MarcoGorelli

Hey - just had a closer look, and indeed the user guide does explicitly note:

pandas will fallback to the usual parsing if either the format cannot be guessed or the format that was guessed cannot properly parse the entire column of strings

That's fine then, I'll open a separate issue about changing that

MarcoGorelli

The fix itself looks good to me, thanks! I just have some comments about the tests:

is the location right? Wouldn't /pandas/tests/tools/test_to_datetime.py be better?
couldn't we just cover all these new tests by parametrising over errors in /pandas/tests/tools/test_to_datetime.py::TestToDatetimeInferFormat::test_to_datetime_infer_datetime_format_inconsistent_format ?

Apologies for how long reviews are taking, I'll do my best to help push this across the line now

pandas/core/tools/datetimes.py

MarcoGorelli · 2022-09-17T09:57:59Z

pandas/tests/arrays/test_datetimes.py

+        "error",
+        ["coerce", "raise"],
+    )
+    def test_fallback_different_formats(self, error):


Is this test duplicative of

pandas/pandas/tests/tools/test_to_datetime.py

Lines 2110 to 2125 in a712c50

@pytest.mark.parametrize(

"data",

[

["01/01/2011 00:00:00", "01-02-2011 00:00:00", "2011-01-03T00:00:00"],

["Jan/01/2011", "Feb/01/2011", "Mar/01/2011"],

],

)

def test_to_datetime_infer_datetime_format_inconsistent_format(self, cache, data):

ser = Series(np.array(data))

# When the format is inconsistent, infer_datetime_format should just

# fallback to the default parsing

tm.assert_series_equal(

to_datetime(ser, infer_datetime_format=False, cache=cache),

to_datetime(ser, infer_datetime_format=True, cache=cache),

)

?

If so, why not just add

@pytest.mark.parametrize('errors', ['coerce', 'raise'])

to that test?

pandas/tests/arrays/test_datetimes.py

MarcoGorelli · 2022-09-17T10:16:34Z

pandas/tests/arrays/test_datetimes.py

+        "dateseries",
+        [
+            pd.Series(["1/1/2000", "7/12/1200"]),
+            pd.Series(["1/1/2000", "Invalid input"]),


isn't this duplicative of

pandas/pandas/tests/tools/test_to_datetime.py

Lines 989 to 1012 in a712c50

@pytest.mark.parametrize("values", [["a"], ["00:01:99"], ["a", "b", "99:00:00"]])

@pytest.mark.parametrize("infer", [True, False])

@pytest.mark.parametrize("format", [None, "H%:M%:S%"])

def test_datetime_invalid_index(self, values, format, infer):

# GH24763

res = to_datetime(

values, errors="ignore", format=format, infer_datetime_format=infer

)

tm.assert_index_equal(res, Index(values))

res = to_datetime(

values, errors="coerce", format=format, infer_datetime_format=infer

)

tm.assert_index_equal(res, DatetimeIndex([NaT] * len(values)))

msg = (

"is a bad directive in format|"

f"Given date string {values[0]} not likely a datetime|"

"second must be in 0..59"

)

with pytest.raises(ValueError, match=msg):

to_datetime(

values, errors="raise", format=format, infer_datetime_format=infer

)

?

Ok, that's a lot of comments there, so I'm going to try to answer all these questions here in this response here. Please tell me if it's not clear which comment I'm talking about at any given point here.

I was asked about these last 4 comments before, so I'll link my previous response here so you can see it in case you missed it.

It might look like some of these tests are reimplementations, but there are some key differences in the cases that are being tested. Given the line number you linked in that test, I assume you are mainly referring to these 3 tests in that file: test_datetime_invalid_scalar, test_datetime_outofbounds_scalar, and test_datetime_invalid_index. While they do share a few similarities, there are some important key differences between these tests and my added tests.

To start with, test_datetime_invalid_scalar and test_datetime_outofbounds_scalar only deal in scalar values, unlike my tests, which deal with lists. The difference from test_datetime_invalid_index lies largely in the data being tested. The list used in test_datetime_invalid_index is entirely made up of values that will not parse, while the lists in my tests always start with a valid date. This is important because infer_datetime_format only becomes relevant on the value after the first valid date, since it would need a format already to infer from. Therefore this test doesn't really check the fallback code as my tests are meant to.

In terms of your comments about the code changes and the first test comment, I agree with those changes and I can do that for sure. Also, in terms of test location, I'm fully open to changing that, since the choice I made in location was largely arbitrary.

Thanks for explaining and for your attention to detail, that's very clear! Apologies for having missed that, I'd looked through in-line comments but had missed this one in the main thread

For test location - does pandas/tests/arrays/test_datetimes.py::TestToDatetimeInferFormat look good to you?

@srotondo have you had a look at #48621?

First, because you've had a close look at this part of the code so I'd really value your input

Second, because if we did go ahead with that, then it'd make the work in this PR unnecessary

For test location - does pandas/tests/arrays/test_datetimes.py::TestToDatetimeInferFormat look good to you?

To clarify, is this a new section in test_datetimes.py that you are suggesting, because as far as I can tell, this doesn't actually exist yet. Also, if we're going to keep the tests in this file, I guess it doesn't make much sense to change that test you mentioned at the very start since it's in a different file.

I meant to add new tests to the class which

pandas/pandas/tests/tools/test_to_datetime.py

Lines 2110 to 2125 in a712c50

@pytest.mark.parametrize(

"data",

[

["01/01/2011 00:00:00", "01-02-2011 00:00:00", "2011-01-03T00:00:00"],

["Jan/01/2011", "Feb/01/2011", "Mar/01/2011"],

],

)

def test_to_datetime_infer_datetime_format_inconsistent_format(self, cache, data):

ser = Series(np.array(data))

# When the format is inconsistent, infer_datetime_format should just

# fallback to the default parsing

tm.assert_series_equal(

to_datetime(ser, infer_datetime_format=False, cache=cache),

to_datetime(ser, infer_datetime_format=True, cache=cache),

)

belongs to (TestToDatetimeInferFormat)

Co-authored-by: Marco Edward Gorelli <[email protected]>

MarcoGorelli · 2022-11-28T12:26:13Z

PDEP4 was accepted, and it looks like the implementation is almost in

Let's close this then

Thanks @srotondo for your PR - you've done great work, and I'm sorry to be closing it, I just think a different direction is a better direction for the future of pandas

KingOfTheShow69 added 2 commits July 15, 2022 11:16

BUG: Fixed behavior with fallback between raise and coerce pandas-dev…

0d320e9

…#46071

BUG: Fixed behavior with fallback between raise and coerce pandas-dev…

df08fd1

…#46071

datapythonista added the Bug label Jul 21, 2022

datapythonista added the Error Reporting Incorrect or improved errors from pandas label Jul 21, 2022

KingOfTheShow69 added 2 commits July 21, 2022 15:28

Merge remote-tracking branch 'upstream/main' into date-bug

7c69de8

BUG: Added test and release note and removed line shifts pandas-dev#4…

2ab558a

…6071

datapythonista requested changes Aug 8, 2022

View reviewed changes

pandas/tests/arrays/test_datetimes.py Show resolved Hide resolved

KingOfTheShow69 and others added 3 commits August 11, 2022 09:22

BUG: Edited documentation and improved tests pandas-dev#46071

78e6bc7

Merge branch 'main' into date-bug

3da1585

Merge branch 'main' into date-bug

b53d897

srotondo requested a review from datapythonista August 16, 2022 23:30

datapythonista requested changes Aug 17, 2022

View reviewed changes

KingOfTheShow69 added 2 commits August 17, 2022 14:42

BUG: Edited test structure pandas-dev#46071

ef1f736

Merge branch 'date-bug' of https://github.com/srotondo/pandas into da…

8867090

…te-bug

srotondo requested a review from datapythonista August 17, 2022 23:11

datapythonista requested changes Sep 1, 2022

View reviewed changes

BUG: Split up tests pandas-dev#46071

7347c78

srotondo requested a review from datapythonista September 2, 2022 20:44

MarcoGorelli suggested changes Sep 16, 2022

View reviewed changes

MarcoGorelli reviewed Sep 17, 2022

View reviewed changes

MarcoGorelli mentioned this pull request Sep 17, 2022

ENH: make infer_datetime_format strict #48596

Closed

MarcoGorelli suggested changes Sep 17, 2022

View reviewed changes

Update pandas/core/tools/datetimes.py

3b08c71

Co-authored-by: Marco Edward Gorelli <[email protected]>

MarcoGorelli mentioned this pull request Sep 19, 2022

PDEP-4: consistent parsing of datetimes #48621

Merged

MarcoGorelli closed this Nov 28, 2022

	if not infer_datetime_format:
	if errors == "raise":
	raise
	elif errors == "coerce":
	result = np.empty(arg.shape, dtype="M8[ns]")
	iresult = result.view("i8")
	iresult.fill(iNaT)
	else:
	result = arg
	else:
	# Indicates to the caller to fallback to objects_to_datetime64ns
	return None

	@pytest.mark.parametrize(
	"data",
	[
	["01/01/2011 00:00:00", "01-02-2011 00:00:00", "2011-01-03T00:00:00"],
	["Jan/01/2011", "Feb/01/2011", "Mar/01/2011"],
	],
	)
	def test_to_datetime_infer_datetime_format_inconsistent_format(self, cache, data):
	ser = Series(np.array(data))

	# When the format is inconsistent, infer_datetime_format should just
	# fallback to the default parsing
	tm.assert_series_equal(
	to_datetime(ser, infer_datetime_format=False, cache=cache),
	to_datetime(ser, infer_datetime_format=True, cache=cache),
	)

	@pytest.mark.parametrize("values", [["a"], ["00:01:99"], ["a", "b", "99:00:00"]])
	@pytest.mark.parametrize("infer", [True, False])
	@pytest.mark.parametrize("format", [None, "H%:M%:S%"])
	def test_datetime_invalid_index(self, values, format, infer):
	# GH24763
	res = to_datetime(
	values, errors="ignore", format=format, infer_datetime_format=infer
	)
	tm.assert_index_equal(res, Index(values))

	res = to_datetime(
	values, errors="coerce", format=format, infer_datetime_format=infer
	)
	tm.assert_index_equal(res, DatetimeIndex([NaT] * len(values)))

	msg = (
	"is a bad directive in format\|"
	f"Given date string {values[0]} not likely a datetime\|"
	"second must be in 0..59"
	)
	with pytest.raises(ValueError, match=msg):
	to_datetime(
	values, errors="raise", format=format, infer_datetime_format=infer
	)

Uh oh!

BUG: Behavior with fallback between raise and coerce #46071 #47745

BUG: Behavior with fallback between raise and coerce #46071 #47745

Uh oh!

Conversation

srotondo commented Jul 15, 2022

Uh oh!

datapythonista commented Jul 21, 2022

Uh oh!

srotondo commented Aug 3, 2022

Uh oh!

datapythonista left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

scott-r commented Aug 10, 2022

Uh oh!

srotondo commented Aug 16, 2022

Uh oh!

datapythonista left a comment

Choose a reason for hiding this comment

Uh oh!

datapythonista Aug 17, 2022

Choose a reason for hiding this comment

Uh oh!

srotondo Aug 17, 2022

Choose a reason for hiding this comment

Uh oh!

datapythonista Aug 17, 2022

Choose a reason for hiding this comment

Uh oh!

srotondo commented Aug 29, 2022

Uh oh!

datapythonista left a comment

Choose a reason for hiding this comment

Uh oh!

srotondo commented Sep 2, 2022

Uh oh!

datapythonista commented Sep 6, 2022

Uh oh!

srotondo commented Sep 6, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

datapythonista commented Sep 6, 2022

Uh oh!

srotondo commented Sep 6, 2022

Uh oh!

MarcoGorelli left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

srotondo commented Sep 16, 2022

Uh oh!

MarcoGorelli commented Sep 16, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

srotondo commented Sep 16, 2022

Uh oh!

MarcoGorelli left a comment

Choose a reason for hiding this comment

Uh oh!

MarcoGorelli left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MarcoGorelli Sep 17, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MarcoGorelli Sep 17, 2022

Choose a reason for hiding this comment

Uh oh!

srotondo Sep 17, 2022

Choose a reason for hiding this comment

Uh oh!

MarcoGorelli Sep 18, 2022

Choose a reason for hiding this comment

Uh oh!

srotondo commented Sep 6, 2022 •

edited

Loading

MarcoGorelli left a comment •

edited

Loading

MarcoGorelli commented Sep 16, 2022 •

edited

Loading

MarcoGorelli Sep 18, 2022 •

edited

Loading

srotondo Sep 18, 2022 •

edited

Loading

MarcoGorelli Sep 19, 2022 •

edited

Loading