-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Behavior with fallback between raise and coerce #46071 #47745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks for the PR @srotondo. Do you mind reverting the unrelated changes (removing or adding blank lines) please? Also, we need to add a test that captures the problem, fails with the original code, and passes after your changes. The issue number, better have it as a comment in the test, not in the code. Also, we need a release note with the change. You can have a look at any merged bug PRs as a reference. |
@datapythonista I've added the test and the release note to this code, is there anything else you would like me to do? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, and thanks for the changes @srotondo. Can you have a look at the comment please?
The fallback behavior is apparent in the code but not mentioned in the documentation for
I think it would be accurate to add one more sentence at the end:
That should be sufficient to remove the expectation that an error will be raised without committing to an exact description of the fallback parsing. |
@datapythonista I believe I have made the appropriate changes. Is there anything else? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @srotondo, this makes more sense. I suggest some changes to make code more readable, see what you can do, and I'll review again when it's easier to see every case being tested.
es1 = pd.Series(["1/1/2000", "7/12/1200"]) | ||
es2 = pd.Series(["1/1/2000", "Hello"]) | ||
|
||
if error == "coerce": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fixes @srotondo. This test is a bit complex and not very easy to read. I think we could make things more readable if we have if for example split this in different tests. Something like:
test_to_datetime_infer_datetime_format_with_different_formats
test_to_datetime_infer_datetime_format_with_nat
test_to_datetime_infer_datetime_format_with_out_of_bounds
test_to_datetime_infer_datetime_format_with_invalid_date
What do you think? Then each test would be quite easy to read. Or maybe by parametrizing the input and the two outputs (coerce and raise). This will avoid a bit of unrepeated code and be quite concise, but probably not so readable.
See what makes sense to you, but would be good to make things as simple as possible, as the pandas codebase is already huge and too complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@datapythonista thank you for your feedback on the tests. Because the cases are all closely related, I think it will be easier to understand as one test, but I did simplify some of the test and added more comments to clarify parts of the test.
expected2 = pd.Series([pd.Timestamp("2000-01-01 00:00:00"), pd.NaT]) | ||
|
||
es1 = pd.Series(["1/1/2000", "7/12/1200"]) | ||
es2 = pd.Series(["1/1/2000", "Hello"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe instead of Hello
use Invalid date
or something like that, so it's easier to understand what we're doing here.
@datapythonista I believe that I have made all the changes you have requested, including simplifying the tests. Is there anything else you would like me to do? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I continue to find this test too hard to read. And very hard to understand what's wrong when it fails. I need to insist on using good testing practices, and restructure in a way that make tests atomic and simple. Like not having if
statements, not having more than one expected value (unless using parametrization)...
If you don't find a better, way, using the previously suggested tests is a good start.
@datapythonista I've split up the test into multiple pieces like you asked. |
@datapythonista It might look like some of these tests are reimplementations, but there are some key differences in the cases that are being tested. Given the line number you linked in that test, I assume you are mainly referring to these 3 tests in that file: To start with, |
Thanks @srotondo that makes sense. Is there a reason why the tests you added are not next to the ones I mentioned? Seems like they should be next to each other. Or, not sure if it could make sense to add an extra assert in those lines to check the same cases when the invalid values are in a Series following a valid value. And it's still not clear to me what's the exact case that you're fixing. Or are all the tests that you added failing without the change in the code function? |
@datapythonista The true failing case is that when 2 valid dates with different formats were passed in using the The choice of file for this test is mostly arbitrary, and I can move it if you'd like. However, could you please tell me if there is anything else you would like me to change as well so that I can do all of it at the same time and we can get this bugfix merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be accurate to add one more sentence at the end:
If subsequent datetime strings do not follow the inferred format, parsing will fall back to the slower method of determining the format for each string individually.
Hey - first, sorry to block this, and thanks for your work here
Just wanted to ask whether this is really what would be best - why not make infer_datetime_format
strict:
- if the format is inferred, and some row doesn't conform to it, it just errors or coerces, rather than falling back to the slower path
This would involve removing the special-casing here
pandas/pandas/core/tools/datetimes.py
Lines 488 to 499 in ac648ee
if not infer_datetime_format: | |
if errors == "raise": | |
raise | |
elif errors == "coerce": | |
result = np.empty(arg.shape, dtype="M8[ns]") | |
iresult = result.view("i8") | |
iresult.fill(iNaT) | |
else: | |
result = arg | |
else: | |
# Indicates to the caller to fallback to objects_to_datetime64ns | |
return None |
and just have
if errors == "raise":
raise
elif errors == "coerce":
result = np.empty(arg.shape, dtype="M8[ns]")
iresult = result.view("i8")
iresult.fill(iNaT)
else:
result = arg
This would also solve the linked issue, and rather than adding a new if-then statement, it'd remove one
@MarcoGorelli There exists large amounts of code with the express purpose of causing the operation to fall back if the formats aren't the same. The original documentation seems to strongly suggest that |
Currently, the docs don't specify what happens if the format can be inferred, but subsequent rows don't match that format And users don't expect the format to change midway, especially when using this option, e.g. as in #46210 cc @mroeschke @jreback @jorisvandenbossche if you have comments TL;DR: I'm suggesting that when using Example
Currently, the latter raises (as expected) whereas the former swaps format midway |
@MarcoGorelli Even though it's not very well documented, looking at the code and following the pathway makes it very clear that the fallback is intentional. If you would like to change the behavior, I think that should be a separate PR. This code change deals with the fact that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey - just had a closer look, and indeed the user guide does explicitly note:
pandas will fallback to the usual parsing if either the format cannot be guessed or the format that was guessed cannot properly parse the entire column of strings
That's fine then, I'll open a separate issue about changing that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix itself looks good to me, thanks! I just have some comments about the tests:
- is the location right? Wouldn't
/pandas/tests/tools/test_to_datetime.py
be better? - couldn't we just cover all these new tests by parametrising over
errors
in/pandas/tests/tools/test_to_datetime.py::TestToDatetimeInferFormat::test_to_datetime_infer_datetime_format_inconsistent_format
?
Apologies for how long reviews are taking, I'll do my best to help push this across the line now
"error", | ||
["coerce", "raise"], | ||
) | ||
def test_fallback_different_formats(self, error): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this test duplicative of
pandas/pandas/tests/tools/test_to_datetime.py
Lines 2110 to 2125 in a712c50
@pytest.mark.parametrize( | |
"data", | |
[ | |
["01/01/2011 00:00:00", "01-02-2011 00:00:00", "2011-01-03T00:00:00"], | |
["Jan/01/2011", "Feb/01/2011", "Mar/01/2011"], | |
], | |
) | |
def test_to_datetime_infer_datetime_format_inconsistent_format(self, cache, data): | |
ser = Series(np.array(data)) | |
# When the format is inconsistent, infer_datetime_format should just | |
# fallback to the default parsing | |
tm.assert_series_equal( | |
to_datetime(ser, infer_datetime_format=False, cache=cache), | |
to_datetime(ser, infer_datetime_format=True, cache=cache), | |
) |
?
If so, why not just add
@pytest.mark.parametrize('errors', ['coerce', 'raise'])
to that test?
"dateseries", | ||
[ | ||
pd.Series(["1/1/2000", "7/12/1200"]), | ||
pd.Series(["1/1/2000", "Invalid input"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this duplicative of
pandas/pandas/tests/tools/test_to_datetime.py
Lines 989 to 1012 in a712c50
@pytest.mark.parametrize("values", [["a"], ["00:01:99"], ["a", "b", "99:00:00"]]) | |
@pytest.mark.parametrize("infer", [True, False]) | |
@pytest.mark.parametrize("format", [None, "H%:M%:S%"]) | |
def test_datetime_invalid_index(self, values, format, infer): | |
# GH24763 | |
res = to_datetime( | |
values, errors="ignore", format=format, infer_datetime_format=infer | |
) | |
tm.assert_index_equal(res, Index(values)) | |
res = to_datetime( | |
values, errors="coerce", format=format, infer_datetime_format=infer | |
) | |
tm.assert_index_equal(res, DatetimeIndex([NaT] * len(values))) | |
msg = ( | |
"is a bad directive in format|" | |
f"Given date string {values[0]} not likely a datetime|" | |
"second must be in 0..59" | |
) | |
with pytest.raises(ValueError, match=msg): | |
to_datetime( | |
values, errors="raise", format=format, infer_datetime_format=infer | |
) |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, that's a lot of comments there, so I'm going to try to answer all these questions here in this response here. Please tell me if it's not clear which comment I'm talking about at any given point here.
I was asked about these last 4 comments before, so I'll link my previous response here so you can see it in case you missed it.
It might look like some of these tests are reimplementations, but there are some key differences in the cases that are being tested. Given the line number you linked in that test, I assume you are mainly referring to these 3 tests in that file:
test_datetime_invalid_scalar
,test_datetime_outofbounds_scalar
, andtest_datetime_invalid_index
. While they do share a few similarities, there are some important key differences between these tests and my added tests.To start with,
test_datetime_invalid_scalar
andtest_datetime_outofbounds_scalar
only deal in scalar values, unlike my tests, which deal with lists. The difference fromtest_datetime_invalid_index
lies largely in the data being tested. The list used intest_datetime_invalid_index
is entirely made up of values that will not parse, while the lists in my tests always start with a valid date. This is important becauseinfer_datetime_format
only becomes relevant on the value after the first valid date, since it would need a format already to infer from. Therefore this test doesn't really check the fallback code as my tests are meant to.
In terms of your comments about the code changes and the first test comment, I agree with those changes and I can do that for sure. Also, in terms of test location, I'm fully open to changing that, since the choice I made in location was largely arbitrary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining and for your attention to detail, that's very clear! Apologies for having missed that, I'd looked through in-line comments but had missed this one in the main thread
For test location - does pandas/tests/arrays/test_datetimes.py::TestToDatetimeInferFormat
look good to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For test location - does
pandas/tests/arrays/test_datetimes.py::TestToDatetimeInferFormat
look good to you?
To clarify, is this a new section in test_datetimes.py
that you are suggesting, because as far as I can tell, this doesn't actually exist yet. Also, if we're going to keep the tests in this file, I guess it doesn't make much sense to change that test you mentioned at the very start since it's in a different file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to add new tests to the class which
pandas/pandas/tests/tools/test_to_datetime.py
Lines 2110 to 2125 in a712c50
@pytest.mark.parametrize( | |
"data", | |
[ | |
["01/01/2011 00:00:00", "01-02-2011 00:00:00", "2011-01-03T00:00:00"], | |
["Jan/01/2011", "Feb/01/2011", "Mar/01/2011"], | |
], | |
) | |
def test_to_datetime_infer_datetime_format_inconsistent_format(self, cache, data): | |
ser = Series(np.array(data)) | |
# When the format is inconsistent, infer_datetime_format should just | |
# fallback to the default parsing | |
tm.assert_series_equal( | |
to_datetime(ser, infer_datetime_format=False, cache=cache), | |
to_datetime(ser, infer_datetime_format=True, cache=cache), | |
) |
belongs to (TestToDatetimeInferFormat)
Co-authored-by: Marco Edward Gorelli <[email protected]>
PDEP4 was accepted, and it looks like the implementation is almost in Let's close this then Thanks @srotondo for your PR - you've done great work, and I'm sorry to be closing it, I just think a different direction is a better direction for the future of pandas |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.The problem reported by this issue is that success in parsing a date string depends on how errors are reported (coerce vs raise). How the error is reported should not affect whether or not a parsing error occurs. Initially it appears that experiment 1 should return an error (as experiment 3 does) because the second date string does not match the format established by the first string. However, there is a fallback mechanism that is supposed to try parsing again without the effect of the
infer_datetime_format
flag. So in fact, experiment 3 should succeed in parsing the date as experiment 1 does.The problem is that the fallback mechanism only works when
errors=raise
. This bug fix applies the same logic whenerrors=coerce
so that the fallback parsing will take place regardless of how errors are reported.