Skip to content

BUG: out-of-bounds pydatetime with errors='coerce' and non-ISO format coerces all elements to NaT #50265

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 20, 2022

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Dec 15, 2022

I haven't added a whatsnew note for #50255 , as this feature (parsing pydatetime / datetime / np.datetime64) objects with non-ISO formats wasn't available to users in the previous release anyway


This is much easier to review with "hide whitespace", as many of the changes are just changing a level of indentation

image

@MarcoGorelli MarcoGorelli added Datetime Datetime data dtype Bug labels Dec 15, 2022
@MarcoGorelli MarcoGorelli force-pushed the coercion branch 4 times, most recently from 2ac0725 to a8eff6c Compare December 15, 2022 10:22
@MarcoGorelli MarcoGorelli force-pushed the coercion branch 2 times, most recently from 4a32933 to d60b631 Compare December 15, 2022 10:39
Comment on lines -1078 to 1130
"format,warning", [(None, UserWarning), ("H%:M%:S%", None)]
"format,warning", [(None, UserWarning), ("%H:%M:%S", None)]
)
def test_datetime_invalid_scalar(self, value, format, warning):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test claims to be testing invalid scalars, but in reality it was erroring before the parsing even started because "H%:M%:S%" is an invalid format

So, I've changed the format to reflect what it looks like this test was meant to be testing, and have made a separate test above to check that invalid formats raise

@@ -1098,7 +1107,7 @@ def test_datetime_invalid_scalar(self, value, format, warning):

@pytest.mark.parametrize("value", ["3000/12/11 00:00:00"])
@pytest.mark.parametrize(
"format,warning", [(None, UserWarning), ("H%:M%:S%", None)]
"format,warning", [(None, UserWarning), ("%H:%M:%S", None)]
)
def test_datetime_outofbounds_scalar(self, value, format, warning):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above (this errors before even reaching the out-of-bounds scalar...)

found = format_regex.search(val)
if not found:
if is_coerce:
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the code inside this try/except just indenting whats currently there or are there other changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's basically just that, but I also removed some

                if is_coerce:
                    iresult[i] = NPY_NAT
                    continue

as that already happens in the outer try-except

If you hide whitespace it'll be easier to see - you can also go to https://github.com/pandas-dev/pandas/pull/50265/files?diff=unified&w=1 (notice the ?diff=unified&w=1 at the end) to see it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I'm not a big fan of try/excepts with a huge block of code inside, but I guess since the failures modes here should be made consistent with the broader to_datetime(errors=) keyword I suppose this is fine.

Question, sometime we are issues some ValueErrors internally in this function that will be caught by this try except e.g.

            if iso_year != -1:
                if iso_week == -1 or weekday == -1:
                    raise ValueError("ISO year directive '%G' must be used with "
                                     "the ISO week directive '%V' and a weekday "
                                     "directive '%A', '%a', '%w', or '%u'.")

Is this necessarily a "parsing" error that should be influenced by the errors keyword?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can rewrite it such that each check_dts_bounds is within its own try-except - maybe that's better actually, thanks

I'll get back to you on that ValueError

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, refactored

That's a really good point about the "ISO year directive" error, thanks, good catch! So, I think that one should always raise. At least, that seems to be the current intention, there's no if is_coerce check there like there is for some other errors, such as

except ValueError:
if is_coerce:
iresult[i] = NPY_NAT
continue
raise

So, I've parametrised test_error_iso_week_year too over errors to check that this one actually raises

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not actually opposed to the large try/except block if we expect ValueError and OutofBounds Errors to be affected by errors. If other excepts in this functions can be factored out (like the iso year directive one) and others then it would help ensure errors is consistently handled

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks - have factored out the iso year directive check in #50309, so I've gone back to having an outer try-except block (so again, "hide-whitespace" will come in handy when reviewing)

@MarcoGorelli MarcoGorelli marked this pull request as draft December 16, 2022 18:27
@MarcoGorelli MarcoGorelli marked this pull request as ready for review December 16, 2022 19:46
@MarcoGorelli MarcoGorelli marked this pull request as draft December 17, 2022 00:55
@MarcoGorelli MarcoGorelli marked this pull request as ready for review December 19, 2022 21:15
@MarcoGorelli
Copy link
Member Author

@jbrockmendel @mroeschke sorry for the ping, just wanted to ask if we could get this one in, as:

That would conclude the major to_datetime block of work I was planning to do - anything else would be relatively minor / fixing anything that comes up

Thanks 🙌

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@MarcoGorelli MarcoGorelli merged commit ca04349 into pandas-dev:main Dec 20, 2022
@MarcoGorelli MarcoGorelli added this to the 2.0 milestone Dec 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype
Projects
None yet
3 participants