Skip to content

BUG: inconsistent handling of exact=False case in to_datetime parsing #50435

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Dec 31, 2022

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Dec 25, 2022

Haven't added a whatsnew note, as exact never worked to begin with for ISO8601 formats, and this just corrects #49333

@MarcoGorelli MarcoGorelli force-pushed the exact-inconsistencies branch 2 times, most recently from b366e31 to ee7f95e Compare December 25, 2022 11:09
@MarcoGorelli MarcoGorelli marked this pull request as ready for review December 25, 2022 12:15
@MarcoGorelli MarcoGorelli added Bug Datetime Datetime data dtype labels Dec 25, 2022
@mroeschke mroeschke added this to the 2.0 milestone Dec 27, 2022
@@ -63,7 +63,7 @@ repos:
'--extensions=c,h',
'--headers=h',
--recursive,
'--filter=-readability/casting,-runtime/int,-build/include_subdir'
'--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good check to keep in place - otherwise these functions get unwieldy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the function is now 522 lines long, whereas the limit for this check is 500

Is it OK to turn it off now, or would you prefer a precursor PR to split up this function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not a great solution here. I think OK for now but something we should take care of in a follow up.

Ideally you could change numpy upstream to split the function (maybe split into a date / time parsing functions?). That way we wouldn't diverge too far from them when we bring that downstream

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I'll see if I can upstream something, thanks!

@@ -139,6 +133,9 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
while (sublen > 0 && isspace(*substr)) {
++substr;
--sublen;
if (exact == PARTIAL_MATCH && !format_len) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not just make compare_format return a set of Enum depending on what is left in the string to consume and what the matching semantics are? Seems like it would naturally fit there rather than a separate branch every time

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, I think you can return an enum from check_format of values like:

OK_EXACT
OK_PARTIAL

etc... describing the different states, then branch in the caller appropriately

* * NO_MATCH: don't require any match - parse without comparing
* with 'format'.
*/
enum Exact {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this file is vendored from numpy. The ship has sailed a bit in terms of editing directly, but when we move to Meson and abandon setuptools its worth considering a split to put all of our custom logic into a separate library and leaving the vendored code in place (or upstreaming changes if they make sense for numpy)

Copy link
Member Author

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for your review!

@@ -63,7 +63,7 @@ repos:
'--extensions=c,h',
'--headers=h',
--recursive,
'--filter=-readability/casting,-runtime/int,-build/include_subdir'
'--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the function is now 522 lines long, whereas the limit for this check is 500

Is it OK to turn it off now, or would you prefer a precursor PR to split up this function?

@@ -63,7 +63,7 @@ repos:
'--extensions=c,h',
'--headers=h',
--recursive,
'--filter=-readability/casting,-runtime/int,-build/include_subdir'
'--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not a great solution here. I think OK for now but something we should take care of in a follow up.

Ideally you could change numpy upstream to split the function (maybe split into a date / time parsing functions?). That way we wouldn't diverge too far from them when we bring that downstream

@@ -120,3 +120,9 @@ cdef int64_t convert_reso(
NPY_DATETIMEUNIT to_reso,
bint round_ok,
) except? -1

cdef extern from "src/datetime/np_datetime_strings.h":
cdef enum Exact:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name Exact is a little too vague - maybe better as DatetimeFormatRequirement?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good call, I've gone with FormatRequirement to keep lines not-too-long

cdef enum Exact:
PARTIAL_MATCH
EXACT_MATCH
NO_MATCH
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does NoMatch really mean that the format is inferred?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I've renamed to INFER_FORMAT, thanks!

@@ -67,49 +67,59 @@ This file implements string parsing and creation for NumPy datetime.
* Returns 0 on success, -1 on failure.
*/

enum Comparison {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design wise this assumes that the callee knows what the caller is doing and can instruct it on actions to take. I think it would be better to separate those entities and just have the callee report back what it knows.

With that in mind, maybe call rename this to DatetimePartParseResult and maybe have values of PARTIAL_MATCH, EXACT_MATCH, NO_MATCH. The caller can then choose to take action independent of this function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in, to name the values the same way as those from FormatRequirement?

The issue is that different format requirements can result in the same result from this function - for example, both EXACT_MATCH where the format matches and INFER_FORMAT can return 0

I've renamed the values to

    COMPARISON_SUCCESS,
    COMPLETED_PARTIAL_MATCH,
    COMPARISON_ERROR

, is that clearer?

int n,
const enum Exact exact
) {
if (exact == PARTIAL_MATCH && !*characters_remaining) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (exact == PARTIAL_MATCH && !*characters_remaining) {
if (exact == PARTIAL_MATCH && *characters_remaining == 0) {

Nit but would be good to explicitly compare to 0. Depending on code structure we may also want to be careful what happens if characters_remaining somehow ends up as negative

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good. minor nits on typedefs otherwise lgtm

@@ -67,49 +67,59 @@ This file implements string parsing and creation for NumPy datetime.
* Returns 0 on success, -1 on failure.
*/

enum DatetimePartParseResult {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use typedef here you don't need to repeat enum every time you refer to this type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thanks!

* be able to parse it without error is '%Y-%m-%d';
* * INFER_FORMAT: parse without comparing 'format' (i.e. infer it).
*/
enum FormatRequirement {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should typedef here as well

int n,
const FormatRequirement format_requirement
) {
if (format_requirement == PARTIAL_MATCH && !*characters_remaining) {
Copy link
Member

@WillAyd WillAyd Dec 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also a nit but I think we need to handle characters_remaining being negative. It could just simply return a COMPARISON_ERROR right?

Understood it is impossible in the current state of things. However, if this gets refactored in the future and a negative number makes its way in here uncaught I think it would return a COMPARISON_SUCCCESS and be very difficult to troubleshoot without intimate knowledge of this function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, thanks for this (and other) thoughtful comments!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also a nit but I think we need to handle characters_remaining being 0

I presume you meant "less than 0" - that's what I've gone for anyway

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm on green

@MarcoGorelli MarcoGorelli merged commit a28cadb into pandas-dev:main Dec 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: inconsistent handling of exact=False case in to_datetime parsing
3 participants