Skip to content

BUG: inconsistent handling of exact=False case in to_datetime parsing #50435

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Dec 31, 2022
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ repos:
'--extensions=c,h',
'--headers=h',
--recursive,
'--filter=-readability/casting,-runtime/int,-build/include_subdir'
'--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good check to keep in place - otherwise these functions get unwieldy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the function is now 522 lines long, whereas the limit for this check is 500

Is it OK to turn it off now, or would you prefer a precursor PR to split up this function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not a great solution here. I think OK for now but something we should take care of in a follow up.

Ideally you could change numpy upstream to split the function (maybe split into a date / time parsing functions?). That way we wouldn't diverge too far from them when we bring that downstream

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I'll see if I can upstream something, thanks!

]
- repo: https://github.com/PyCQA/flake8
rev: 6.0.0
Expand Down
3 changes: 2 additions & 1 deletion pandas/_libs/tslib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@ cnp.import_array()

from pandas._libs.tslibs.np_datetime cimport (
NPY_DATETIMEUNIT,
Exact,
NPY_FR_ns,
check_dts_bounds,
get_datetime64_value,
Expand Down Expand Up @@ -403,7 +404,7 @@ cpdef array_to_datetime(
bint utc=False,
bint require_iso8601=False,
format: str | None=None,
bint exact=True,
Exact exact=Exact.EXACT_MATCH,
):
"""
Converts a 1D array of date-like values to a numpy array of either:
Expand Down
8 changes: 7 additions & 1 deletion pandas/_libs/tslibs/np_datetime.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -96,7 +96,7 @@ cdef int string_to_dts(
int* out_tzoffset,
bint want_exc,
format: str | None = *,
bint exact = *
Exact exact = *
) except? -1

cdef NPY_DATETIMEUNIT get_unit_from_dtype(cnp.dtype dtype)
Expand All @@ -120,3 +120,9 @@ cdef int64_t convert_reso(
NPY_DATETIMEUNIT to_reso,
bint round_ok,
) except? -1

cdef extern from "src/datetime/np_datetime_strings.h":
cdef enum Exact:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name Exact is a little too vague - maybe better as DatetimeFormatRequirement?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good call, I've gone with FormatRequirement to keep lines not-too-long

PARTIAL_MATCH
EXACT_MATCH
NO_MATCH
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does NoMatch really mean that the format is inferred?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I've renamed to INFER_FORMAT, thanks!

6 changes: 3 additions & 3 deletions pandas/_libs/tslibs/np_datetime.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -53,7 +53,7 @@ cdef extern from "src/datetime/np_datetime_strings.h":
npy_datetimestruct *out,
NPY_DATETIMEUNIT *out_bestunit,
int *out_local, int *out_tzoffset,
const char *format, int format_len, int exact)
const char *format, int format_len, Exact exact)


# ----------------------------------------------------------------------
Expand Down Expand Up @@ -279,7 +279,7 @@ cdef int string_to_dts(
int* out_tzoffset,
bint want_exc,
format: str | None=None,
bint exact=True,
Exact exact=EXACT_MATCH,
) except? -1:
cdef:
Py_ssize_t length
Expand All @@ -291,7 +291,7 @@ cdef int string_to_dts(
if format is None:
format_buf = b""
format_length = 0
exact = False
exact = NO_MATCH
else:
format_buf = get_c_string_buf_and_size(format, &format_length)
return parse_iso_8601_datetime(buf, length, want_exc,
Expand Down
143 changes: 105 additions & 38 deletions pandas/_libs/tslibs/src/datetime/np_datetime_strings.c
Original file line number Diff line number Diff line change
Expand Up @@ -67,49 +67,59 @@ This file implements string parsing and creation for NumPy datetime.
* Returns 0 on success, -1 on failure.
*/

enum Comparison {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design wise this assumes that the callee knows what the caller is doing and can instruct it on actions to take. I think it would be better to separate those entities and just have the callee report back what it knows.

With that in mind, maybe call rename this to DatetimePartParseResult and maybe have values of PARTIAL_MATCH, EXACT_MATCH, NO_MATCH. The caller can then choose to take action independent of this function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in, to name the values the same way as those from FormatRequirement?

The issue is that different format requirements can result in the same result from this function - for example, both EXACT_MATCH where the format matches and INFER_FORMAT can return 0

I've renamed the values to

    COMPARISON_SUCCESS,
    COMPLETED_PARTIAL_MATCH,
    COMPARISON_ERROR

, is that clearer?

ADVANCE,
RETURN,
ERROR
};
// This function will advance the pointer on format
// and decrement characters_remaining by n on success
// On failure will return -1 without incrementing
static int compare_format(const char **format, int *characters_remaining,
const char *compare_to, int n, const int exact) {
// On failure will return ERROR without incrementing
// If `exact` is PARTIAL_MATCH, and the `format` string has
// been exhausted, then signal to the caller to finish parsing.
static enum Comparison compare_format(
const char **format,
int *characters_remaining,
const char *compare_to,
int n,
const enum Exact exact
) {
if (exact == PARTIAL_MATCH && !*characters_remaining) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (exact == PARTIAL_MATCH && !*characters_remaining) {
if (exact == PARTIAL_MATCH && *characters_remaining == 0) {

Nit but would be good to explicitly compare to 0. Depending on code structure we may also want to be careful what happens if characters_remaining somehow ends up as negative

return RETURN;
}
if (exact == NO_MATCH) {
return ADVANCE;
}
if (*characters_remaining < n) {
if (exact) {
// TODO(pandas-dev): in the future we should set a PyErr here
// to be very clear about what went wrong
return -1;
} else if (*characters_remaining) {
// TODO(pandas-dev): same return value in this function as
// above branch, but stub out a future where
// we have a better error message
return -1;
} else {
return 0;
}
// TODO(pandas-dev): PyErr to differentiate what went wrong
return ERROR;
} else {
if (strncmp(*format, compare_to, n)) {
// TODO(pandas-dev): PyErr to differentiate what went wrong
return -1;
return ERROR;
} else {
*format += n;
*characters_remaining -= n;
return 0;
return ADVANCE;
}
}
return 0;
return ADVANCE;
}

int parse_iso_8601_datetime(const char *str, int len, int want_exc,
npy_datetimestruct *out,
NPY_DATETIMEUNIT *out_bestunit,
int *out_local, int *out_tzoffset,
const char* format, int format_len, int exact) {
const char* format, int format_len,
enum Exact exact) {
if (len < 0 || format_len < 0)
goto parse_error;
int year_leap = 0;
int i, numdigits;
const char *substr;
int sublen;
NPY_DATETIMEUNIT bestunit = NPY_FR_GENERIC;
enum Comparison comparison;

/* If year-month-day are separated by a valid separator,
* months/days without leading zeroes will be parsed
Expand Down Expand Up @@ -139,8 +149,11 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
while (sublen > 0 && isspace(*substr)) {
++substr;
--sublen;
if (compare_format(&format, &format_len, " ", 1, exact)) {
comparison = compare_format(&format, &format_len, " ", 1, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
}

Expand All @@ -155,8 +168,14 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
}

/* PARSE THE YEAR (4 digits) */
if (compare_format(&format, &format_len, "%Y", 2, exact)) {
if (exact == PARTIAL_MATCH && !format_len) {
goto finish;
}
comparison = compare_format(&format, &format_len, "%Y", 2, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}

out->year = 0;
Expand Down Expand Up @@ -202,8 +221,11 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
++substr;
--sublen;

if (compare_format(&format, &format_len, &ymd_sep, 1, exact)) {
comparison = compare_format(&format, &format_len, &ymd_sep, 1, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
/* Cannot have trailing separator */
if (sublen == 0 || !isdigit(*substr)) {
Expand All @@ -212,8 +234,11 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
}

/* PARSE THE MONTH */
if (compare_format(&format, &format_len, "%m", 2, exact)) {
comparison = compare_format(&format, &format_len, "%m", 2, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
/* First digit required */
out->month = (*substr - '0');
Expand Down Expand Up @@ -258,14 +283,20 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
}
++substr;
--sublen;
if (compare_format(&format, &format_len, &ymd_sep, 1, exact)) {
comparison = compare_format(&format, &format_len, &ymd_sep, 1, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
}

/* PARSE THE DAY */
if (compare_format(&format, &format_len, "%d", 2, exact)) {
comparison = compare_format(&format, &format_len, "%d", 2, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
/* First digit required */
if (!isdigit(*substr)) {
Expand Down Expand Up @@ -306,15 +337,21 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
if ((*substr != 'T' && *substr != ' ') || sublen == 1) {
goto parse_error;
}
if (compare_format(&format, &format_len, substr, 1, exact)) {
goto parse_error;
}
comparison = compare_format(&format, &format_len, substr, 1, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
++substr;
--sublen;

/* PARSE THE HOURS */
if (compare_format(&format, &format_len, "%H", 2, exact)) {
comparison = compare_format(&format, &format_len, "%H", 2, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
/* First digit required */
if (!isdigit(*substr)) {
Expand Down Expand Up @@ -359,8 +396,11 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
if (sublen == 0 || !isdigit(*substr)) {
goto parse_error;
}
if (compare_format(&format, &format_len, ":", 1, exact)) {
comparison = compare_format(&format, &format_len, ":", 1, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
} else if (!isdigit(*substr)) {
if (!hour_was_2_digits) {
Expand All @@ -370,8 +410,11 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
}

/* PARSE THE MINUTES */
if (compare_format(&format, &format_len, "%M", 2, exact)) {
comparison = compare_format(&format, &format_len, "%M", 2, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
/* First digit required */
out->min = (*substr - '0');
Expand Down Expand Up @@ -405,8 +448,11 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
/* If we make it through this condition block, then the next
* character is a digit. */
if (has_hms_sep && *substr == ':') {
if (compare_format(&format, &format_len, ":", 1, exact)) {
comparison = compare_format(&format, &format_len, ":", 1, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
++substr;
--sublen;
Expand All @@ -420,8 +466,11 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
}

/* PARSE THE SECONDS */
if (compare_format(&format, &format_len, "%S", 2, exact)) {
comparison = compare_format(&format, &format_len, "%S", 2, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
/* First digit required */
out->sec = (*substr - '0');
Expand All @@ -448,17 +497,23 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
if (sublen > 0 && *substr == '.') {
++substr;
--sublen;
if (compare_format(&format, &format_len, ".", 1, exact)) {
comparison = compare_format(&format, &format_len, ".", 1, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
} else {
bestunit = NPY_FR_s;
goto parse_timezone;
}

/* PARSE THE MICROSECONDS (0 to 6 digits) */
if (compare_format(&format, &format_len, "%f", 2, exact)) {
comparison = compare_format(&format, &format_len, "%f", 2, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
numdigits = 0;
for (i = 0; i < 6; ++i) {
Expand Down Expand Up @@ -524,8 +579,11 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
while (sublen > 0 && isspace(*substr)) {
++substr;
--sublen;
if (compare_format(&format, &format_len, " ", 1, exact)) {
comparison = compare_format(&format, &format_len, " ", 1, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
}

Expand All @@ -539,8 +597,11 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,

/* UTC specifier */
if (*substr == 'Z') {
if (compare_format(&format, &format_len, "%z", 2, exact)) {
comparison = compare_format(&format, &format_len, "%z", 2, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
/* "Z" should be equivalent to tz offset "+00:00" */
if (out_local != NULL) {
Expand All @@ -561,8 +622,11 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
--sublen;
}
} else if (*substr == '-' || *substr == '+') {
if (compare_format(&format, &format_len, "%z", 2, exact)) {
comparison = compare_format(&format, &format_len, "%z", 2, exact);
if (comparison == ERROR) {
goto parse_error;
} else if (comparison == RETURN) {
goto finish;
}
/* Time zone offset */
int offset_neg = 0, offset_hour = 0, offset_minute = 0;
Expand Down Expand Up @@ -647,6 +711,9 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
while (sublen > 0 && isspace(*substr)) {
++substr;
--sublen;
if (exact == PARTIAL_MATCH && !format_len) {
goto finish;
}
if (compare_format(&format, &format_len, " ", 1, exact)) {
goto parse_error;
}
Expand Down
Loading