-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: inconsistent handling of exact=False case in to_datetime parsing #50435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: inconsistent handling of exact=False case in to_datetime parsing #50435
Conversation
b366e31
to
ee7f95e
Compare
b0cd67c
to
3de2331
Compare
@@ -63,7 +63,7 @@ repos: | |||
'--extensions=c,h', | |||
'--headers=h', | |||
--recursive, | |||
'--filter=-readability/casting,-runtime/int,-build/include_subdir' | |||
'--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good check to keep in place - otherwise these functions get unwieldy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the function is now 522 lines long, whereas the limit for this check is 500
Is it OK to turn it off now, or would you prefer a precursor PR to split up this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not a great solution here. I think OK for now but something we should take care of in a follow up.
Ideally you could change numpy upstream to split the function (maybe split into a date / time parsing functions?). That way we wouldn't diverge too far from them when we bring that downstream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I'll see if I can upstream something, thanks!
@@ -139,6 +133,9 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |||
while (sublen > 0 && isspace(*substr)) { | |||
++substr; | |||
--sublen; | |||
if (exact == PARTIAL_MATCH && !format_len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not just make compare_format
return a set of Enum depending on what is left in the string to consume and what the matching semantics are? Seems like it would naturally fit there rather than a separate branch every time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, I think you can return an enum from check_format
of values like:
OK_EXACT
OK_PARTIAL
etc... describing the different states, then branch in the caller appropriately
* * NO_MATCH: don't require any match - parse without comparing | ||
* with 'format'. | ||
*/ | ||
enum Exact { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI this file is vendored from numpy. The ship has sailed a bit in terms of editing directly, but when we move to Meson and abandon setuptools its worth considering a split to put all of our custom logic into a separate library and leaving the vendored code in place (or upstreaming changes if they make sense for numpy)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for your review!
@@ -63,7 +63,7 @@ repos: | |||
'--extensions=c,h', | |||
'--headers=h', | |||
--recursive, | |||
'--filter=-readability/casting,-runtime/int,-build/include_subdir' | |||
'--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the function is now 522 lines long, whereas the limit for this check is 500
Is it OK to turn it off now, or would you prefer a precursor PR to split up this function?
@@ -63,7 +63,7 @@ repos: | |||
'--extensions=c,h', | |||
'--headers=h', | |||
--recursive, | |||
'--filter=-readability/casting,-runtime/int,-build/include_subdir' | |||
'--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not a great solution here. I think OK for now but something we should take care of in a follow up.
Ideally you could change numpy upstream to split the function (maybe split into a date / time parsing functions?). That way we wouldn't diverge too far from them when we bring that downstream
pandas/_libs/tslibs/np_datetime.pxd
Outdated
@@ -120,3 +120,9 @@ cdef int64_t convert_reso( | |||
NPY_DATETIMEUNIT to_reso, | |||
bint round_ok, | |||
) except? -1 | |||
|
|||
cdef extern from "src/datetime/np_datetime_strings.h": | |||
cdef enum Exact: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name Exact
is a little too vague - maybe better as DatetimeFormatRequirement
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes good call, I've gone with FormatRequirement
to keep lines not-too-long
pandas/_libs/tslibs/np_datetime.pxd
Outdated
cdef enum Exact: | ||
PARTIAL_MATCH | ||
EXACT_MATCH | ||
NO_MATCH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does NoMatch
really mean that the format is inferred?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I've renamed to INFER_FORMAT
, thanks!
@@ -67,49 +67,59 @@ This file implements string parsing and creation for NumPy datetime. | |||
* Returns 0 on success, -1 on failure. | |||
*/ | |||
|
|||
enum Comparison { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Design wise this assumes that the callee knows what the caller is doing and can instruct it on actions to take. I think it would be better to separate those entities and just have the callee report back what it knows.
With that in mind, maybe call rename this to DatetimePartParseResult
and maybe have values of PARTIAL_MATCH
, EXACT_MATCH
, NO_MATCH
. The caller can then choose to take action independent of this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in, to name the values the same way as those from FormatRequirement
?
The issue is that different format requirements can result in the same result from this function - for example, both EXACT_MATCH
where the format matches and INFER_FORMAT
can return 0
I've renamed the values to
COMPARISON_SUCCESS,
COMPLETED_PARTIAL_MATCH,
COMPARISON_ERROR
, is that clearer?
int n, | ||
const enum Exact exact | ||
) { | ||
if (exact == PARTIAL_MATCH && !*characters_remaining) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (exact == PARTIAL_MATCH && !*characters_remaining) { | |
if (exact == PARTIAL_MATCH && *characters_remaining == 0) { |
Nit but would be good to explicitly compare to 0. Depending on code structure we may also want to be careful what happens if characters_remaining
somehow ends up as negative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good. minor nits on typedefs otherwise lgtm
@@ -67,49 +67,59 @@ This file implements string parsing and creation for NumPy datetime. | |||
* Returns 0 on success, -1 on failure. | |||
*/ | |||
|
|||
enum DatetimePartParseResult { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use typedef
here you don't need to repeat enum
every time you refer to this type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, thanks!
* be able to parse it without error is '%Y-%m-%d'; | ||
* * INFER_FORMAT: parse without comparing 'format' (i.e. infer it). | ||
*/ | ||
enum FormatRequirement { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should typedef here as well
int n, | ||
const FormatRequirement format_requirement | ||
) { | ||
if (format_requirement == PARTIAL_MATCH && !*characters_remaining) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also a nit but I think we need to handle characters_remaining being negative. It could just simply return a COMPARISON_ERROR
right?
Understood it is impossible in the current state of things. However, if this gets refactored in the future and a negative number makes its way in here uncaught I think it would return a COMPARISON_SUCCCESS
and be very difficult to troubleshoot without intimate knowledge of this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, thanks for this (and other) thoughtful comments!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also a nit but I think we need to handle characters_remaining being 0
I presume you meant "less than 0" - that's what I've gone for anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm on green
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Haven't added a whatsnew note, as
exact
never worked to begin with for ISO8601 formats, and this just corrects #49333