BUG: inconsistent handling of exact=False case in to_datetime parsing #50435

MarcoGorelli · 2022-12-25T11:06:42Z

closes BUG: inconsistent handling of exact=False case in to_datetime parsing #50412
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Haven't added a whatsnew note, as exact never worked to begin with for ISO8601 formats, and this just corrects #49333

pandas/_libs/tslibs/np_datetime.pxd

WillAyd · 2022-12-27T21:37:14Z

.pre-commit-config.yaml

@@ -63,7 +63,7 @@ repos:
            '--extensions=c,h',
            '--headers=h',
            --recursive,
-            '--filter=-readability/casting,-runtime/int,-build/include_subdir'
+            '--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size'


I think this is a good check to keep in place - otherwise these functions get unwieldy

Unfortunately the function is now 522 lines long, whereas the limit for this check is 500

Is it OK to turn it off now, or would you prefer a precursor PR to split up this function?

Hmm not a great solution here. I think OK for now but something we should take care of in a follow up.

Ideally you could change numpy upstream to split the function (maybe split it into date and time parsing functions?). That way we wouldn't diverge too far from them when we bring that downstream

OK I'll see if I can upstream something, thanks!

WillAyd · 2022-12-27T21:38:06Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.c

@@ -139,6 +133,9 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc,
    while (sublen > 0 && isspace(*substr)) {
        ++substr;
        --sublen;
+        if (exact == PARTIAL_MATCH && !format_len) {


Can we not just make compare_format return an enum value depending on what is left in the string to consume and what the matching semantics are? Seems like it would naturally fit there rather than a separate branch every time

To clarify, I think you can return an enum from compare_format with values like OK_EXACT, OK_PARTIAL, etc., describing the different states, then branch in the caller appropriately

WillAyd · 2022-12-27T21:39:08Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.h

+ *      * NO_MATCH: don't require any match - parse without comparing
+ *                  with 'format'.
+ */
+enum Exact {


FYI this file is vendored from numpy. The ship has sailed a bit in terms of editing directly, but when we move to Meson and abandon setuptools it's worth considering a split that puts all of our custom logic into a separate library and leaves the vendored code in place (or upstreaming changes if they make sense for numpy)

MarcoGorelli

thanks for your review!

MarcoGorelli · 2022-12-28T10:57:16Z

.pre-commit-config.yaml

Unfortunately the function is now 522 lines long, whereas the limit for this check is 500

Is it OK to turn it off now, or would you prefer a precursor PR to split up this function?

WillAyd · 2022-12-28T20:21:25Z

.pre-commit-config.yaml

Hmm not a great solution here. I think OK for now but something we should take care of in a follow up.

Ideally you could change numpy upstream to split the function (maybe split it into date and time parsing functions?). That way we wouldn't diverge too far from them when we bring that downstream

WillAyd · 2022-12-28T20:22:30Z

pandas/_libs/tslibs/np_datetime.pxd

@@ -120,3 +120,9 @@ cdef int64_t convert_reso(
    NPY_DATETIMEUNIT to_reso,
    bint round_ok,
 ) except? -1
+
+cdef extern from "src/datetime/np_datetime_strings.h":
+    cdef enum Exact:


I think the name Exact is a little too vague - maybe better as DatetimeFormatRequirement?

Yes good call, I've gone with FormatRequirement to keep lines not-too-long

WillAyd · 2022-12-28T20:23:14Z

pandas/_libs/tslibs/np_datetime.pxd

+    cdef enum Exact:
+        PARTIAL_MATCH
+        EXACT_MATCH
+        NO_MATCH


Does NO_MATCH really mean that the format is inferred?

You're right, I've renamed to INFER_FORMAT, thanks!

WillAyd · 2022-12-28T20:26:45Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.c

@@ -67,49 +67,59 @@ This file implements string parsing and creation for NumPy datetime.
 * Returns 0 on success, -1 on failure.
 */

+enum Comparison {


Design wise this assumes that the callee knows what the caller is doing and can instruct it on actions to take. I think it would be better to separate those entities and just have the callee report back what it knows.

With that in mind, maybe rename this to DatetimePartParseResult and maybe have values of PARTIAL_MATCH, EXACT_MATCH, NO_MATCH. The caller can then choose to take action independent of this function

As in, to name the values the same way as those from FormatRequirement?

The issue is that different format requirements can result in the same result from this function - for example, both EXACT_MATCH where the format matches and INFER_FORMAT can return 0

I've renamed the values to COMPARISON_SUCCESS, COMPLETED_PARTIAL_MATCH, COMPARISON_ERROR. Is that clearer?
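
For readers following the design point in this thread, here is a minimal sketch of a callee that only reports what it observed, with the caller deciding what to do. The enum and value names are the ones mentioned in this review; the body of `compare_format` is a simplified assumption, and `check_ymd_format` is a hypothetical caller invented for illustration. Neither is a copy of the merged code.

```c
#include <string.h>

/* What the caller asked for (names taken from this review). */
typedef enum {
    PARTIAL_MATCH,
    EXACT_MATCH,
    INFER_FORMAT
} FormatRequirement;

/* What the callee observed; it does not instruct the caller. */
typedef enum {
    COMPARISON_SUCCESS,
    COMPLETED_PARTIAL_MATCH,
    COMPARISON_ERROR
} DatetimePartParseResult;

/* Simplified assumption: compare the next n characters of *format
 * with compare_to and advance past them on success. */
static DatetimePartParseResult compare_format(
    const char **format,
    int *characters_remaining,
    const char *compare_to,
    int n,
    const FormatRequirement format_requirement
) {
    if (format_requirement == INFER_FORMAT) {
        return COMPARISON_SUCCESS;  /* nothing to compare against */
    }
    if (format_requirement == PARTIAL_MATCH && *characters_remaining == 0) {
        return COMPLETED_PARTIAL_MATCH;  /* format fully consumed */
    }
    if (*characters_remaining < n) {
        return COMPARISON_ERROR;  /* format too short for this part */
    }
    if (strncmp(*format, compare_to, (size_t)n) != 0) {
        return COMPARISON_ERROR;  /* format disagrees with this part */
    }
    *format += n;
    *characters_remaining -= n;
    return COMPARISON_SUCCESS;
}

/* Hypothetical caller: checks a format against year-month-day parts.
 * Returns 0 if the format is acceptable, -1 otherwise. */
static int check_ymd_format(const char *format, FormatRequirement req) {
    static const char *parts[] = {"%Y", "-", "%m", "-", "%d"};
    int remaining = (int)strlen(format);
    for (size_t i = 0; i < sizeof(parts) / sizeof(parts[0]); ++i) {
        DatetimePartParseResult result = compare_format(
            &format, &remaining, parts[i], (int)strlen(parts[i]), req);
        if (result == COMPARISON_ERROR) {
            return -1;
        }
        if (result == COMPLETED_PARTIAL_MATCH) {
            return 0;  /* the caller, not the callee, decides to stop */
        }
    }
    /* An exact match must also leave nothing unconsumed in the format. */
    if (req == EXACT_MATCH && remaining > 0) {
        return -1;
    }
    return 0;
}
```

In this sketch, the same `COMPARISON_SUCCESS` value can come back under both `EXACT_MATCH` and `INFER_FORMAT`, which is why the result values are named differently from the requirement values.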

WillAyd · 2022-12-28T20:27:53Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.c

+        int n,
+        const enum Exact exact
+) {
+  if (exact == PARTIAL_MATCH && !*characters_remaining) {


Suggested change

-  if (exact == PARTIAL_MATCH && !*characters_remaining) {
+  if (exact == PARTIAL_MATCH && *characters_remaining == 0) {

Nit but would be good to explicitly compare to 0. Depending on code structure we may also want to be careful what happens if characters_remaining somehow ends up as negative

WillAyd

looks pretty good. minor nits on typedefs otherwise lgtm

WillAyd · 2022-12-29T20:16:35Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.c

@@ -67,49 +67,59 @@ This file implements string parsing and creation for NumPy datetime.
 * Returns 0 on success, -1 on failure.
 */

+enum DatetimePartParseResult {


If you use typedef here you don't need to repeat enum every time you refer to this type

nice, thanks!
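
As a small illustration of the typedef point, the sketch below contrasts a tagged enum with a typedef'd one. All type, value, and function names here are hypothetical and chosen only for the comparison.

```c
/* Without a typedef, every use must spell out the 'enum' keyword. */
enum ParseStatusTagged { TAGGED_OK, TAGGED_ERROR };

static enum ParseStatusTagged tagged_status(int ok) {
    return ok ? TAGGED_OK : TAGGED_ERROR;
}

/* With a typedef, the bare name is a type on its own. */
typedef enum { ALIASED_OK, ALIASED_ERROR } ParseStatusAliased;

static ParseStatusAliased aliased_status(int ok) {
    return ok ? ALIASED_OK : ALIASED_ERROR;
}
```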

WillAyd · 2022-12-29T20:18:32Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.h

+ *           be able to parse it without error is '%Y-%m-%d';
+ *      * INFER_FORMAT: parse without comparing 'format' (i.e. infer it).
+ */
+enum FormatRequirement {


should typedef here as well

WillAyd · 2022-12-29T21:33:46Z

pandas/_libs/tslibs/src/datetime/np_datetime_strings.c

+        int n,
+        const FormatRequirement format_requirement
+) {
+  if (format_requirement == PARTIAL_MATCH && !*characters_remaining) {


also a nit but I think we need to handle characters_remaining being negative. It could just simply return a COMPARISON_ERROR right?

Understood it is impossible in the current state of things. However, if this gets refactored in the future and a negative number makes its way in here uncaught I think it would return a COMPARISON_SUCCESS and be very difficult to troubleshoot without intimate knowledge of this function

yup, thanks for this (and other) thoughtful comments!

also a nit but I think we need to handle characters_remaining being 0

I presume you meant "less than 0" - that's what I've gone for anyway
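
The concern about negative values can be shown with a reduced sketch. The result names come from this review; the two helper functions are hypothetical and isolate only the check on `characters_remaining`, so this is an illustration of the failure mode rather than the merged code.

```c
/* Result names as used in this review; the helpers below are hypothetical. */
typedef enum {
    COMPARISON_SUCCESS,
    COMPLETED_PARTIAL_MATCH,
    COMPARISON_ERROR
} DatetimePartParseResult;

/* Implicit style: '!characters_remaining' is true only for exactly 0,
 * so in this reduced helper a negative count is reported as success. */
static DatetimePartParseResult precheck_implicit(int characters_remaining) {
    if (!characters_remaining) {
        return COMPLETED_PARTIAL_MATCH;
    }
    return COMPARISON_SUCCESS;
}

/* Guarded style: reject negative counts first, then compare to 0 explicitly. */
static DatetimePartParseResult precheck_guarded(int characters_remaining) {
    if (characters_remaining < 0) {
        return COMPARISON_ERROR;
    }
    if (characters_remaining == 0) {
        return COMPLETED_PARTIAL_MATCH;
    }
    return COMPARISON_SUCCESS;
}
```

With an input of -1, the implicit helper returns `COMPARISON_SUCCESS` while the guarded helper returns `COMPARISON_ERROR`, which makes a bad count visible at the point where it first appears.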

WillAyd

lgtm on green

MarcoGorelli force-pushed the exact-inconsistencies branch 2 times, most recently from b366e31 to ee7f95e December 25, 2022 11:09

MarcoGorelli marked this pull request as ready for review December 25, 2022 12:15

fixup

3de2331

MarcoGorelli added the Bug and Datetime (Datetime data dtype) labels Dec 25, 2022

MarcoGorelli commented Dec 25, 2022

MarcoGorelli force-pushed the exact-inconsistencies branch from b0cd67c to 3de2331 December 25, 2022 17:04

MarcoGorelli requested review from WillAyd and mroeschke December 26, 2022 15:15

mroeschke approved these changes Dec 27, 2022

mroeschke added this to the 2.0 milestone Dec 27, 2022

Merge remote-tracking branch 'upstream/main' into exact-inconsistencies

947353d

WillAyd reviewed Dec 27, 2022

MarcoGorelli added 2 commits December 28, 2022 10:51

use enum

e3fe55b

Merge remote-tracking branch 'upstream/main' into exact-inconsistencies

9e18d33

MarcoGorelli commented Dec 28, 2022

WillAyd requested changes Dec 28, 2022

MarcoGorelli added 9 commits December 29, 2022 09:52

more descriptive names

efeaf7a

renaming fixup

6c51924

cast

b96158f

clean up

0ebff5c

doc

caa9c90

correct syntax

84eeb3d

Merge remote-tracking branch 'upstream/main' into exact-inconsistencies

f92ff7a

use typedef

5c67ed3

check for negative characters remaining

bad704e

WillAyd reviewed Dec 29, 2022

WillAyd requested changes Dec 29, 2022

WillAyd approved these changes Dec 29, 2022

MarcoGorelli added 2 commits December 30, 2022 14:43

Merge remote-tracking branch 'upstream/main' into exact-inconsistencies

ec6591b

reduce diff

8d8f90e

MarcoGorelli merged commit a28cadb into pandas-dev:main Dec 31, 2022
