-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #49333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 5 commits
45f82c3
e473adb
a6ea6d0
8fede1f
2e21e71
afc4d96
12721b0
0531967
a571753
e814a2e
19c34f8
70fb820
eb50dfb
ac61ac5
0dd7407
4d35ea7
3acfdf6
f3060c9
7310e13
b18ade7
3ceb1ee
bde5ef9
080f018
031e0e3
01e8bd1
c1e6bc2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -52,7 +52,8 @@ cdef extern from "src/datetime/np_datetime_strings.h": | |
int parse_iso_8601_datetime(const char *str, int len, int want_exc, | ||
npy_datetimestruct *out, | ||
NPY_DATETIMEUNIT *out_bestunit, | ||
int *out_local, int *out_tzoffset) | ||
int *out_local, int *out_tzoffset, | ||
const char *format, int format_len, int exact) | ||
|
||
|
||
# ---------------------------------------------------------------------- | ||
|
@@ -273,14 +274,20 @@ cdef inline int string_to_dts( | |
int* out_local, | ||
int* out_tzoffset, | ||
bint want_exc, | ||
str format, | ||
bint exact, | ||
) except? -1: | ||
cdef: | ||
Py_ssize_t length | ||
const char* buf | ||
Py_ssize_t format_length | ||
const char* format_buf | ||
|
||
buf = get_c_string_buf_and_size(val, &length) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Orthogonal to this but if you want to work more with the C codebase I think we should just replace |
||
format_buf = get_c_string_buf_and_size(format, &format_length) | ||
return parse_iso_8601_datetime(buf, length, want_exc, | ||
dts, out_bestunit, out_local, out_tzoffset) | ||
dts, out_bestunit, out_local, out_tzoffset, | ||
format_buf, format_length, exact) | ||
|
||
|
||
cpdef ndarray astype_overflowsafe( | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -66,10 +66,24 @@ This file implements string parsing and creation for NumPy datetime. | |
* | ||
* Returns 0 on success, -1 on failure. | ||
*/ | ||
|
||
#define FORMAT_STARTSWITH(ch) \ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you make this a function instead of a macro? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The macro modifies two variables and has access to the goto location. A separate function then has to have at least two variables accepted with one of them being a pointer to a pointer. It also has to return a flag that needs a check every single time it is called. We can remove this pointer to a pointer, however, and modify just an offset variable, but checks are unavoidable. Is that a preferred approach? What do you think about checks being potentially redirected to the goto location in a macro? Or we could refactor the code and remove the goto once and for all, and handle the error message in a macro. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can also just do one function that performs the check and in the caller run the goto in the failing branch while advancing the pointers in the succeeding branch. It is more verbose but also more explicit than the macro. |
||
/* Always error on character mismatch conditioned on non-exhausted format, \ | ||
or when format is exhausted in the exact case. */ \ | ||
if ((format_len && *format != ch) || (exact && !format_len)){ \ | ||
goto parse_error; \ | ||
} \ | ||
/* Advance if format is not exhausted */ \ | ||
if (format_len) { \ | ||
++format; \ | ||
--format_len; \ | ||
} \ | ||
|
||
int parse_iso_8601_datetime(const char *str, int len, int want_exc, | ||
npy_datetimestruct *out, | ||
NPY_DATETIMEUNIT *out_bestunit, | ||
int *out_local, int *out_tzoffset) { | ||
int *out_local, int *out_tzoffset, | ||
const char* format, int format_len, int exact) { | ||
WillAyd marked this conversation as resolved.
Show resolved
Hide resolved
|
||
int year_leap = 0; | ||
int i, numdigits; | ||
const char *substr; | ||
|
@@ -104,14 +118,19 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
while (sublen > 0 && isspace(*substr)) { | ||
++substr; | ||
--sublen; | ||
FORMAT_STARTSWITH(' '); | ||
} | ||
|
||
/* Leading '-' sign for negative year */ | ||
if (*substr == '-') { | ||
++substr; | ||
--sublen; | ||
FORMAT_STARTSWITH('-'); | ||
} | ||
|
||
FORMAT_STARTSWITH('%'); | ||
FORMAT_STARTSWITH('Y'); | ||
|
||
if (sublen == 0) { | ||
goto parse_error; | ||
} | ||
|
@@ -139,6 +158,9 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
if (out_local != NULL) { | ||
*out_local = 0; | ||
} | ||
if (format_len) { | ||
goto parse_error; | ||
} | ||
bestunit = NPY_FR_Y; | ||
goto finish; | ||
} | ||
|
@@ -156,6 +178,7 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
ymd_sep = valid_ymd_sep[i]; | ||
++substr; | ||
--sublen; | ||
FORMAT_STARTSWITH(ymd_sep); | ||
/* Cannot have trailing separator */ | ||
if (sublen == 0 || !isdigit(*substr)) { | ||
goto parse_error; | ||
|
@@ -167,6 +190,8 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
out->month = (*substr - '0'); | ||
++substr; | ||
--sublen; | ||
FORMAT_STARTSWITH('%'); | ||
FORMAT_STARTSWITH('m'); | ||
/* Second digit optional if there was a separator */ | ||
if (isdigit(*substr)) { | ||
out->month = 10 * out->month + (*substr - '0'); | ||
|
@@ -190,6 +215,9 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
if (!has_ymd_sep) { | ||
goto parse_error; | ||
} | ||
if (format_len) { | ||
goto parse_error; | ||
} | ||
if (out_local != NULL) { | ||
*out_local = 0; | ||
} | ||
|
@@ -203,6 +231,7 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
} | ||
++substr; | ||
--sublen; | ||
FORMAT_STARTSWITH(ymd_sep); | ||
} | ||
|
||
/* PARSE THE DAY */ | ||
|
@@ -213,6 +242,8 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
out->day = (*substr - '0'); | ||
++substr; | ||
--sublen; | ||
FORMAT_STARTSWITH('%'); | ||
FORMAT_STARTSWITH('d'); | ||
/* Second digit optional if there was a separator */ | ||
if (isdigit(*substr)) { | ||
out->day = 10 * out->day + (*substr - '0'); | ||
|
@@ -235,13 +266,17 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
if (out_local != NULL) { | ||
*out_local = 0; | ||
} | ||
if (format_len) { | ||
goto parse_error; | ||
} | ||
MarcoGorelli marked this conversation as resolved.
Show resolved
Hide resolved
|
||
bestunit = NPY_FR_D; | ||
goto finish; | ||
} | ||
|
||
if ((*substr != 'T' && *substr != ' ') || sublen == 1) { | ||
goto parse_error; | ||
} | ||
FORMAT_STARTSWITH(*substr); | ||
++substr; | ||
--sublen; | ||
|
||
|
@@ -250,6 +285,8 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
if (!isdigit(*substr)) { | ||
goto parse_error; | ||
} | ||
FORMAT_STARTSWITH('%'); | ||
FORMAT_STARTSWITH('H'); | ||
out->hour = (*substr - '0'); | ||
++substr; | ||
--sublen; | ||
|
@@ -274,6 +311,9 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
if (!hour_was_2_digits) { | ||
goto parse_error; | ||
} | ||
if (format_len) { | ||
goto parse_error; | ||
} | ||
bestunit = NPY_FR_h; | ||
goto finish; | ||
} | ||
|
@@ -286,6 +326,7 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
if (sublen == 0 || !isdigit(*substr)) { | ||
goto parse_error; | ||
} | ||
FORMAT_STARTSWITH(':'); | ||
} else if (!isdigit(*substr)) { | ||
if (!hour_was_2_digits) { | ||
goto parse_error; | ||
|
@@ -298,6 +339,8 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
out->min = (*substr - '0'); | ||
++substr; | ||
--sublen; | ||
FORMAT_STARTSWITH('%'); | ||
FORMAT_STARTSWITH('M'); | ||
/* Second digit optional if there was a separator */ | ||
if (isdigit(*substr)) { | ||
out->min = 10 * out->min + (*substr - '0'); | ||
|
@@ -317,12 +360,16 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
|
||
if (sublen == 0) { | ||
bestunit = NPY_FR_m; | ||
if (format_len) { | ||
goto parse_error; | ||
} | ||
goto finish; | ||
} | ||
|
||
/* If we make it through this condition block, then the next | ||
* character is a digit. */ | ||
if (has_hms_sep && *substr == ':') { | ||
FORMAT_STARTSWITH(':'); | ||
++substr; | ||
--sublen; | ||
/* Cannot have a trailing ':' */ | ||
|
@@ -339,6 +386,8 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
out->sec = (*substr - '0'); | ||
++substr; | ||
--sublen; | ||
FORMAT_STARTSWITH('%'); | ||
FORMAT_STARTSWITH('S'); | ||
/* Second digit optional if there was a separator */ | ||
if (isdigit(*substr)) { | ||
out->sec = 10 * out->sec + (*substr - '0'); | ||
|
@@ -360,12 +409,15 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
if (sublen > 0 && *substr == '.') { | ||
++substr; | ||
--sublen; | ||
FORMAT_STARTSWITH('.'); | ||
} else { | ||
bestunit = NPY_FR_s; | ||
goto parse_timezone; | ||
} | ||
|
||
/* PARSE THE MICROSECONDS (0 to 6 digits) */ | ||
FORMAT_STARTSWITH('%'); | ||
FORMAT_STARTSWITH('f'); | ||
numdigits = 0; | ||
for (i = 0; i < 6; ++i) { | ||
out->us *= 10; | ||
|
@@ -430,15 +482,22 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
while (sublen > 0 && isspace(*substr)) { | ||
++substr; | ||
--sublen; | ||
FORMAT_STARTSWITH(' '); | ||
} | ||
|
||
if (sublen == 0) { | ||
// Unlike NumPy, treating no time zone as naive | ||
if (format_len > 0) { | ||
goto parse_error; | ||
} | ||
goto finish; | ||
} | ||
|
||
/* UTC specifier */ | ||
if (*substr == 'Z') { | ||
FORMAT_STARTSWITH('%'); | ||
FORMAT_STARTSWITH('Z'); | ||
|
||
/* "Z" should be equivalent to tz offset "+00:00" */ | ||
if (out_local != NULL) { | ||
*out_local = 1; | ||
|
@@ -449,12 +508,17 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
} | ||
|
||
if (sublen == 1) { | ||
if (format_len > 0) { | ||
goto parse_error; | ||
} | ||
goto finish; | ||
} else { | ||
++substr; | ||
--sublen; | ||
} | ||
} else if (*substr == '-' || *substr == '+') { | ||
FORMAT_STARTSWITH('%'); | ||
FORMAT_STARTSWITH('z'); | ||
/* Time zone offset */ | ||
int offset_neg = 0, offset_hour = 0, offset_minute = 0; | ||
|
||
|
@@ -538,9 +602,10 @@ int parse_iso_8601_datetime(const char *str, int len, int want_exc, | |
while (sublen > 0 && isspace(*substr)) { | ||
++substr; | ||
--sublen; | ||
FORMAT_STARTSWITH(' '); | ||
} | ||
|
||
if (sublen != 0) { | ||
if ((sublen != 0) || (format_len != 0)) { | ||
goto parse_error; | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use None as a default rather than
""
? I think can just type as Optional[str] which would be a bit cleanerThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, could do that here, but then wouldn't it still need to be
str
when it's passed to the C function inpandas/pandas/_libs/tslibs/np_datetime.pyx
Lines 287 to 290 in 45f82c3
? If I understand correctly, at some point the conversion from
None
to""
would have to happen anywayThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If None can be interpreted as nullptr (can we just pass
0
?), I agree that having None is a cleaner option for something that is user-facing.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't user-facing, it's just used internally
Not sure I understand about None being interpreted as nullptr, but if I pass
None
toget_c_string_buf_and_size
I getSo I think the conversion between
None
andstr
needs to happen at some point internally anywayIf
exact
if False andformat
is''
, then it should match anythingThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about?
The rest of the code should work just fine given that the checks are conditioned on
format_lenght
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK yeah, looks good - I presume
format_buf
would have to be'\0'
rather than0
though?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as
format_len
is0
,*format_buf
could be anything, so it is a bit clearer to have the pointer set toNULL
, imho.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how though? because if it's
0
, then we getAnd with
None
:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see, it is a cast issue. Does Cython support casts to C types? If so, something like
format_buf = (const char*) 0
should do. Or, maybe, we can specify the type offormat_buf
in advance... Sorry, I am not an expert in Cython :)Maybe this will work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried
but then get
(with no further info)
For now I've gone with
format = b''
, and handlingNone
inside this function does indeed look cleaner, thanks!