Skip to content

BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #49333

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Nov 17, 2022
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
45f82c3
initial format support
nikitaved Oct 26, 2022
e473adb
set exact=False default in objects_to_datetime
Oct 28, 2022
a6ea6d0
:label: typing
Oct 28, 2022
8fede1f
simplify
Oct 28, 2022
2e21e71
Merge remote-tracking branch 'upstream/main' into pr/nikitaved-qssumm…
Oct 28, 2022
afc4d96
Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm…
Oct 30, 2022
12721b0
replace macro with function
Oct 30, 2022
0531967
clean up
Oct 30, 2022
a571753
:memo: restore docstring
Oct 30, 2022
e814a2e
inline
Oct 30, 2022
19c34f8
set format default to None
Oct 31, 2022
70fb820
Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm…
Oct 31, 2022
eb50dfb
clean up
Oct 31, 2022
ac61ac5
Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm…
Nov 6, 2022
0dd7407
remove function, perform check inline
Nov 6, 2022
4d35ea7
Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm…
Nov 7, 2022
3acfdf6
only compare *format++ if format_len
Nov 7, 2022
f3060c9
clean up
Nov 7, 2022
7310e13
typing
Nov 7, 2022
b18ade7
Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm…
Nov 9, 2022
3ceb1ee
split out branches
Nov 9, 2022
bde5ef9
Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm…
Nov 13, 2022
080f018
use compare_format function
Nov 13, 2022
031e0e3
remove tmp variable
Nov 13, 2022
01e8bd1
Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm…
Nov 17, 2022
c1e6bc2
Add co-authors
Nov 17, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 8 additions & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,14 @@ repos:
# this particular codebase (e.g. src/headers, src/klib). However,
# we can lint all header files since they aren't "generated" like C files are.
exclude: ^pandas/_libs/src/(klib|headers)/
args: [--quiet, '--extensions=c,h', '--headers=h', --recursive, '--filter=-readability/casting,-runtime/int,-build/include_subdir']
args: [
--quiet,
'--extensions=c,h',
'--headers=h',
--recursive,
'--filter=-readability/casting,-runtime/int,-build/include_subdir',
'--linelength=88'
]
- repo: https://github.com/PyCQA/flake8
rev: 5.0.4
hooks:
Expand Down
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -351,6 +351,7 @@ Conversion
- Bug in :meth:`Series.convert_dtypes` not converting dtype to nullable dtype when :class:`Series` contains ``NA`` and has dtype ``object`` (:issue:`48791`)
- Bug where any :class:`ExtensionDtype` subclass with ``kind="M"`` would be interpreted as a timezone type (:issue:`34986`)
- Bug in :class:`.arrays.ArrowExtensionArray` that would raise ``NotImplementedError`` when passed a sequence of strings or binary (:issue:`49172`)
- Bug in :func:`to_datetime` was not respecting ``exact`` argument when ``format`` was an ISO8601 format (:issue:`12649`)

Strings
^^^^^^^
Expand Down
2 changes: 2 additions & 0 deletions pandas/_libs/tslib.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,8 @@ def array_to_datetime(
utc: bool = ...,
require_iso8601: bool = ...,
allow_mixed: bool = ...,
format: str = ...,
exact: bool = ...,
) -> tuple[np.ndarray, tzinfo | None]: ...

# returned ndarray may be object dtype or datetime64[ns]
17 changes: 14 additions & 3 deletions pandas/_libs/tslib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -89,7 +89,7 @@ def _test_parse_iso8601(ts: str):
elif ts == 'today':
return Timestamp.now().normalize()

string_to_dts(ts, &obj.dts, &out_bestunit, &out_local, &out_tzoffset, True)
string_to_dts(ts, &obj.dts, &out_bestunit, &out_local, &out_tzoffset, True, "", False)
obj.value = npy_datetimestruct_to_datetime(NPY_FR_ns, &obj.dts)
check_dts_bounds(&obj.dts)
if out_local == 1:
Expand Down Expand Up @@ -445,6 +445,8 @@ cpdef array_to_datetime(
bint utc=False,
bint require_iso8601=False,
bint allow_mixed=False,
str format="",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use None as a default rather than ""? I think can just type as Optional[str] which would be a bit cleaner

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, could do that here, but then wouldn't it still need to be str when it's passed to the C function in

format_buf = get_c_string_buf_and_size(format, &format_length)
return parse_iso_8601_datetime(buf, length, want_exc,
dts, out_bestunit, out_local, out_tzoffset,
format_buf, format_length, exact)

? If I understand correctly, at some point the conversion from None to "" would have to happen anyway

Copy link
Contributor Author

@nikitaved nikitaved Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If None can be interpreted as nullptr (can we just pass 0?), I agree that having None is a cleaner option for something that is user-facing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't user-facing, it's just used internally

Not sure I understand about None being interpreted as nullptr, but if I pass None to get_c_string_buf_and_size I get

Traceback (most recent call last):
  File "t.py", line 109, in <module>
    print(to_datetime('2012-01-01T10:00', format='%Y-%m-%dT%H:%M', exact=False))
  File "/home/marcogorelli/pandas-dev/pandas/core/tools/datetimes.py", line 1117, in to_datetime
    result = convert_listlike(np.array([arg]), format)[0]
  File "/home/marcogorelli/pandas-dev/pandas/core/tools/datetimes.py", line 441, in _convert_listlike_datetimes
    result, tz_parsed = objects_to_datetime64ns(
  File "/home/marcogorelli/pandas-dev/pandas/core/arrays/datetimes.py", line 2177, in objects_to_datetime64ns
    result, tz_parsed = tslib.array_to_datetime(
  File "pandas/_libs/tslib.pyx", line 440, in pandas._libs.tslib.array_to_datetime
    cpdef array_to_datetime(
  File "pandas/_libs/tslib.pyx", line 706, in pandas._libs.tslib.array_to_datetime
    raise
  File "pandas/_libs/tslib.pyx", line 606, in pandas._libs.tslib.array_to_datetime
    string_to_dts_failed = string_to_dts(
  File "pandas/_libs/tslibs/np_datetime.pyx", line 288, in pandas._libs.tslibs.np_datetime.string_to_dts
    format_buf = get_c_string_buf_and_size(None, &format_length)
  File "pandas/_libs/tslibs/util.pxd", line 221, in pandas._libs.tslibs.util.get_c_string_buf_and_size
    return PyUnicode_AsUTF8AndSize(py_string, length)
TypeError: bad argument type for built-in operation

So I think the conversion between None and str needs to happen at some point internally anyway

If exact if False and format is '', then it should match anything

Copy link
Contributor Author

@nikitaved nikitaved Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about?

if format is None:
  # Make format_buf a nullptr
  format_buf = 0
  format_lenght = 0
else:
  format_buf = get_c_string_buf_and_size(..., &format_length)
...

return parse_iso...

The rest of the code should work just fine given that the checks are conditioned on format_lenght.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK yeah, looks good - I presume format_buf would have to be '\0' rather than 0 though?

Copy link
Contributor Author

@nikitaved nikitaved Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as format_len is 0, *format_buf could be anything, so it is a bit clearer to have the pointer set to NULL, imho.

Copy link
Member

@MarcoGorelli MarcoGorelli Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how though? because if it's 0, then we get

[1/1] Cythonizing pandas/_libs/tslibs/np_datetime.pyx

Error compiling Cython file:
------------------------------------------------------------
...

    buf = get_c_string_buf_and_size(val, &length)
    format_buf = get_c_string_buf_and_size(format, &format_length)
    return parse_iso_8601_datetime(buf, length, want_exc,
                                   dts, out_bestunit, out_local, out_tzoffset,
                                   0, 0, exact)
                                  ^
------------------------------------------------------------

pandas/_libs/tslibs/np_datetime.pyx:290:35: Cannot assign type 'long' to 'const char *'

And with None:

Error compiling Cython file:
------------------------------------------------------------
...

    buf = get_c_string_buf_and_size(val, &length)
    format_buf = get_c_string_buf_and_size(format, &format_length)
    return parse_iso_8601_datetime(buf, length, want_exc,
                                   dts, out_bestunit, out_local, out_tzoffset,
                                   None, 0, exact)
                                  ^
------------------------------------------------------------

pandas/_libs/tslibs/np_datetime.pyx:290:35: Cannot assign None to const char *

Copy link
Contributor Author

@nikitaved nikitaved Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, it is a cast issue. Does Cython support casts to C types? If so, something like format_buf = (const char*) 0 should do. Or, maybe, we can specify the type of format_buf in advance... Sorry, I am not an expert in Cython :)
Maybe this will work?

if format is None:
  format_buf: const char* = 0
  format_len = 0
else:
  format_buf = get_c_string_buf_and_size(val, &format_len)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried

format_buf = <const char*> 0

but then get

Segmentation fault

(with no further info)

For now I've gone with format = b'', and handling None inside this function does indeed look cleaner, thanks!

bint exact=False,
):
"""
Converts a 1D array of date-like values to a numpy array of either:
Expand Down Expand Up @@ -564,6 +566,15 @@ cpdef array_to_datetime(
iresult[i] = get_datetime64_nanos(val, NPY_FR_ns)

elif is_integer_object(val) or is_float_object(val):
if require_iso8601:
if is_coerce:
iresult[i] = NPY_NAT
continue
elif is_raise:
raise ValueError(
f"time data \"{val}\" at position {i} doesn't match format \"{format}\""
)
return values, tz_out
# these must be ns unit by-definition
seen_integer = True

Expand Down Expand Up @@ -594,7 +605,7 @@ cpdef array_to_datetime(

string_to_dts_failed = string_to_dts(
val, &dts, &out_bestunit, &out_local,
&out_tzoffset, False
&out_tzoffset, False, format, exact
)
if string_to_dts_failed:
# An error at this point is a _parsing_ error
Expand All @@ -609,7 +620,7 @@ cpdef array_to_datetime(
continue
elif is_raise:
raise ValueError(
f"time data \"{val}\" at position {i} doesn't match format specified"
f"time data \"{val}\" at position {i} doesn't match format \"{format}\""
)
return values, tz_out

Expand Down
2 changes: 1 addition & 1 deletion pandas/_libs/tslibs/conversion.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -488,7 +488,7 @@ cdef _TSObject _convert_str_to_tsobject(object ts, tzinfo tz, str unit,
else:
string_to_dts_failed = string_to_dts(
ts, &dts, &out_bestunit, &out_local,
&out_tzoffset, False
&out_tzoffset, False, "", False
)
if not string_to_dts_failed:
try:
Expand Down
2 changes: 2 additions & 0 deletions pandas/_libs/tslibs/np_datetime.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -95,6 +95,8 @@ cdef int string_to_dts(
int* out_local,
int* out_tzoffset,
bint want_exc,
str format,
bint exact
) except? -1

cdef NPY_DATETIMEUNIT get_unit_from_dtype(cnp.dtype dtype)
Expand Down
11 changes: 9 additions & 2 deletions pandas/_libs/tslibs/np_datetime.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,8 @@ cdef extern from "src/datetime/np_datetime_strings.h":
int parse_iso_8601_datetime(const char *str, int len, int want_exc,
npy_datetimestruct *out,
NPY_DATETIMEUNIT *out_bestunit,
int *out_local, int *out_tzoffset)
int *out_local, int *out_tzoffset,
const char *format, int format_len, int exact)


# ----------------------------------------------------------------------
Expand Down Expand Up @@ -273,14 +274,20 @@ cdef inline int string_to_dts(
int* out_local,
int* out_tzoffset,
bint want_exc,
str format,
bint exact,
) except? -1:
cdef:
Py_ssize_t length
const char* buf
Py_ssize_t format_length
const char* format_buf

buf = get_c_string_buf_and_size(val, &length)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orthogonal to this but if you want to work more with the C codebase I think we should just replace get_c_string_buf_and_size with PyUnicode_AsUTF8AndSize directly; the former might have served a purpose with the Py2/3 transition but is just an unnecessary layer at this point

format_buf = get_c_string_buf_and_size(format, &format_length)
return parse_iso_8601_datetime(buf, length, want_exc,
dts, out_bestunit, out_local, out_tzoffset)
dts, out_bestunit, out_local, out_tzoffset,
format_buf, format_length, exact)


cpdef ndarray astype_overflowsafe(
Expand Down
2 changes: 1 addition & 1 deletion pandas/_libs/tslibs/parsing.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -409,7 +409,7 @@ cdef parse_datetime_string_with_reso(
# TODO: does this render some/all of parse_delimited_date redundant?
string_to_dts_failed = string_to_dts(
date_string, &dts, &out_bestunit, &out_local,
&out_tzoffset, False
&out_tzoffset, False, "", False
)
if not string_to_dts_failed:
if dts.ps != 0 or out_local:
Expand Down
Loading