BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #49333

Merged
merged 26 commits into from
Nov 17, 2022
Changes from 21 commits
Commits
45f82c3  initial format support (nikitaved, Oct 26, 2022)
e473adb  set exact=False default in objects_to_datetime (Oct 28, 2022)
a6ea6d0  :label: typing (Oct 28, 2022)
8fede1f  simplify (Oct 28, 2022)
2e21e71  Merge remote-tracking branch 'upstream/main' into pr/nikitaved-qssumm… (Oct 28, 2022)
afc4d96  Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm… (Oct 30, 2022)
12721b0  replace macro with function (Oct 30, 2022)
0531967  clean up (Oct 30, 2022)
a571753  :memo: restore docstring (Oct 30, 2022)
e814a2e  inline (Oct 30, 2022)
19c34f8  set format default to None (Oct 31, 2022)
70fb820  Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm… (Oct 31, 2022)
eb50dfb  clean up (Oct 31, 2022)
ac61ac5  Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm… (Nov 6, 2022)
0dd7407  remove function, perform check inline (Nov 6, 2022)
4d35ea7  Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm… (Nov 7, 2022)
3acfdf6  only compare *format++ if format_len (Nov 7, 2022)
f3060c9  clean up (Nov 7, 2022)
7310e13  typing (Nov 7, 2022)
b18ade7  Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm… (Nov 9, 2022)
3ceb1ee  split out branches (Nov 9, 2022)
bde5ef9  Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm… (Nov 13, 2022)
080f018  use compare_format function (Nov 13, 2022)
031e0e3  remove tmp variable (Nov 13, 2022)
01e8bd1  Merge remote-tracking branch 'upstream/main' into pr/nikitaved/qssumm… (Nov 17, 2022)
c1e6bc2  Add co-authors (Nov 17, 2022)
8 changes: 7 additions & 1 deletion .pre-commit-config.yaml
@@ -48,7 +48,13 @@ repos:
     # this particular codebase (e.g. src/headers, src/klib). However,
     # we can lint all header files since they aren't "generated" like C files are.
     exclude: ^pandas/_libs/src/(klib|headers)/
-    args: [--quiet, '--extensions=c,h', '--headers=h', --recursive, '--filter=-readability/casting,-runtime/int,-build/include_subdir']
+    args: [
+      --quiet,
+      '--extensions=c,h',
+      '--headers=h',
+      --recursive,
+      '--filter=-readability/casting,-runtime/int,-build/include_subdir,-readability/fn_size'
Member

`parse_iso_8601_datetime` is now more than 500 non-comment lines, so I've turned off that check for now.

Member

Separately from this, I would also be fine with exploring a tool other than cpplint. I haven't seen many other projects use it, particularly for C rather than C++. Clang has formatting and static-analysis tools that would likely be more useful.

+    ]
 - repo: https://github.com/PyCQA/flake8
   rev: 5.0.4
   hooks:
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
@@ -576,6 +576,7 @@ Conversion
 - Bug in :meth:`Series.convert_dtypes` not converting dtype to nullable dtype when :class:`Series` contains ``NA`` and has dtype ``object`` (:issue:`48791`)
 - Bug where any :class:`ExtensionDtype` subclass with ``kind="M"`` would be interpreted as a timezone type (:issue:`34986`)
 - Bug in :class:`.arrays.ArrowExtensionArray` that would raise ``NotImplementedError`` when passed a sequence of strings or binary (:issue:`49172`)
+- Bug in :func:`to_datetime` was not respecting ``exact`` argument when ``format`` was an ISO8601 format (:issue:`12649`)

 Strings
 ^^^^^^^
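
For readers unfamiliar with what "respecting ``exact``" means, here is a minimal sketch of strict format matching. It uses only the standard library's `datetime.strptime` as a stand-in, which is an assumption made for illustration: it is not the ISO8601 code path this PR changes, and `matches_exactly` is a hypothetical helper name. The second call shows the kind of mismatch the PR title describes: a date-only format applied to a string that also carries a time.

```python
from datetime import datetime


def matches_exactly(value: str, fmt: str) -> bool:
    """Return True only if `fmt` accounts for the whole of `value`.

    Hypothetical helper for illustration; strptime rejects any
    unconverted trailing data, which is the strict behavior that
    an exact match is expected to have.
    """
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


print(matches_exactly("2012-01-01", "%Y-%m-%d"))        # True
print(matches_exactly("2012-01-01 10:00", "%Y-%m-%d"))  # False: " 10:00" is left over
```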
2 changes: 2 additions & 0 deletions pandas/_libs/tslib.pyi
@@ -25,6 +25,8 @@ def array_to_datetime(
     utc: bool = ...,
     require_iso8601: bool = ...,
     allow_mixed: bool = ...,
+    format: str | None = ...,
+    exact: bool = ...,
 ) -> tuple[np.ndarray, tzinfo | None]: ...

 # returned ndarray may be object dtype or datetime64[ns]
16 changes: 14 additions & 2 deletions pandas/_libs/tslib.pyx
@@ -447,6 +447,8 @@ cpdef array_to_datetime(
     bint utc=False,
     bint require_iso8601=False,
     bint allow_mixed=False,
+    format: str | None=None,
+    bint exact=True,
 ):
     """
     Converts a 1D array of date-like values to a numpy array of either:
@@ -566,6 +568,16 @@ cpdef array_to_datetime(
     iresult[i] = get_datetime64_nanos(val, NPY_FR_ns)

 elif is_integer_object(val) or is_float_object(val):
+    if require_iso8601:
+        if is_coerce:
+            iresult[i] = NPY_NAT
+            continue
+        elif is_raise:
+            raise ValueError(
+                f"time data \"{val}\" at position {i} doesn't "
+                f"match format \"{format}\""
+            )
+        return values, tz_out
     # these must be ns unit by-definition
     seen_integer = True

@@ -596,7 +608,7 @@ cpdef array_to_datetime(

 string_to_dts_failed = string_to_dts(
     val, &dts, &out_bestunit, &out_local,
-    &out_tzoffset, False
+    &out_tzoffset, False, format, exact
 )
 if string_to_dts_failed:
     # An error at this point is a _parsing_ error
@@ -612,7 +624,7 @@ cpdef array_to_datetime(
 elif is_raise:
     raise ValueError(
         f"time data \"{val}\" at position {i} doesn't "
-        "match format specified"
+        f"match format \"{format}\""
     )
 return values, tz_out
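
The three-way error handling in the hunks above (coerce to `NaT`, raise with the new message, or hand the input back) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Cython implementation: `parse_with_format` and its `errors` parameter are hypothetical names, `None` stands in for `NaT`, and `datetime.strptime` stands in for the ISO8601 parser. Only the message text is taken from the diff.

```python
from datetime import datetime


def parse_with_format(values, fmt, errors="raise"):
    """Parse each value with `fmt`, handling failures per `errors`.

    Hypothetical helper. errors="coerce" mirrors the `is_coerce` branch,
    errors="raise" mirrors `is_raise`, and anything else mirrors the
    fall-through `return values, tz_out`.
    """
    parsed = []
    for i, val in enumerate(values):
        try:
            if not isinstance(val, str):
                # Like the integer/float branch: non-strings cannot match a format.
                raise ValueError("non-string input")
            parsed.append(datetime.strptime(val, fmt))
        except ValueError:
            if errors == "coerce":
                parsed.append(None)  # stand-in for NaT
                continue
            if errors == "raise":
                raise ValueError(
                    f'time data "{val}" at position {i} doesn\'t '
                    f'match format "{fmt}"'
                ) from None
            return list(values)  # give the input back unchanged
    return parsed


print(parse_with_format(["2020-01-01", 5], "%Y-%m-%d", errors="coerce"))
# [datetime.datetime(2020, 1, 1, 0, 0), None]
```

Including the format in the message, as the diff now does, tells the caller which format the value at the reported position failed against.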

2 changes: 2 additions & 0 deletions pandas/_libs/tslibs/np_datetime.pxd
@@ -95,6 +95,8 @@ cdef int string_to_dts(
     int* out_local,
     int* out_tzoffset,
     bint want_exc,
+    format: str | None = *,
+    bint exact = *
 ) except? -1

 cdef NPY_DATETIMEUNIT get_unit_from_dtype(cnp.dtype dtype)
16 changes: 14 additions & 2 deletions pandas/_libs/tslibs/np_datetime.pyx
@@ -52,7 +52,8 @@ cdef extern from "src/datetime/np_datetime_strings.h":
     int parse_iso_8601_datetime(const char *str, int len, int want_exc,
                                 npy_datetimestruct *out,
                                 NPY_DATETIMEUNIT *out_bestunit,
-                                int *out_local, int *out_tzoffset)
+                                int *out_local, int *out_tzoffset,
+                                const char *format, int format_len, int exact)


 # ----------------------------------------------------------------------
@@ -277,14 +278,25 @@ cdef inline int string_to_dts(
     int* out_local,
     int* out_tzoffset,
     bint want_exc,
+    format: str | None=None,
+    bint exact=True,
 ) except? -1:
     cdef:
         Py_ssize_t length
         const char* buf
+        Py_ssize_t format_length
+        const char* format_buf

     buf = get_c_string_buf_and_size(val, &length)
Member

Orthogonal to this, but if you want to work more with the C codebase, I think we should just replace `get_c_string_buf_and_size` with `PyUnicode_AsUTF8AndSize` directly. The former might have served a purpose during the Py2/3 transition, but it is just an unnecessary layer at this point.


+    if format is None:
+        format_buf = b''
+        format_length = 0
+        exact = False
+    else:
+        format_buf = get_c_string_buf_and_size(format, &format_length)
     return parse_iso_8601_datetime(buf, length, want_exc,
-                                   dts, out_bestunit, out_local, out_tzoffset)
+                                   dts, out_bestunit, out_local, out_tzoffset,
+                                   format_buf, format_length, exact)


 cpdef ndarray astype_overflowsafe(
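
The `format is None` branch in `string_to_dts` normalizes its arguments before calling into C: with no format there is nothing to compare against, so an empty buffer is passed and `exact` is forced off. A rough pure-Python sketch of that normalization follows. `prepare_format_args` is a hypothetical name, and `str.encode("utf-8")` is assumed here as a stand-in for `get_c_string_buf_and_size`.

```python
def prepare_format_args(fmt, exact=True):
    """Return (format_buf, format_length, exact) as passed on to the parser.

    Hypothetical helper mirroring the branch in the diff: a missing
    format becomes an empty buffer and disables exact matching.
    """
    if fmt is None:
        return b"", 0, False
    format_buf = fmt.encode("utf-8")  # stand-in for the C string buffer
    return format_buf, len(format_buf), exact


print(prepare_format_args(None))        # (b'', 0, False)
print(prepare_format_args("%Y-%m-%d"))  # (b'%Y-%m-%d', 8, True)
```

Forcing `exact` to `False` when `format` is `None` keeps callers that pass no format on the lenient behavior, even though the declared default is `exact=True`.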