Commit d1095bc

Datetime parsing (PDEP-4): allow mixture of ISO formatted strings (#50939)
* allow format iso8601
* fixup tests
* 🏷️ typing
* remove duplicate code
* improve message, use if-statement
* note that exact has no effect if format=iso8601
* point to format=ISO8601 in error message
* allow format="mixed"
* link to iso wiki page
* minor fixups
* double backticks -> single, suggest passing format
* use format=mixed instead of apply in example;

---------

Co-authored-by: MarcoGorelli <>
Co-authored-by: Matthew Roeschke <[email protected]>
1 parent f3d4113 commit d1095bc

7 files changed, +186 -70 lines changed


doc/source/user_guide/io.rst

+12-3
Original file line number | Diff line number | Diff line change
@@ -1001,14 +1001,23 @@ way to parse dates is to explicitly set ``format=``.
10011001
)
10021002
df
10031003
1004-
In the case that you have mixed datetime formats within the same column, you'll need to
1005-
first read it in as an object dtype and then apply :func:`to_datetime` to each element.
1004+
In the case that you have mixed datetime formats within the same column, you can
1005+
pass ``format='mixed'``
10061006

10071007
.. ipython:: python
10081008
10091009
data = io.StringIO("date\n12 Jan 2000\n2000-01-13\n")
10101010
df = pd.read_csv(data)
1011-
df['date'] = df['date'].apply(pd.to_datetime)
1011+
df['date'] = pd.to_datetime(df['date'], format='mixed')
1012+
df
1013+
1014+
or, if your datetime formats are all ISO8601 (possibly not identically-formatted):
1015+
1016+
.. ipython:: python
1017+
1018+
data = io.StringIO("date\n2020-01-01\n2020-01-01 03:00\n")
1019+
df = pd.read_csv(data)
1020+
df['date'] = pd.to_datetime(df['date'], format='ISO8601')
10121021
df
10131022
10141023
.. ipython:: python

doc/source/whatsnew/v2.0.0.rst

+10-3
@@ -311,6 +311,8 @@ Other enhancements
311311
- Added :meth:`DatetimeIndex.as_unit` and :meth:`TimedeltaIndex.as_unit` to convert to different resolutions; supported resolutions are "s", "ms", "us", and "ns" (:issue:`50616`)
312312
- Added :meth:`Series.dt.unit` and :meth:`Series.dt.as_unit` to convert to different resolutions; supported resolutions are "s", "ms", "us", and "ns" (:issue:`51223`)
313313
- Added new argument ``dtype`` to :func:`read_sql` to be consistent with :func:`read_sql_query` (:issue:`50797`)
314+
- :func:`to_datetime` now accepts ``"ISO8601"`` as an argument to ``format``, which will match any ISO8601 string (but possibly not identically-formatted) (:issue:`50411`)
315+
- :func:`to_datetime` now accepts ``"mixed"`` as an argument to ``format``, which will infer the format for each element individually (:issue:`50972`)
314316
- Added new argument ``engine`` to :func:`read_json` to support parsing JSON with pyarrow by specifying ``engine="pyarrow"`` (:issue:`48893`)
315317
- Added support for SQLAlchemy 2.0 (:issue:`40686`)
316318
- :class:`Index` set operations :meth:`Index.union`, :meth:`Index.intersection`, :meth:`Index.difference`, and :meth:`Index.symmetric_difference` now support ``sort=True``, which will always return a sorted result, unlike the default ``sort=None`` which does not sort in some cases (:issue:`25151`)
@@ -738,11 +740,16 @@ In the past, :func:`to_datetime` guessed the format for each element independent
738740
739741
Note that this affects :func:`read_csv` as well.
740742

741-
If you still need to parse dates with inconsistent formats, you'll need to apply :func:`to_datetime`
742-
to each element individually, e.g. ::
743+
If you still need to parse dates with inconsistent formats, you can use
744+
``format='mixed'`` (possibly alongside ``dayfirst``) ::
743745

744746
ser = pd.Series(['13-01-2000', '12 January 2000'])
745-
ser.apply(pd.to_datetime)
747+
pd.to_datetime(ser, format='mixed', dayfirst=True)
748+
749+
or, if your formats are all ISO8601 (but possibly not identically-formatted) ::
750+
751+
ser = pd.Series(['2020-01-01', '2020-01-01 03:00'])
752+
pd.to_datetime(ser, format='ISO8601')
746753

747754
.. _whatsnew_200.api_breaking.other:
748755
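
Usage note on the two release-note entries above: the following is a minimal sketch, assuming pandas 2.0 or later is installed. The variable names (`iso`, `mixed`, `iso_rejected`) are illustrative and do not come from the commit.

```python
import pandas as pd

# ISO8601 strings that are not identically formatted parse in one call.
iso = pd.to_datetime(
    pd.Series(["2020-01-01", "2020-01-01 03:00"]), format="ISO8601"
)

# Inconsistent, non-ISO formats: infer per element, preferring day-first.
mixed = pd.to_datetime(
    pd.Series(["13-01-2000", "12 January 2000"]), format="mixed", dayfirst=True
)

# A non-ISO string under format="ISO8601" raises instead of being guessed.
try:
    pd.to_datetime(["2020-01-01", "12 Jan 2000"], format="ISO8601")
    iso_rejected = False
except ValueError:
    iso_rejected = True
```

`format="ISO8601"` is the stricter of the two options, so it is probably the safer choice whenever every value is known to be some ISO8601 variant.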

pandas/_libs/tslibs/strptime.pyx

+44-26
@@ -186,6 +186,7 @@ def array_strptime(
186186
bint iso_format = format_is_iso(fmt)
187187
NPY_DATETIMEUNIT out_bestunit
188188
int out_local = 0, out_tzoffset = 0
189+
bint string_to_dts_succeeded = 0
189190

190191
assert is_raise or is_ignore or is_coerce
191192

@@ -306,53 +307,62 @@ def array_strptime(
306307
else:
307308
val = str(val)
308309

309-
if iso_format:
310-
string_to_dts_failed = string_to_dts(
310+
if fmt == "ISO8601":
311+
string_to_dts_succeeded = not string_to_dts(
312+
val, &dts, &out_bestunit, &out_local,
313+
&out_tzoffset, False, None, False
314+
)
315+
elif iso_format:
316+
string_to_dts_succeeded = not string_to_dts(
311317
val, &dts, &out_bestunit, &out_local,
312318
&out_tzoffset, False, fmt, exact
313319
)
314-
if not string_to_dts_failed:
315-
# No error reported by string_to_dts, pick back up
316-
# where we left off
317-
value = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts)
318-
if out_local == 1:
319-
# Store the out_tzoffset in seconds
320-
# since we store the total_seconds of
321-
# dateutil.tz.tzoffset objects
322-
tz = timezone(timedelta(minutes=out_tzoffset))
323-
result_timezone[i] = tz
324-
out_local = 0
325-
out_tzoffset = 0
326-
iresult[i] = value
327-
check_dts_bounds(&dts)
328-
continue
320+
if string_to_dts_succeeded:
321+
# No error reported by string_to_dts, pick back up
322+
# where we left off
323+
value = npy_datetimestruct_to_datetime(NPY_FR_ns, &dts)
324+
if out_local == 1:
325+
# Store the out_tzoffset in seconds
326+
# since we store the total_seconds of
327+
# dateutil.tz.tzoffset objects
328+
tz = timezone(timedelta(minutes=out_tzoffset))
329+
result_timezone[i] = tz
330+
out_local = 0
331+
out_tzoffset = 0
332+
iresult[i] = value
333+
check_dts_bounds(&dts)
334+
continue
329335

330336
if parse_today_now(val, &iresult[i], utc):
331337
continue
332338

333339
# Some ISO formats can't be parsed by string_to_dts
334-
# For example, 6-digit YYYYMD. So, if there's an error,
335-
# try the string-matching code below.
340+
# For example, 6-digit YYYYMD. So, if there's an error, and a format
341+
# was specified, then try the string-matching code below. If the format
342+
# specified was 'ISO8601', then we need to error, because
343+
# only string_to_dts handles mixed ISO8601 formats.
344+
if not string_to_dts_succeeded and fmt == "ISO8601":
345+
raise ValueError(f"Time data {val} is not ISO8601 format")
336346

337347
# exact matching
338348
if exact:
339349
found = format_regex.match(val)
340350
if not found:
341-
raise ValueError(f"time data \"{val}\" doesn't "
342-
f"match format \"{fmt}\"")
351+
raise ValueError(
352+
f"time data \"{val}\" doesn't match format \"{fmt}\""
353+
)
343354
if len(val) != found.end():
344355
raise ValueError(
345-
f"unconverted data remains: "
346-
f'"{val[found.end():]}"'
356+
"unconverted data remains when parsing with "
357+
f"format \"{fmt}\": \"{val[found.end():]}\""
347358
)
348359

349360
# search
350361
else:
351362
found = format_regex.search(val)
352363
if not found:
353364
raise ValueError(
354-
f"time data \"{val}\" doesn't match "
355-
f"format \"{fmt}\""
365+
f"time data \"{val}\" doesn't match format \"{fmt}\""
356366
)
357367

358368
iso_year = -1
@@ -504,7 +514,15 @@ def array_strptime(
504514
result_timezone[i] = tz
505515

506516
except (ValueError, OutOfBoundsDatetime) as ex:
507-
ex.args = (f"{str(ex)}, at position {i}",)
517+
ex.args = (
518+
f"{str(ex)}, at position {i}. You might want to try:\n"
519+
" - passing `format` if your strings have a consistent format;\n"
520+
" - passing `format='ISO8601'` if your strings are "
521+
"all ISO8601 but not necessarily in exactly the same format;\n"
522+
" - passing `format='mixed'`, and the format will be "
523+
"inferred for each element individually. "
524+
"You might want to use `dayfirst` alongside this.",
525+
)
508526
if is_coerce:
509527
iresult[i] = NPY_NAT
510528
continue
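
Reading note on the control flow in this file's diff: the pure-Python sketch below mirrors the new branch order (try the ISO parser, raise if `fmt == "ISO8601"` and it failed, otherwise fall back to format matching). It is not the Cython implementation. `parse_iso` and `parse_one` are hypothetical names; `datetime.fromisoformat` stands in for `string_to_dts`, `datetime.strptime` stands in for the regex-matching code, and the sketch ignores `fmt`/`exact` inside the ISO branch.

```python
from datetime import datetime


def parse_iso(val):
    """Hypothetical stand-in for string_to_dts: a datetime, or None on failure."""
    try:
        return datetime.fromisoformat(val)
    except ValueError:
        return None


def parse_one(val, fmt, iso_format):
    """Parse a single string following the branch order in the diff."""
    parsed = None
    if fmt == "ISO8601" or iso_format:
        # Both branches try the ISO parser first.
        parsed = parse_iso(val)
    if parsed is not None:
        return parsed
    if fmt == "ISO8601":
        # Only the ISO parser handles mixed ISO8601 layouts, so there is
        # no fallback to try for this format.
        raise ValueError(f"Time data {val} is not ISO8601 format")
    # Fallback: explicit format matching (strptime used here as a stand-in).
    return datetime.strptime(val, fmt)


same_day = parse_one("2020-01-01", "ISO8601", False)
with_time = parse_one("2020-01-01 03:00", "ISO8601", False)
fallback = parse_one("13/01/2000", "%d/%m/%Y", False)
```

The design point the sketch tries to show is that `"ISO8601"` is a sentinel value, not a strptime pattern, so it must never reach the format-matching code.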

pandas/core/tools/datetimes.py

+13-5
@@ -445,7 +445,8 @@ def _convert_listlike_datetimes(
445445
if format is None:
446446
format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)
447447

448-
if format is not None:
448+
# `format` could be inferred, or user didn't ask for mixed-format parsing.
449+
if format is not None and format != "mixed":
449450
return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
450451

451452
result, tz_parsed = objects_to_datetime64ns(
@@ -687,7 +688,7 @@ def to_datetime(
687688
yearfirst: bool = False,
688689
utc: bool = False,
689690
format: str | None = None,
690-
exact: bool = True,
691+
exact: bool | lib.NoDefault = lib.no_default,
691692
unit: str | None = None,
692693
infer_datetime_format: lib.NoDefault | bool = lib.no_default,
693694
origin: str = "unix",
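
Reading note on the signature change in this hunk: switching the default of `exact` from `True` to a sentinel lets the function tell "argument not passed" apart from "argument passed as `True`", which is what the `ValueError` check later in this file relies on. A minimal sketch of that pattern, where `_NoDefault`, `no_default`, and `resolve_exact` are hypothetical stand-ins (not pandas' `lib.no_default`), and resolving the sentinel to `True` is an assumption based on the documented default:

```python
class _NoDefault:
    """Hypothetical sentinel type standing in for lib.NoDefault."""

    def __repr__(self):
        return "<no_default>"


no_default = _NoDefault()


def resolve_exact(format, exact=no_default):
    """Reject an explicit `exact` for sentinel formats, else apply the default."""
    if exact is not no_default and format in {"mixed", "ISO8601"}:
        raise ValueError("Cannot use 'exact' when 'format' is 'mixed' or 'ISO8601'")
    if exact is no_default:
        # Assumption: fall back to the documented default of True.
        return True
    return exact


implicit = resolve_exact("mixed")            # not passed: allowed
explicit = resolve_exact("%Y-%m-%d", False)  # passed with a normal format: allowed
```

Note that under this check even `exact=True` is rejected alongside `format='mixed'` or `format='ISO8601'`, because any explicit value differs from the sentinel.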
@@ -717,9 +718,7 @@ def to_datetime(
717718
.. warning::
718719
719720
``dayfirst=True`` is not strict, but will prefer to parse
720-
with day first. If a delimited date string cannot be parsed in
721-
accordance with the given `dayfirst` option, e.g.
722-
``to_datetime(['31-12-2021'])``, then a warning will be shown.
721+
with day first.
723722
724723
yearfirst : bool, default False
725724
Specify a date parse order if `arg` is str or is list-like.
@@ -759,13 +758,20 @@ def to_datetime(
759758
<https://docs.python.org/3/library/datetime.html
760759
#strftime-and-strptime-behavior>`_ for more information on choices, though
761760
note that :const:`"%f"` will parse all the way up to nanoseconds.
761+
You can also pass:
762+
763+
- "ISO8601", to parse any `ISO8601 <https://en.wikipedia.org/wiki/ISO_8601>`_
764+
time string (not necessarily in exactly the same format);
765+
- "mixed", to infer the format for each element individually. This is risky,
766+
and you should probably use it along with `dayfirst`.
762767
exact : bool, default True
763768
Control how `format` is used:
764769
765770
- If :const:`True`, require an exact `format` match.
766771
- If :const:`False`, allow the `format` to match anywhere in the target
767772
string.
768773
774+
Cannot be used alongside ``format='ISO8601'`` or ``format='mixed'``.
769775
unit : str, default 'ns'
770776
The unit of the arg (D,s,ms,us,ns) denote the unit, which is an
771777
integer or float number. This will be based off the origin.
@@ -997,6 +1003,8 @@ def to_datetime(
9971003
DatetimeIndex(['2018-10-26 12:00:00+00:00', '2020-01-01 18:00:00+00:00'],
9981004
dtype='datetime64[ns, UTC]', freq=None)
9991005
"""
1006+
if exact is not lib.no_default and format in {"mixed", "ISO8601"}:
1007+
raise ValueError("Cannot use 'exact' when 'format' is 'mixed' or 'ISO8601'")
10001008
if infer_datetime_format is not lib.no_default:
10011009
warnings.warn(
10021010
"The argument 'infer_datetime_format' is deprecated and will "

pandas/tests/io/parser/test_parse_dates.py

+2-1
@@ -1721,7 +1721,8 @@ def test_parse_multiple_delimited_dates_with_swap_warnings():
17211721
with pytest.raises(
17221722
ValueError,
17231723
match=(
1724-
r'^time data "31/05/2000" doesn\'t match format "%m/%d/%Y", at position 1$'
1724+
r'^time data "31/05/2000" doesn\'t match format "%m/%d/%Y", '
1725+
r"at position 1. You might want to try:"
17251726
),
17261727
):
17271728
pd.to_datetime(["01/01/2000", "31/05/2000", "31/05/2001", "01/02/2000"])
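
Reading note on the test change above: the old pattern ended in `$`, so it can no longer match once hints are appended after the position. The sketch below rebuilds a message in the shape produced by the `except` block in `strptime.pyx` and checks both patterns with `re.search`, which is what `pytest.raises(match=...)` uses. `add_position_hint` is a hypothetical helper, and the hint text is copied from the diff as displayed, so its exact whitespace is an assumption.

```python
import re

HINT = (
    "You might want to try:\n"
    " - passing `format` if your strings have a consistent format;\n"
    " - passing `format='ISO8601'` if your strings are "
    "all ISO8601 but not necessarily in exactly the same format;\n"
    " - passing `format='mixed'`, and the format will be "
    "inferred for each element individually. "
    "You might want to use `dayfirst` alongside this."
)


def add_position_hint(ex, i):
    """Hypothetical helper: rewrite ex.args the way the except block does."""
    ex.args = (f"{str(ex)}, at position {i}. {HINT}",)
    return ex


err = add_position_hint(
    ValueError('time data "31/05/2000" doesn\'t match format "%m/%d/%Y"'), 1
)

new_pattern = (
    r'^time data "31/05/2000" doesn\'t match format "%m/%d/%Y", '
    r"at position 1. You might want to try:"
)
old_pattern = (
    r'^time data "31/05/2000" doesn\'t match format "%m/%d/%Y", at position 1$'
)

new_matches = re.search(new_pattern, str(err)) is not None
old_matches = re.search(old_pattern, str(err)) is not None
```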
