Commit 1d5f05c

PDEP0004: implementation (#49024)
* 🗑️ deprecate infer_datetime_format, make strict
* 🚨 add warning about dayfirst
* ✅ add/update tests
* 🚨 add warning if format cant be guessed
* 🥅 catch warnings
* 📝 update docs
* 📝 add example of reading csv file with mixed formats
* 🗑️ removed now outdated tests / clean inputs
* 📝 clarify whatsnew and user-guide
* 🎨
* guess %Y-%m format
* Detect format from first non-na, but also exclude now and today
* ✅ fixup tests based on now and today parsing
* fixup after merge
* fixup after merge
* fixup test
* remove outdated doctest
* xfail test based on issue 49767
* wip
* add back examples of formats which can be guessed
* start fixing up
* fixups from reviews
* lint
* put tests back
* shorten diff
* add example of string which cannot be guessed
* add deprecated directive, construct expected explicitly, explicit UserWarning, reword row-wise and column-wise
* remove redundant example
* restore newline
* double backticks around False, explicitly raise UserWarning
* reword warning
* test both dayfirst True and False
* postmerge fixup
* unimportant typo to restart CI

Co-authored-by: MarcoGorelli <>
1 parent cb223b3 commit 1d5f05c

25 files changed: +598 -544 lines changed
doc/source/user_guide/basics.rst (+2)

@@ -2312,6 +2312,7 @@ useful if you are reading in data which is mostly of the desired dtype (e.g. num
 non-conforming elements intermixed that you want to represent as missing:

 .. ipython:: python
+   :okwarning:

    import datetime

@@ -2328,6 +2329,7 @@ The ``errors`` parameter has a third option of ``errors='ignore'``, which will s
 encounters any errors with the conversion to a desired data type:

 .. ipython:: python
+   :okwarning:

    import datetime


doc/source/user_guide/io.rst (+19 -14)

@@ -968,17 +968,7 @@ To parse the mixed-timezone values as a datetime column, pass a partially-applie
 Inferring datetime format
 +++++++++++++++++++++++++

-If you have ``parse_dates`` enabled for some or all of your columns, and your
-datetime strings are all formatted the same way, you may get a large speed
-up by setting ``infer_datetime_format=True``. If set, pandas will attempt
-to guess the format of your datetime strings, and then use a faster means
-of parsing the strings. 5-10x parsing speeds have been observed. pandas
-will fallback to the usual parsing if either the format cannot be guessed
-or the format that was guessed cannot properly parse the entire column
-of strings. So in general, ``infer_datetime_format`` should not have any
-negative consequences if enabled.
-
-Here are some examples of datetime strings that can be guessed (All
+Here are some examples of datetime strings that can be guessed (all
 representing December 30th, 2011 at 00:00:00):

 * "20111230"
@@ -988,21 +978,36 @@ representing December 30th, 2011 at 00:00:00):
 * "30/Dec/2011 00:00:00"
 * "30/December/2011 00:00:00"

-Note that ``infer_datetime_format`` is sensitive to ``dayfirst``. With
+Note that format inference is sensitive to ``dayfirst``. With
 ``dayfirst=True``, it will guess "01/12/2011" to be December 1st. With
 ``dayfirst=False`` (default) it will guess "01/12/2011" to be January 12th.

+If you try to parse a column of date strings, pandas will attempt to guess the format
+from the first non-NaN element, and will then parse the rest of the column with that
+format. If pandas fails to guess the format (for example if your first string is
+``'01 December US/Pacific 2000'``), then a warning will be raised and each
+row will be parsed individually by ``dateutil.parser.parse``. The safest
+way to parse dates is to explicitly set ``format=``.
+
 .. ipython:: python

-   # Try to infer the format for the index column
    df = pd.read_csv(
        "foo.csv",
        index_col=0,
        parse_dates=True,
-       infer_datetime_format=True,
    )
    df

+In the case that you have mixed datetime formats within the same column, you'll need to
+first read it in as an object dtype and then apply :func:`to_datetime` to each element.
+
+.. ipython:: python
+
+   data = io.StringIO("date\n12 Jan 2000\n2000-01-13\n")
+   df = pd.read_csv(data)
+   df['date'] = df['date'].apply(pd.to_datetime)
+   df
+
 .. ipython:: python
    :suppress:
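
The column-wise versus element-wise distinction described in the io.rst hunk above can be sketched with the standard library alone. This is only an illustration, not pandas' implementation: ``CANDIDATE_FORMATS``, ``guess_format``, ``parse_column`` and ``parse_elementwise`` are hypothetical names, the candidate list is deliberately tiny, and the sketch raises where pandas would warn and fall back to ``dateutil``.

```python
from datetime import datetime

# Hypothetical, deliberately small candidate list (an assumption for this
# sketch); the real format guesser in pandas recognises many more layouts.
CANDIDATE_FORMATS = ["%d-%m-%Y", "%Y-%m-%d", "%d %b %Y", "%d %B %Y"]


def guess_format(value):
    """Return the first candidate format that parses `value`, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        return fmt
    return None


def parse_column(values):
    """Guess the format from the first non-missing value, then apply it to all."""
    first = next((v for v in values if v is not None), None)
    fmt = guess_format(first) if first is not None else None
    if fmt is None:
        # Simplification: this sketch raises instead of warning and
        # falling back to per-row parsing.
        raise ValueError("could not guess a format from the first value")
    return [None if v is None else datetime.strptime(v, fmt) for v in values]


def parse_elementwise(values):
    """Guess a format separately for each element (for mixed-format columns)."""
    parsed = []
    for v in values:
        if v is None:
            parsed.append(None)
            continue
        fmt = guess_format(v)
        if fmt is None:
            raise ValueError(f"could not guess a format for {v!r}")
        parsed.append(datetime.strptime(v, fmt))
    return parsed


# One format, taken from the first non-missing value, is used for the column.
consistent = parse_column([None, "13-01-2000", "12-01-2000"])

# Mixed formats only parse when each element is handled on its own.
mixed = parse_elementwise(["12 Jan 2000", "2000-01-13"])
```

In this sketch, calling ``parse_column`` on the mixed list raises ``ValueError`` because the second string does not match the format guessed from the first, which is why the per-element route is needed for mixed columns.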
10081013

doc/source/user_guide/timeseries.rst (+7 -6)

@@ -132,6 +132,8 @@ time.

 .. ipython:: python

+   import datetime
+
    pd.Timestamp(datetime.datetime(2012, 5, 1))
    pd.Timestamp("2012-05-01")
    pd.Timestamp(2012, 5, 1)
@@ -196,26 +198,25 @@ is converted to a ``DatetimeIndex``:

 .. ipython:: python

-   pd.to_datetime(pd.Series(["Jul 31, 2009", "2010-01-10", None]))
+   pd.to_datetime(pd.Series(["Jul 31, 2009", "Jan 10, 2010", None]))

-   pd.to_datetime(["2005/11/23", "2010.12.31"])
+   pd.to_datetime(["2005/11/23", "2010/12/31"])

 If you use dates which start with the day first (i.e. European style),
 you can pass the ``dayfirst`` flag:

 .. ipython:: python
-    :okwarning:
+   :okwarning:

    pd.to_datetime(["04-01-2012 10:00"], dayfirst=True)

-   pd.to_datetime(["14-01-2012", "01-14-2012"], dayfirst=True)
+   pd.to_datetime(["04-14-2012 10:00"], dayfirst=True)

 .. warning::

    You see in the above example that ``dayfirst`` isn't strict. If a date
    can't be parsed with the day being first it will be parsed as if
-   ``dayfirst`` were False, and in the case of parsing delimited date strings
-   (e.g. ``31-12-2012``) then a warning will also be raised.
+   ``dayfirst`` were ``False`` and a warning will also be raised.

 If you pass a single string to ``to_datetime``, it returns a single ``Timestamp``.
 ``Timestamp`` can also accept string input, but it doesn't accept string parsing

doc/source/whatsnew/v2.0.0.rst (+33 -1)

@@ -411,6 +411,38 @@ Optional libraries below the lowest tested version may still work, but are not c

 See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

+Datetimes are now parsed with a consistent format
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In the past, :func:`to_datetime` guessed the format for each element independently. This was appropriate for some cases where elements had mixed date formats - however, it would regularly cause problems when users expected a consistent format but the function would switch formats between elements. As of version 2.0.0, parsing will use a consistent format, determined by the first non-NA value (unless the user specifies a format, in which case that is used).
+
+*Old behavior*:
+
+  .. code-block:: ipython
+
+     In [1]: ser = pd.Series(['13-01-2000', '12-01-2000'])
+     In [2]: pd.to_datetime(ser)
+     Out[2]:
+     0   2000-01-13
+     1   2000-12-01
+     dtype: datetime64[ns]
+
+*New behavior*:
+
+  .. ipython:: python
+     :okwarning:
+
+     ser = pd.Series(['13-01-2000', '12-01-2000'])
+     pd.to_datetime(ser)
+
+Note that this affects :func:`read_csv` as well.
+
+If you still need to parse dates with inconsistent formats, you'll need to apply :func:`to_datetime`
+to each element individually, e.g. ::
+
+     ser = pd.Series(['13-01-2000', '12 January 2000'])
+     ser.apply(pd.to_datetime)
+
 .. _whatsnew_200.api_breaking.other:

 Other API changes
@@ -453,7 +485,7 @@ Other API changes

 Deprecations
 ~~~~~~~~~~~~
--
+- Deprecated argument ``infer_datetime_format`` in :func:`to_datetime` and :func:`read_csv`, as a strict version of it is now the default (:issue:`48621`)

 .. ---------------------------------------------------------------------------


pandas/_libs/tslibs/parsing.pyx (+23)

@@ -1032,6 +1032,7 @@ def guess_datetime_format(dt_str: str, bint dayfirst=False) -> str | None:
     # rebuild string, capturing any inferred padding
     dt_str = "".join(tokens)
     if parsed_datetime.strftime(guessed_format) == dt_str:
+        _maybe_warn_about_dayfirst(guessed_format, dayfirst)
         return guessed_format
     else:
         return None
@@ -1051,6 +1052,28 @@ cdef str _fill_token(token: str, padding: int):
         token_filled = f"{seconds}.{nanoseconds}"
     return token_filled

+cdef void _maybe_warn_about_dayfirst(format: str, bint dayfirst):
+    """Warn if guessed datetime format doesn't respect dayfirst argument."""
+    cdef:
+        int day_index = format.find("%d")
+        int month_index = format.find("%m")
+
+    if (day_index != -1) and (month_index != -1):
+        if (day_index > month_index) and dayfirst:
+            warnings.warn(
+                f"Parsing dates in {format} format when dayfirst=True was specified. "
+                "Pass `dayfirst=False` or specify a format to silence this warning.",
+                UserWarning,
+                stacklevel=find_stack_level(),
+            )
+        if (day_index < month_index) and not dayfirst:
+            warnings.warn(
+                f"Parsing dates in {format} format when dayfirst=False was specified. "
+                "Pass `dayfirst=True` or specify a format to silence this warning.",
+                UserWarning,
+                stacklevel=find_stack_level(),
+            )
+

 @cython.wraparound(False)
 @cython.boundscheck(False)
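
To see when the new check fires, the logic of ``_maybe_warn_about_dayfirst`` can be mirrored in plain Python. This is a sketch for experimentation, not the Cython code itself: ``maybe_warn_about_dayfirst`` and ``collect`` are hypothetical names, and a fixed ``stacklevel=2`` stands in for pandas' ``find_stack_level()``.

```python
import warnings


def maybe_warn_about_dayfirst(fmt, dayfirst):
    """Warn if the guessed format's day/month order contradicts `dayfirst`."""
    day_index = fmt.find("%d")
    month_index = fmt.find("%m")

    # Only formats containing both a numeric day and a numeric month are checked.
    if day_index != -1 and month_index != -1:
        if day_index > month_index and dayfirst:
            warnings.warn(
                f"Parsing dates in {fmt} format when dayfirst=True was specified. "
                "Pass `dayfirst=False` or specify a format to silence this warning.",
                UserWarning,
                stacklevel=2,  # assumption: fixed level instead of find_stack_level()
            )
        if day_index < month_index and not dayfirst:
            warnings.warn(
                f"Parsing dates in {fmt} format when dayfirst=False was specified. "
                "Pass `dayfirst=True` or specify a format to silence this warning.",
                UserWarning,
                stacklevel=2,
            )


def collect(fmt, dayfirst):
    """Run the check and return the messages of any warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        maybe_warn_about_dayfirst(fmt, dayfirst)
    return [str(w.message) for w in caught]


month_first_msgs = collect("%m/%d/%Y", dayfirst=True)   # contradicts dayfirst=True
day_first_msgs = collect("%d/%m/%Y", dayfirst=False)    # contradicts dayfirst=False
consistent_msgs = collect("%d/%m/%Y", dayfirst=True)    # order agrees, no warning
no_month_msgs = collect("%d %B %Y", dayfirst=False)     # no "%m" token, no warning
```

Note that a format with a month name (``%B`` or ``%b``) never triggers the warning in this sketch, since only the ``%d`` and ``%m`` positions are compared.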
