BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #49185

Closed
wants to merge 12 commits into from
2 changes: 2 additions & 0 deletions doc/source/user_guide/basics.rst
@@ -2313,6 +2313,7 @@ useful if you are reading in data which is mostly of the desired dtype (e.g. num
non-conforming elements intermixed that you want to represent as missing:

.. ipython:: python
:okwarning:

import datetime

@@ -2329,6 +2330,7 @@ The ``errors`` parameter has a third option of ``errors='ignore'``, which will s
encounters any errors with the conversion to a desired data type:

.. ipython:: python
:okwarning:

import datetime

38 changes: 13 additions & 25 deletions doc/source/user_guide/io.rst
@@ -1009,41 +1009,29 @@ To parse the mixed-timezone values as a datetime column, pass a partially-applie
Inferring datetime format
+++++++++++++++++++++++++

If you have ``parse_dates`` enabled for some or all of your columns, and your
datetime strings are all formatted the same way, you may get a large speed
up by setting ``infer_datetime_format=True``. If set, pandas will attempt
to guess the format of your datetime strings, and then use a faster means
of parsing the strings. 5-10x parsing speeds have been observed. pandas
will fallback to the usual parsing if either the format cannot be guessed
or the format that was guessed cannot properly parse the entire column
of strings. So in general, ``infer_datetime_format`` should not have any
negative consequences if enabled.

Here are some examples of datetime strings that can be guessed (All
representing December 30th, 2011 at 00:00:00):

* "20111230"
* "2011/12/30"
* "20111230 00:00:00"
* "12/30/2011 00:00:00"
* "30/Dec/2011 00:00:00"
* "30/December/2011 00:00:00"

Note that ``infer_datetime_format`` is sensitive to ``dayfirst``. With
``dayfirst=True``, it will guess "01/12/2011" to be December 1st. With
``dayfirst=False`` (default) it will guess "01/12/2011" to be January 12th.
If you try to parse a column of date strings, pandas will attempt to guess the format
from the first non-NaN element, and will then parse the rest of the column with that
format.

.. ipython:: python

# Try to infer the format for the index column
df = pd.read_csv(
"foo.csv",
index_col=0,
parse_dates=True,
infer_datetime_format=True,
)
df
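As a rough illustration of the rule described above (guess once from the first non-missing element, then parse every other element with that same format), here is a plain-Python sketch that uses only ``datetime.strptime``. The candidate list and the helper names ``guess_format`` and ``parse_column`` are hypothetical and are not part of pandas; the real guesser handles many more layouts.

```python
from datetime import datetime

# Hypothetical candidate formats, for illustration only.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d-%m-%Y", "%m/%d/%Y", "%d %b %Y"]


def guess_format(value):
    """Return the first candidate format that parses `value`, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        return fmt
    return None


def parse_column(values):
    """Guess a format from the first non-missing value, then apply it to all."""
    first = next((v for v in values if v is not None), None)
    if first is None:
        return [None] * len(values)
    fmt = guess_format(first)
    if fmt is None:
        raise ValueError(f"could not guess a format from {first!r}")
    # Every other element must match the guessed format exactly.
    return [None if v is None else datetime.strptime(v, fmt) for v in values]


parsed = parse_column([None, "13-01-2000", "12-01-2000"])
```

In this sketch a column that mixes layouts, such as ``["13-01-2000", "12 Jan 2000"]``, makes ``parse_column`` raise ``ValueError``, because the second element does not match the format guessed from the first.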

In the case that you have mixed datetime formats within the same column, you'll need to
first read it in as an object dtype and then apply :func:`to_datetime` to each element.

.. ipython:: python

data = io.StringIO("date\n12 Jan 2000\n2000-01-13\n")
df = pd.read_csv(data)
df['date'] = df['date'].apply(pd.to_datetime)
df

.. ipython:: python
:suppress:

27 changes: 8 additions & 19 deletions doc/source/user_guide/timeseries.rst
@@ -13,17 +13,6 @@ a tremendous amount of new functionality for manipulating time series data.

For example, pandas supports:

Parsing time series information from various sources and formats

.. ipython:: python

import datetime

dti = pd.to_datetime(
["1/1/2018", np.datetime64("2018-01-01"), datetime.datetime(2018, 1, 1)]
)
dti

Generate sequences of fixed-frequency dates and time spans

.. ipython:: python
@@ -132,6 +121,8 @@ time.

.. ipython:: python

import datetime

pd.Timestamp(datetime.datetime(2012, 5, 1))
pd.Timestamp("2012-05-01")
pd.Timestamp(2012, 5, 1)
@@ -196,26 +187,24 @@ is converted to a ``DatetimeIndex``:

.. ipython:: python

pd.to_datetime(pd.Series(["Jul 31, 2009", "2010-01-10", None]))
pd.to_datetime(pd.Series(["Jul 31, 2009", "Jan 10, 2010", None]))

pd.to_datetime(["2005/11/23", "2010.12.31"])
pd.to_datetime(["2005/11/23", "2010/12/31"])

If you use dates which start with the day first (i.e. European style),
you can pass the ``dayfirst`` flag:

.. ipython:: python
:okwarning:
:okwarning:

pd.to_datetime(["04-01-2012 10:00"], dayfirst=True)

pd.to_datetime(["14-01-2012", "01-14-2012"], dayfirst=True)
pd.to_datetime(["04-14-2012 10:00"], dayfirst=True)

.. warning::

You see in the above example that ``dayfirst`` isn't strict. If a date
can't be parsed with the day being first it will be parsed as if
``dayfirst`` were False, and in the case of parsing delimited date strings
(e.g. ``31-12-2012``) then a warning will also be raised.
``dayfirst`` were False and a warning will also be raised.
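The non-strict behaviour described in this warning can be pictured with a small plain-Python sketch. It is an assumption-laden illustration built on ``datetime.strptime``, not the pandas implementation; the helper name ``parse_dayfirst`` and the two hard-coded formats are hypothetical.

```python
import warnings
from datetime import datetime


def parse_dayfirst(value):
    """Prefer a day-first reading; fall back to month-first with a warning."""
    try:
        return datetime.strptime(value, "%d-%m-%Y %H:%M")
    except ValueError:
        # The day-first reading is impossible (for example, month 14),
        # so warn and parse the string as month-first instead.
        warnings.warn(
            f"Parsing {value!r} month-first because day-first parsing failed",
            UserWarning,
            stacklevel=2,
        )
        return datetime.strptime(value, "%m-%d-%Y %H:%M")


day_first = parse_dayfirst("04-01-2012 10:00")
```

Here ``"04-01-2012 10:00"`` is read as 4 January 2012, while ``"04-14-2012 10:00"`` cannot be day-first and is read as 14 April 2012 after a warning is emitted.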

If you pass a single string to ``to_datetime``, it returns a single ``Timestamp``.
``Timestamp`` can also accept string input, but it doesn't accept string parsing
@@ -768,7 +757,7 @@ partially matching dates:
rng2 = pd.date_range("2011-01-01", "2012-01-01", freq="W")
ts2 = pd.Series(np.random.randn(len(rng2)), index=rng2)

ts2.truncate(before="2011-11", after="2011-12")
ts2.truncate(before="2011-11-01", after="2011-12-01")
ts2["2011-11":"2011-12"]

Even complicated fancy indexing that breaks the ``DatetimeIndex`` frequency
33 changes: 33 additions & 0 deletions doc/source/whatsnew/v2.0.0.rst
@@ -114,6 +114,39 @@ Optional libraries below the lowest tested version may still work, but are not c

See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.

Datetimes are now parsed with a consistent format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:func:`to_datetime` now parses dates with a consistent format, which is guessed from the first non-NA value
(unless ``format`` is specified). Previously, it would have guessed the format for each element individually.

*Old behavior*:

.. code-block:: ipython

In [1]: ser = pd.Series(['13-01-2000', '12-01-2000'])
In [2]: pd.to_datetime(ser)
Out[2]:
0 2000-01-13
1 2000-12-01
dtype: datetime64[ns]

*New behavior*:

.. ipython:: python
:okwarning:

ser = pd.Series(['13-01-2000', '12-01-2000'])
pd.to_datetime(ser)

Note that this affects :func:`read_csv` as well.

If you still need to parse dates with inconsistent formats, you'll need to apply :func:`to_datetime`
to each element individually, e.g. ::

ser = pd.Series(['13-01-2000', '12 January 2000'])
ser.apply(pd.to_datetime)
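To see why element-wise application copes with inconsistent formats, consider the following plain-Python sketch, in which each value is parsed independently against a list of formats. It only mirrors the idea; the ``FORMATS`` list and the ``parse_one`` helper are hypothetical and do not reflect how :func:`to_datetime` is implemented.

```python
from datetime import datetime

# Hypothetical formats that each element is tried against, in order.
FORMATS = ["%d-%m-%Y", "%d %B %Y"]


def parse_one(value):
    """Parse a single string, trying each format independently."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"no known format matches {value!r}")


# Each element picks its own format, so mixed layouts are accepted.
parsed = [parse_one(v) for v in ["13-01-2000", "12 January 2000"]]
```

The trade-off is that every element repeats the format search, so this approach is slower than guessing one format for the whole column.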

.. _whatsnew_200.api_breaking.other:

Other API changes
20 changes: 20 additions & 0 deletions pandas/_libs/tslib.pxd
@@ -0,0 +1,20 @@
from pandas._libs.tslibs.np_datetime cimport (
NPY_DATETIMEUNIT,
npy_datetimestruct,
)


cdef extern from "src/datetime/np_datetime_strings.h":
ctypedef struct ISOInfo:
const char *format
int format_len
const char *date_sep
const char *time_sep
const char *micro_or_tz
int year
int month
int day
int hour
int minute
int second
int exact
34 changes: 31 additions & 3 deletions pandas/_libs/tslib.pyx
@@ -85,6 +85,22 @@ def _test_parse_iso8601(ts: str):
_TSObject obj
int out_local = 0, out_tzoffset = 0
NPY_DATETIMEUNIT out_bestunit
ISOInfo iso_info

iso_info = ISOInfo(
format='',
format_len=0,
date_sep='',
time_sep='',
micro_or_tz='',
year=False,
month=False,
day=False,
hour=False,
minute=False,
second=False,
exact=False,
)

obj = _TSObject()

@@ -93,7 +109,7 @@ def _test_parse_iso8601(ts: str):
elif ts == 'today':
return Timestamp.now().normalize()

string_to_dts(ts, &obj.dts, &out_bestunit, &out_local, &out_tzoffset, True)
string_to_dts(ts, &obj.dts, &out_bestunit, &out_local, &out_tzoffset, True, &iso_info)
obj.value = npy_datetimestruct_to_datetime(NPY_FR_ns, &obj.dts)
check_dts_bounds(&obj.dts)
if out_local == 1:
@@ -443,6 +459,7 @@ def first_non_null(values: ndarray) -> int:
@cython.boundscheck(False)
cpdef array_to_datetime(
ndarray[object] values,
ISOInfo iso_info,
str errors='raise',
bint dayfirst=False,
bint yearfirst=False,
@@ -510,6 +527,7 @@ cpdef array_to_datetime(
tzinfo tz_out = None
bint found_tz = False, found_naive = False


# specify error conditions
assert is_raise or is_ignore or is_coerce

@@ -568,6 +586,16 @@ cpdef array_to_datetime(
iresult[i] = get_datetime64_nanos(val, NPY_FR_ns)

elif is_integer_object(val) or is_float_object(val):
if require_iso8601:
if is_coerce:
iresult[i] = NPY_NAT
continue
elif is_raise:
raise ValueError(
f"time data \"{val}\" at position {i} doesn't match format {iso_info.format.decode('utf-8')}"
)
return values, tz_out

# these must be ns unit by-definition
seen_integer = True

@@ -598,7 +626,7 @@ cpdef array_to_datetime(

string_to_dts_failed = string_to_dts(
val, &dts, &out_bestunit, &out_local,
&out_tzoffset, False
&out_tzoffset, False, &iso_info,
)
if string_to_dts_failed:
# An error at this point is a _parsing_ error
@@ -613,7 +641,7 @@ cpdef array_to_datetime(
continue
elif is_raise:
raise ValueError(
f"time data \"{val}\" at position {i} doesn't match format specified"
f"time data \"{val}\" at position {i} doesn't match format {iso_info.format.decode('utf-8')}"
)
return values, tz_out

Expand Down
15 changes: 15 additions & 0 deletions pandas/_libs/tslibs/conversion.pxd
@@ -40,3 +40,18 @@ cdef int64_t cast_from_unit(object ts, str unit) except? -1
cpdef (int64_t, int) precision_from_unit(str unit)

cdef maybe_localize_tso(_TSObject obj, tzinfo tz, NPY_DATETIMEUNIT reso)

cdef extern from "src/datetime/np_datetime_strings.h":
ctypedef struct ISOInfo:
const char *format
int format_len
const char *date_sep
const char *time_sep
const char *micro_or_tz
int year
int month
int day
int hour
int minute
int second
int exact
18 changes: 17 additions & 1 deletion pandas/_libs/tslibs/conversion.pyx
@@ -469,6 +469,22 @@ cdef _TSObject _convert_str_to_tsobject(object ts, tzinfo tz, str unit,
datetime dt
int64_t ival
NPY_DATETIMEUNIT out_bestunit
ISOInfo iso_info

iso_info = ISOInfo(
format='',
format_len=0,
date_sep='',
time_sep='',
micro_or_tz='',
year=False,
month=False,
day=False,
hour=False,
minute=False,
second=False,
exact=False,
)

if len(ts) == 0 or ts in nat_strings:
ts = NaT
@@ -488,7 +504,7 @@ cdef _TSObject _convert_str_to_tsobject(object ts, tzinfo tz, str unit,
else:
string_to_dts_failed = string_to_dts(
ts, &dts, &out_bestunit, &out_local,
&out_tzoffset, False
&out_tzoffset, False, &iso_info,
)
if not string_to_dts_failed:
try:
17 changes: 17 additions & 0 deletions pandas/_libs/tslibs/np_datetime.pxd
@@ -95,6 +95,7 @@ cdef int string_to_dts(
int* out_local,
int* out_tzoffset,
bint want_exc,
ISOInfo* iso_info,
) except? -1

cdef NPY_DATETIMEUNIT get_unit_from_dtype(cnp.dtype dtype)
@@ -118,3 +119,19 @@ cdef int64_t convert_reso(
NPY_DATETIMEUNIT to_reso,
bint round_ok,
) except? -1

cdef extern from "src/datetime/np_datetime_strings.h":

ctypedef struct ISOInfo:
const char *format
int format_len
const char *date_sep
const char *time_sep
const char *micro_or_tz
int year
int month
int day
int hour
int minute
int second
int exact