pandas-dev · MarcoGorelli · Oct 18, 2022
diff --git a/doc/source/user_guide/basics.rst b/doc/source/user_guide/basics.rst
@@ -2313,6 +2313,7 @@ useful if you are reading in data which is mostly of the desired dtype (e.g. num
 non-conforming elements intermixed that you want to represent as missing:
 
 .. ipython:: python
+    :okwarning:
 
     import datetime
 
@@ -2329,6 +2330,7 @@ The ``errors`` parameter has a third option of ``errors='ignore'``, which will s
 encounters any errors with the conversion to a desired data type:
 
 .. ipython:: python
+    :okwarning:
 
     import datetime
 

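Note on the two hunks above: they only add ``:okwarning:``, while the surrounding text describes how the ``errors`` parameter handles non-conforming elements. A minimal sketch of the ``errors='coerce'`` behavior follows. The sample values are made up for illustration, and ``pd.to_numeric`` is used here because its result does not depend on date-format guessing.

```python
import pandas as pd

# Illustrative data: mostly numeric strings with one non-conforming element.
raw = pd.Series(["1", "apple", "3"])

# errors="coerce" replaces values that cannot be converted with NaN
# instead of raising an exception.
converted = pd.to_numeric(raw, errors="coerce")
print(converted)
```

The non-conforming element ends up as a missing value, and the remaining elements are converted as usual.
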
diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst
@@ -1009,41 +1009,29 @@ To parse the mixed-timezone values as a datetime column, pass a partially-applie
 Inferring datetime format
 +++++++++++++++++++++++++
 
-If you have ``parse_dates`` enabled for some or all of your columns, and your
-datetime strings are all formatted the same way, you may get a large speed
-up by setting ``infer_datetime_format=True``.  If set, pandas will attempt
-to guess the format of your datetime strings, and then use a faster means
-of parsing the strings.  5-10x parsing speeds have been observed.  pandas
-will fallback to the usual parsing if either the format cannot be guessed
-or the format that was guessed cannot properly parse the entire column
-of strings.  So in general, ``infer_datetime_format`` should not have any
-negative consequences if enabled.
-
-Here are some examples of datetime strings that can be guessed (All
-representing December 30th, 2011 at 00:00:00):
-
-* "20111230"
-* "2011/12/30"
-* "20111230 00:00:00"
-* "12/30/2011 00:00:00"
-* "30/Dec/2011 00:00:00"
-* "30/December/2011 00:00:00"
-
-Note that ``infer_datetime_format`` is sensitive to ``dayfirst``.  With
-``dayfirst=True``, it will guess "01/12/2011" to be December 1st. With
-``dayfirst=False`` (default) it will guess "01/12/2011" to be January 12th.
+When parsing a column of date strings, pandas guesses the format from the first
+non-NaN element and then parses the rest of the column with that same format,
+so every value in the column is interpreted consistently.
 
 .. ipython:: python
 
-   # Try to infer the format for the index column
    df = pd.read_csv(
        "foo.csv",
        index_col=0,
        parse_dates=True,
-       infer_datetime_format=True,
    )
    df
 
+If a column contains mixed datetime formats, first read it in as object dtype and
+then apply :func:`to_datetime` to each element.
+
+.. ipython:: python
+
+   data = io.StringIO("date\n12 Jan 2000\n2000-01-13\n")
+   df = pd.read_csv(data)
+   df['date'] = df['date'].apply(pd.to_datetime)
+   df
+
 .. ipython:: python
    :suppress:
 

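Note on the ``io.rst`` change above: when the format of a column is known ahead of time, one option is to avoid guessing altogether by reading the column as strings and passing an explicit ``format`` to ``to_datetime``. This is only a sketch; the CSV content and column names are invented for illustration.

```python
import io

import pandas as pd

# Illustrative CSV: day-first dates that would be ambiguous without a format.
data = io.StringIO("date,value\n12/01/2000,1\n13/01/2000,2\n")

# Read the date column as plain strings first (no parse_dates).
df = pd.read_csv(data)

# An explicit format removes any guessing: both rows are read as day/month/year.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
print(df)
```

With the explicit format, ``12/01/2000`` is read as 12 January rather than 1 December.
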
diff --git a/doc/source/user_guide/timeseries.rst b/doc/source/user_guide/timeseries.rst
@@ -13,17 +13,6 @@ a tremendous amount of new functionality for manipulating time series data.
 
 For example, pandas supports:
 
-Parsing time series information from various sources and formats
-
-.. ipython:: python
-
-   import datetime
-
-   dti = pd.to_datetime(
-       ["1/1/2018", np.datetime64("2018-01-01"), datetime.datetime(2018, 1, 1)]
-   )
-   dti
-
 Generate sequences of fixed-frequency dates and time spans
 
 .. ipython:: python
@@ -132,6 +121,8 @@ time.
 
 .. ipython:: python
 
+   import datetime
+
    pd.Timestamp(datetime.datetime(2012, 5, 1))
    pd.Timestamp("2012-05-01")
    pd.Timestamp(2012, 5, 1)
@@ -196,26 +187,24 @@ is converted to a ``DatetimeIndex``:
 
 .. ipython:: python
 
-    pd.to_datetime(pd.Series(["Jul 31, 2009", "2010-01-10", None]))
+    pd.to_datetime(pd.Series(["Jul 31, 2009", "Jan 10, 2010", None]))
 
-    pd.to_datetime(["2005/11/23", "2010.12.31"])
+    pd.to_datetime(["2005/11/23", "2010/12/31"])
 
 If you use dates which start with the day first (i.e. European style),
 you can pass the ``dayfirst`` flag:
 
 .. ipython:: python
-   :okwarning:
+    :okwarning:
 
     pd.to_datetime(["04-01-2012 10:00"], dayfirst=True)
-
-    pd.to_datetime(["14-01-2012", "01-14-2012"], dayfirst=True)
+    pd.to_datetime(["04-14-2012 10:00"], dayfirst=True)
 
 .. warning::
 
    You see in the above example that ``dayfirst`` isn't strict. If a date
    can't be parsed with the day being first it will be parsed as if
-   ``dayfirst`` were False, and in the case of parsing delimited date strings
-   (e.g. ``31-12-2012``) then a warning will also be raised.
+   ``dayfirst`` were ``False``, and a warning will also be raised.
 
 If you pass a single string to ``to_datetime``, it returns a single ``Timestamp``.
 ``Timestamp`` can also accept string input, but it doesn't accept string parsing
@@ -768,7 +757,7 @@ partially matching dates:
    rng2 = pd.date_range("2011-01-01", "2012-01-01", freq="W")
    ts2 = pd.Series(np.random.randn(len(rng2)), index=rng2)
 
-   ts2.truncate(before="2011-11", after="2011-12")
+   ts2.truncate(before="2011-11-01", after="2011-12-01")
    ts2["2011-11":"2011-12"]
 
 Even complicated fancy indexing that breaks the ``DatetimeIndex`` frequency

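Note on the ``dayfirst`` warning in the ``timeseries.rst`` change above: because ``dayfirst`` is not strict, a caller who wants strict day-first parsing could pass an explicit ``format`` instead. The sketch below assumes the strings follow ``%d-%m-%Y %H:%M``; the variable names are illustrative.

```python
import pandas as pd

# dayfirst=True reads "04-01-2012" as 4 January 2012.
lenient = pd.to_datetime(["04-01-2012 10:00"], dayfirst=True)

# An explicit format is strict: "04-14-2012" would need month 14,
# so with errors="coerce" the value becomes NaT instead of being
# silently reinterpreted as month-first.
strict = pd.to_datetime(
    ["04-14-2012 10:00"], format="%d-%m-%Y %H:%M", errors="coerce"
)

print(lenient)
print(strict)
```
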
diff --git a/doc/source/whatsnew/v2.0.0.rst b/doc/source/whatsnew/v2.0.0.rst
@@ -114,6 +114,39 @@ Optional libraries below the lowest tested version may still work, but are not c
 
 See :ref:`install.dependencies` and :ref:`install.optional_dependencies` for more.
 
+Datetimes are now parsed with a consistent format
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:func:`to_datetime` now parses dates with a consistent format, which is guessed from the first non-NA value
+(unless ``format`` is specified). Previously, it guessed the format for each element individually.
+
+*Old behavior*:
+
+  .. code-block:: ipython
+
+     In [1]: ser = pd.Series(['13-01-2000', '12-01-2000'])
+     In [2]: pd.to_datetime(ser)
+     Out[2]:
+     0   2000-01-13
+     1   2000-12-01
+     dtype: datetime64[ns]
+
+*New behavior*:
+
+  .. ipython:: python
+     :okwarning:
+
+     ser = pd.Series(['13-01-2000', '12-01-2000'])
+     pd.to_datetime(ser)
+
+Note that this affects :func:`read_csv` as well.
+
+If you still need to parse dates with inconsistent formats, apply :func:`to_datetime`
+to each element individually, e.g. ::
+
+     ser = pd.Series(['13-01-2000', '12 January 2000'])
+     ser.apply(pd.to_datetime)
+
 .. _whatsnew_200.api_breaking.other:
 
 Other API changes

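Note on the whatsnew entry above: when the set of possible formats is known, an alternative to a bare ``ser.apply(pd.to_datetime)`` is to try each known format in turn. This is a sketch under that assumption; ``KNOWN_FORMATS`` and ``parse_with_known_formats`` are hypothetical names and not part of pandas.

```python
import pandas as pd

# Assumption: these are the only formats that occur in the data.
KNOWN_FORMATS = ("%d-%m-%Y", "%d %B %Y")


def parse_with_known_formats(value):
    """Hypothetical helper: try each known format, return NaT if none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return pd.to_datetime(value, format=fmt)
        except ValueError:
            # This format did not match; try the next one.
            continue
    return pd.NaT


ser = pd.Series(["13-01-2000", "12 January 2000"])
parsed = ser.apply(parse_with_known_formats)
print(parsed)
```

Each element is parsed with the first format that matches it, so no format guessing is involved.
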
diff --git a/pandas/_libs/tslib.pxd b/pandas/_libs/tslib.pxd
@@ -0,0 +1,20 @@
+from pandas._libs.tslibs.np_datetime cimport (
+    NPY_DATETIMEUNIT,
+    npy_datetimestruct,
+)
+
+
+cdef extern from "src/datetime/np_datetime_strings.h":
+    ctypedef struct ISOInfo:
+        const char *format
+        int format_len
+        const char *date_sep
+        const char *time_sep
+        const char *micro_or_tz
+        int year
+        int month
+        int day
+        int hour
+        int minute
+        int second
+        int exact
diff --git a/pandas/_libs/tslib.pyx b/pandas/_libs/tslib.pyx
@@ -85,6 +85,22 @@ def _test_parse_iso8601(ts: str):
         _TSObject obj
         int out_local = 0, out_tzoffset = 0
         NPY_DATETIMEUNIT out_bestunit
+        ISOInfo iso_info
+
+    iso_info = ISOInfo(
+        format='',
+        format_len=0,
+        date_sep='',
+        time_sep='',
+        micro_or_tz='',
+        year=False,
+        month=False,
+        day=False,
+        hour=False,
+        minute=False,
+        second=False,
+        exact=False,
+    )
 
     obj = _TSObject()
 
@@ -93,7 +109,7 @@ def _test_parse_iso8601(ts: str):
     elif ts == 'today':
         return Timestamp.now().normalize()
 
-    string_to_dts(ts, &obj.dts, &out_bestunit, &out_local, &out_tzoffset, True)
+    string_to_dts(ts, &obj.dts, &out_bestunit, &out_local, &out_tzoffset, True, &iso_info)
     obj.value = npy_datetimestruct_to_datetime(NPY_FR_ns, &obj.dts)
     check_dts_bounds(&obj.dts)
     if out_local == 1:
@@ -443,6 +459,7 @@ def first_non_null(values: ndarray) -> int:
 @cython.boundscheck(False)
 cpdef array_to_datetime(
     ndarray[object] values,
+    ISOInfo iso_info,
     str errors='raise',
     bint dayfirst=False,
     bint yearfirst=False,
@@ -510,6 +527,7 @@ cpdef array_to_datetime(
         tzinfo tz_out = None
         bint found_tz = False, found_naive = False
 
+
     # specify error conditions
     assert is_raise or is_ignore or is_coerce
 
@@ -568,6 +586,16 @@ cpdef array_to_datetime(
                     iresult[i] = get_datetime64_nanos(val, NPY_FR_ns)
 
                 elif is_integer_object(val) or is_float_object(val):
+                    if require_iso8601:
+                        if is_coerce:
+                            iresult[i] = NPY_NAT
+                            continue
+                        elif is_raise:
+                            raise ValueError(
+                                f"time data \"{val}\" at position {i} doesn't match format {iso_info.format.decode('utf-8')}"
+                            )
+                        return values, tz_out
+
                     # these must be ns unit by-definition
                     seen_integer = True
 
@@ -598,7 +626,7 @@ cpdef array_to_datetime(
 
                     string_to_dts_failed = string_to_dts(
                         val, &dts, &out_bestunit, &out_local,
-                        &out_tzoffset, False
+                        &out_tzoffset, False, &iso_info,
                     )
                     if string_to_dts_failed:
                         # An error at this point is a _parsing_ error
@@ -613,7 +641,7 @@ cpdef array_to_datetime(
                                 continue
                             elif is_raise:
                                 raise ValueError(
-                                    f"time data \"{val}\" at position {i} doesn't match format specified"
+                                    f"time data \"{val}\" at position {i} doesn't match format {iso_info.format.decode('utf-8')}"
                                 )
                             return values, tz_out
 

diff --git a/pandas/_libs/tslibs/conversion.pxd b/pandas/_libs/tslibs/conversion.pxd
@@ -40,3 +40,18 @@ cdef int64_t cast_from_unit(object ts, str unit) except? -1
 cpdef (int64_t, int) precision_from_unit(str unit)
 
 cdef maybe_localize_tso(_TSObject obj, tzinfo tz, NPY_DATETIMEUNIT reso)
+
+cdef extern from "src/datetime/np_datetime_strings.h":
+    ctypedef struct ISOInfo:
+        const char *format
+        int format_len
+        const char *date_sep
+        const char *time_sep
+        const char *micro_or_tz
+        int year
+        int month
+        int day
+        int hour
+        int minute
+        int second
+        int exact
diff --git a/pandas/_libs/tslibs/conversion.pyx b/pandas/_libs/tslibs/conversion.pyx
@@ -469,6 +469,22 @@ cdef _TSObject _convert_str_to_tsobject(object ts, tzinfo tz, str unit,
         datetime dt
         int64_t ival
         NPY_DATETIMEUNIT out_bestunit
+        ISOInfo iso_info
+
+    iso_info = ISOInfo(
+        format='',
+        format_len=0,
+        date_sep='',
+        time_sep='',
+        micro_or_tz='',
+        year=False,
+        month=False,
+        day=False,
+        hour=False,
+        minute=False,
+        second=False,
+        exact=False,
+    )
 
     if len(ts) == 0 or ts in nat_strings:
         ts = NaT
@@ -488,7 +504,7 @@ cdef _TSObject _convert_str_to_tsobject(object ts, tzinfo tz, str unit,
     else:
         string_to_dts_failed = string_to_dts(
             ts, &dts, &out_bestunit, &out_local,
-            &out_tzoffset, False
+            &out_tzoffset, False, &iso_info,
         )
         if not string_to_dts_failed:
             try:

diff --git a/pandas/_libs/tslibs/np_datetime.pxd b/pandas/_libs/tslibs/np_datetime.pxd
@@ -95,6 +95,7 @@ cdef int string_to_dts(
     int* out_local,
     int* out_tzoffset,
     bint want_exc,
+    ISOInfo* iso_info,
 ) except? -1
 
 cdef NPY_DATETIMEUNIT get_unit_from_dtype(cnp.dtype dtype)
@@ -118,3 +119,19 @@ cdef int64_t convert_reso(
     NPY_DATETIMEUNIT to_reso,
     bint round_ok,
 ) except? -1
+
+cdef extern from "src/datetime/np_datetime_strings.h":
+
+    ctypedef struct ISOInfo:
+        const char *format
+        int format_len
+        const char *date_sep
+        const char *time_sep
+        const char *micro_or_tz
+        int year
+        int month
+        int day
+        int hour
+        int minute
+        int second
+        int exact