
Improved docstring and return type hints for to_datetime #42494

Merged 32 commits on Jan 5, 2022
Changes from 22 commits
Commits
bd4061c
Update datetimes.py
smarie Jul 12, 2021
66c725d
from code review: improved utc doc
Oct 4, 2021
74f6aa2
Merge branch 'master' of https://github.com/pandas-dev/pandas into pa…
Oct 4, 2021
f5cbef8
Improved overall readability by
Oct 5, 2021
866bdcb
minor improvement
Oct 5, 2021
cd3ec35
Minor fix and improvement again
Oct 5, 2021
1be053a
Changed order of output description to match the global section doc
Oct 5, 2021
41a1e53
Removed the "type: ignore" since the return type hints are now fixed
Oct 5, 2021
95cfc54
Removed type hint-related mods (will move to a separate pr)
Oct 8, 2021
6569b1e
Merge branch 'master' of https://github.com/pandas-dev/pandas into pa…
Oct 8, 2021
8ebd77e
Removed backslash characters from doctests as per code review
Oct 11, 2021
8baf7bf
As per code review: replaced all "tz-" with "timezone-"
Oct 11, 2021
bc26945
Code review: capitalized if
Oct 11, 2021
83ef850
Compressed output description as per code review.
Oct 28, 2021
0b21772
Moved the general summary to a notes section
Oct 28, 2021
83ddfe7
Update pandas/core/tools/datetimes.py
smarie Oct 28, 2021
8e1ebf0
As per code review: reduced the utc param description and added struc…
Dec 17, 2021
d8cbe8a
Merge branch 'master' of https://github.com/pandas-dev/pandas into pa…
Dec 17, 2021
5310779
Minor edits
Dec 17, 2021
2b22544
Changed as per code review
Dec 17, 2021
2b63ea7
Changed as per code review
Dec 18, 2021
70a7c8f
what's new attempt
Dec 18, 2021
7739e12
Revert "what's new attempt"
Jan 3, 2022
4e87e39
Merge branch 'master' of https://github.com/pandas-dev/pandas into pa…
Jan 3, 2022
de9fe69
Merge branch 'master' of https://github.com/pandas-dev/pandas into pa…
Jan 4, 2022
5f4dbb8
Changed as per code review: added sphinx directives wherever possible…
Jan 4, 2022
a2fb1a1
Changed as per code review: added const role
Jan 4, 2022
04312bd
sphinx role
Jan 4, 2022
514b0c4
Changed as per code review: sphinx roles
Jan 4, 2022
fc2395d
minor change again
Jan 4, 2022
e0cf329
Last polishing round: sphinx roles and a few fixes
Jan 4, 2022
1421830
Fixed typo
Jan 4, 2022
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.4.0.rst
@@ -637,8 +637,8 @@ Timedelta
Timezones
^^^^^^^^^
- Bug in :func:`to_datetime` with ``infer_datetime_format=True`` failing to parse zero UTC offset (``Z``) correctly (:issue:`41047`)
- Clarified :func:`to_datetime` documentation concerning parameter ``utc`` and the impact of its default value (``False``) on parsing datetimes from a timezone with varying time offsets (daylight savings) (:issue:`42229`).
- Bug in :meth:`Series.dt.tz_convert` resetting index in a :class:`Series` with :class:`CategoricalIndex` (:issue:`43080`)
-

Numeric
^^^^^^^
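
The changelog entry above refers to datetimes "from a timezone with varying time offsets". As a rough illustration (a standard-library sketch, not the pandas implementation), the snippet below parses the two offset-bearing strings used later in the docstring examples, checks that their offsets differ, and shows that converting both to UTC restores a single common offset. The variable names are hypothetical.

```python
from datetime import datetime, timezone

# Two readings around the 2020-10-25 clock change, written with the explicit
# offsets used in the docstring example (+02:00 first, then +01:00).
raw = ["2020-10-25 02:00 +0200", "2020-10-25 04:00 +0100"]
parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M %z") for s in raw]

# The offsets differ, so no single fixed offset describes both values.
offsets = {dt.utcoffset() for dt in parsed}
has_mixed_offsets = len(offsets) > 1

# Converting to UTC gives every element the same offset again.
as_utc = [dt.astimezone(timezone.utc) for dt in parsed]

print(has_mixed_offsets)
print([dt.isoformat() for dt in as_utc])
```

This is the situation in which the docstring suggests passing ``utc=True``.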
214 changes: 178 additions & 36 deletions pandas/core/tools/datetimes.py
@@ -691,6 +691,9 @@ def to_datetime(
"""
Convert argument to datetime.

This function converts a scalar, array-like, :class:`Series` or
:class:`DataFrame`/dict-like to a pandas datetime object.

Parameters
----------
arg : int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like
@@ -726,13 +729,29 @@ def to_datetime(
with year first.

utc : bool, default None
Return UTC DatetimeIndex if True (converting any tz-aware
datetime.datetime objects as well).
Control timezone-related parsing, localization and conversion.

- If ``True``, the function *always* returns a timezone-aware
UTC-localized Timestamp, Series or DatetimeIndex. To do this,
timezone-naive inputs are *localized* as UTC, while
timezone-aware inputs are *converted* to UTC.

- If ``False`` (default), inputs will not be coerced to UTC.
Timezone-naive inputs will remain naive, while timezone-aware ones
will keep their time offsets. Limitations exist for mixed
offsets (typically, daylight savings), see :ref:`Examples
<to_datetime_tz_examples>` section for details.

See also: pandas general documentation about `timezone conversion and
localization
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
#time-zone-handling>`_.

format : str, default None
The strftime to parse time, eg "%d/%m/%Y", note that "%f" will parse
all the way up to nanoseconds.
See strftime documentation for more information on choices:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.
all the way up to nanoseconds. See `strftime documentation
<https://docs.python.org/3/library/datetime.html
#strftime-and-strptime-behavior>`_ for more information on choices.
exact : bool, True by default
Behaves as:
- If True, require an exact format match.
@@ -772,28 +791,82 @@ def to_datetime(
-------
datetime
If parsing succeeded.
Return type depends on input:

- list-like:
- DatetimeIndex, if timezone naive or aware with the same timezone
- Index of object dtype, if timezone aware with mixed time offsets
- Series: Series of datetime64 dtype
- DataFrame: Series of datetime64 dtype
- scalar: Timestamp

In case when it is not possible to return designated types (e.g. when
any element of input is before Timestamp.min or after Timestamp.max)
return will have datetime.datetime type (or corresponding
array/Series).
Return type depends on input (types in parentheses correspond to
fallback in case of unsuccessful timezone or out-of-range timestamp
parsing):

- scalar: Timestamp (or datetime.datetime)
- array-like: DatetimeIndex (or Index with object dtype containing
datetime.datetime)
- Series: Series of datetime64 dtype (or Series of object
dtype containing datetime.datetime)
- DataFrame: Series of datetime64 dtype (or Series of object
dtype containing datetime.datetime)

Raises
------
ParserError
When parsing a date from string fails.
ValueError
When another datetime conversion error happens. For example, when one
of 'year', 'month', 'day' is missing in a :class:`DataFrame`, or when
a timezone-aware datetime.datetime is found in an array-like of mixed
time offsets and utc=False.

See Also
--------
DataFrame.astype : Cast argument to a specified dtype.
to_timedelta : Convert argument to timedelta.
convert_dtypes : Convert dtypes.

Notes
-----

Many input types are supported, and lead to different output types:

- scalars can be int, float, str, datetime object (from stdlib datetime
module or numpy). They are converted to :class:`Timestamp` when possible,
otherwise they are converted to ``datetime.datetime``. None/NaN/null
scalars are converted to ``NaT``.

- array-like can contain int, float, str, datetime objects. They are
converted to :class:`DatetimeIndex` when possible, otherwise they are
converted to :class:`Index` with object dtype, containing
``datetime.datetime``. None/NaN/null entries are converted to ``NaT`` in
both cases.

- :class:`Series` are converted to :class:`Series` with datetime64 dtype
when possible, otherwise they are converted to :class:`Series` with
object dtype, containing ``datetime.datetime``. None/NaN/null entries
are converted to ``NaT`` in both cases.

- :class:`DataFrame`/dict-like are converted to :class:`Series` with
datetime64 dtype. For each row a datetime is created from assembling
the various dataframe columns. Column keys can be common abbreviations
like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] or
plurals of the same.

The following causes are responsible for datetime.datetime objects being
returned (possibly inside an Index or a Series with object dtype) instead
of a proper pandas designated type (Timestamp, DatetimeIndex or Series
with datetime64 dtype):

- when any input element is before Timestamp.min or after Timestamp.max,
see `timestamp limitations
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
#timeseries-timestamp-limits>`_.

- when utc=False (default) and the input is an array-like or Series
containing mixed naive/aware datetime, or aware with mixed time offsets.
Note that this happens in the (quite frequent) situation when the
timezone has a daylight savings policy. In that case you may wish to
use utc=True.

Examples
--------

**Handling various input formats**

Assembling a datetime from multiple columns of a DataFrame. The keys can be
common abbreviations like ['year', 'month', 'day', 'minute', 'second',
'ms', 'us', 'ns'] or plurals of the same.
@@ -806,20 +879,7 @@ def to_datetime(
1 2016-03-05
dtype: datetime64[ns]

If a date does not meet the `timestamp limitations
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
#timeseries-timestamp-limits>`_, passing errors='ignore'
will return the original input instead of raising any exception.

Passing errors='coerce' will force an out-of-bounds date to NaT,
in addition to forcing non-dates (or non-parseable dates) to NaT.

>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
datetime.datetime(1300, 1, 1, 0, 0)
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
NaT

Passing infer_datetime_format=True can often-times speedup a parsing
Passing ``infer_datetime_format=True`` can often speed up parsing
if it's not exactly an ISO8601 format, but in a regular format.

>>> s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000'] * 1000)
@@ -854,16 +914,98 @@ def to_datetime(
DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'],
dtype='datetime64[ns]', freq=None)

In case input is list-like and the elements of input are of mixed
timezones, return will have object type Index if utc=False.
**Non-convertible date/times**

If a date does not meet the `timestamp limitations
<https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
#timeseries-timestamp-limits>`_, passing errors='ignore'
will return the original input instead of raising any exception.

Passing ``errors='coerce'`` will force an out-of-bounds date to ``NaT``,
in addition to forcing non-dates (or non-parseable dates) to ``NaT``.

>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
datetime.datetime(1300, 1, 1, 0, 0)
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
NaT

.. _to_datetime_tz_examples:

**Timezones and time offsets**

The default behaviour (``utc=False``) is as follows:

- Timezone-naive inputs are converted to timezone-naive ``DatetimeIndex``:

>>> pd.to_datetime(['2018-10-26 12:00', '2018-10-26 13:00:15'])
DatetimeIndex(['2018-10-26 12:00:00', '2018-10-26 13:00:15'],
dtype='datetime64[ns]', freq=None)

- Timezone-aware inputs *with constant time offset* are converted to
timezone-aware ``DatetimeIndex``:

>>> pd.to_datetime(['2018-10-26 12:00 -0500', '2018-10-26 13:00 -0500'])
DatetimeIndex(['2018-10-26 12:00:00-05:00', '2018-10-26 13:00:00-05:00'],
dtype='datetime64[ns, pytz.FixedOffset(-300)]', freq=None)

- However, timezone-aware inputs *with mixed time offsets* (for example
issued from a timezone with daylight savings, such as Europe/Paris)
are **not successfully converted** to a ``DatetimeIndex``. Instead a
simple ``Index`` containing ``datetime.datetime`` objects is returned:

>>> pd.to_datetime(['2018-10-26 12:00 -0530', '2018-10-26 12:00 -0500'])
Index([2018-10-26 12:00:00-05:30, 2018-10-26 12:00:00-05:00], dtype='object')
>>> pd.to_datetime(['2020-10-25 02:00 +0200', '2020-10-25 04:00 +0100'])
Index([2020-10-25 02:00:00+02:00, 2020-10-25 04:00:00+01:00],
dtype='object')

- A mix of timezone-aware and timezone-naive inputs is converted to
a timezone-aware ``DatetimeIndex`` if the offsets of the timezone-aware
inputs are constant:

>>> from datetime import datetime
>>> pd.to_datetime(["2020-01-01 01:00 -01:00", datetime(2020, 1, 1, 3, 0)])
DatetimeIndex(['2020-01-01 01:00:00-01:00', '2020-01-01 02:00:00-01:00'],
dtype='datetime64[ns, pytz.FixedOffset(-60)]', freq=None)

- Finally, mixing timezone-aware strings and ``datetime.datetime`` always
raises an error, even if the elements all have the same time offset.

>>> from datetime import datetime, timezone, timedelta
>>> d = datetime(2020, 1, 1, 18, tzinfo=timezone(-timedelta(hours=1)))
>>> pd.to_datetime(["2020-01-01 17:00 -0100", d])
Traceback (most recent call last):
...
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64
unless utc=True

|

Setting ``utc=True`` solves most of the above issues:

- Timezone-naive inputs are *localized* as UTC:

>>> pd.to_datetime(['2018-10-26 12:00', '2018-10-26 13:00'], utc=True)
DatetimeIndex(['2018-10-26 12:00:00+00:00', '2018-10-26 13:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)

- Timezone-aware inputs are *converted* to UTC (the output represents the
exact same datetime, but viewed from the UTC time offset ``+00:00``):

>>> pd.to_datetime(['2018-10-26 12:00 -0530', '2018-10-26 12:00 -0500'],
... utc=True)
DatetimeIndex(['2018-10-26 17:30:00+00:00', '2018-10-26 17:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)

- Inputs can contain both naive and aware, string or datetime; the above
rules still apply:

>>> pd.to_datetime(['2018-10-26 12:00', '2018-10-26 12:00 -0530',
... datetime(2020, 1, 1, 18),
... datetime(2020, 1, 1, 18,
... tzinfo=timezone(-timedelta(hours=1)))],
... utc=True)
DatetimeIndex(['2018-10-26 12:00:00+00:00', '2018-10-26 17:30:00+00:00',
'2020-01-01 18:00:00+00:00', '2020-01-01 19:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq=None)
"""
if arg is None:
return None
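
The docstring above distinguishes *localizing* naive inputs as UTC from *converting* aware inputs to UTC when ``utc=True``. A minimal standard-library sketch of that distinction follows. The helper `to_utc` is hypothetical and is not how pandas implements `to_datetime`; it only mirrors the two rules on plain `datetime.datetime` objects, using the same sample values as the last docstring example.

```python
from datetime import datetime, timedelta, timezone


def to_utc(value: datetime) -> datetime:
    """Hypothetical helper mirroring the two ``utc=True`` rules."""
    if value.tzinfo is None or value.utcoffset() is None:
        # Naive input: keep the wall-clock fields and attach UTC (localize).
        return value.replace(tzinfo=timezone.utc)
    # Aware input: keep the instant and shift the wall clock to UTC (convert).
    return value.astimezone(timezone.utc)


naive = datetime(2020, 1, 1, 18)
aware = datetime(2020, 1, 1, 18, tzinfo=timezone(-timedelta(hours=1)))

localized = to_utc(naive)   # wall clock unchanged, now tagged +00:00
converted = to_utc(aware)   # same instant, wall clock moved by one hour

print(localized.isoformat())
print(converted.isoformat())
```

The two results correspond to the `2020-01-01 18:00:00+00:00` and `2020-01-01 19:00:00+00:00` entries in the final docstring example.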