ENH: Infer best datetime format from a sample #52626

Closed
wants to merge 53 commits into from
53 commits
fd7a534
All tests pass
LeoGrin Apr 11, 2023
93f9c7a
Update changelog
LeoGrin Apr 12, 2023
c37d40f
Add missing type hints
LeoGrin Apr 12, 2023
cbb5e0d
Cleaning
LeoGrin Apr 12, 2023
81664a2
Typo
LeoGrin Apr 12, 2023
ef33ba0
comment change
LeoGrin Apr 12, 2023
e1652f1
simplification
LeoGrin Apr 13, 2023
6b371ca
remove randomness
LeoGrin Apr 13, 2023
705d1b4
fix parser tests
LeoGrin Apr 14, 2023
27e39f3
Merge branch 'main' into datetime_format_inference_test
LeoGrin Apr 14, 2023
0bae15d
simplify getting evenly spaced non null
LeoGrin Apr 14, 2023
de7331f
update io readme
LeoGrin Apr 14, 2023
9136b4f
revert changed tests
LeoGrin Apr 14, 2023
9f966d5
fix type hints
LeoGrin Apr 14, 2023
be9e27a
Merge branch 'main' of https://github.com/pandas-dev/pandas into date…
LeoGrin Apr 14, 2023
e5e3cb3
Merge branch 'datetime_format_inference_test' of https://github.com/L…
LeoGrin Apr 14, 2023
7ca7244
fix type hints for np.unique
LeoGrin Apr 14, 2023
4b81192
remove prints
LeoGrin Apr 14, 2023
001a270
fix doc
LeoGrin Apr 14, 2023
fe99f83
fix example with febuary 30th
LeoGrin Apr 14, 2023
1d5b6d1
Merge branch 'main' into datetime_format_inference_test
LeoGrin Apr 14, 2023
8de90e4
fix doc
LeoGrin Apr 14, 2023
0b5ec7d
Merge branch 'datetime_format_inference_test' of https://github.com/L…
LeoGrin Apr 14, 2023
544aade
Merge branch 'main' into datetime_format_inference_test
LeoGrin Apr 14, 2023
a236ba9
Merge branch 'main' into datetime_format_inference_test
LeoGrin Apr 16, 2023
40ac264
Merge branch 'main' into datetime_format_inference_test
LeoGrin Apr 17, 2023
4cc9b0e
Merge branch 'main' into datetime_format_inference_test
LeoGrin Apr 19, 2023
0b4d46e
Merge branch 'main' into datetime_format_inference_test
LeoGrin Apr 20, 2023
c3dbb82
Merge branch 'main' into datetime_format_inference_test
LeoGrin Apr 20, 2023
8916625
Merge branch 'main' into datetime_format_inference_test
LeoGrin Apr 21, 2023
5fbe28e
Merge branch 'main' into datetime_format_inference_test
LeoGrin Apr 23, 2023
47fe413
Merge branch 'main' into datetime_format_inference_test
LeoGrin Apr 23, 2023
281d45b
All tests pass
LeoGrin Apr 11, 2023
51d9d98
Update changelog
LeoGrin Apr 12, 2023
1d7df6e
Add missing type hints
LeoGrin Apr 12, 2023
e6cf3ad
Cleaning
LeoGrin Apr 12, 2023
6998bf8
Typo
LeoGrin Apr 12, 2023
f98ea1f
comment change
LeoGrin Apr 12, 2023
86aa61c
simplification
LeoGrin Apr 13, 2023
8c6401b
remove randomness
LeoGrin Apr 13, 2023
28cf679
fix parser tests
LeoGrin Apr 14, 2023
a22114c
simplify getting evenly spaced non null
LeoGrin Apr 14, 2023
75bb8f6
update io readme
LeoGrin Apr 14, 2023
6f155b5
revert changed tests
LeoGrin Apr 14, 2023
2b2648e
fix type hints
LeoGrin Apr 14, 2023
3f02e0a
fix type hints for np.unique
LeoGrin Apr 14, 2023
feaa7a3
remove prints
LeoGrin Apr 14, 2023
60148b1
fix doc
LeoGrin Apr 14, 2023
6622eba
fix example with febuary 30th
LeoGrin Apr 14, 2023
23b28b9
fix doc
LeoGrin Apr 14, 2023
5cbfb2f
Merge branch 'main' into datetime_format_inference_test
LeoGrin Apr 24, 2023
5422bfa
check if any str at the beginning of _guess_datetime_format_for_array
LeoGrin Apr 24, 2023
3aa3cde
check if any str at the beginning of _guess_datetime_format_for_array
LeoGrin Apr 24, 2023
13 changes: 7 additions & 6 deletions doc/source/user_guide/io.rst
@@ -977,11 +977,10 @@ Note that format inference is sensitive to ``dayfirst``. With
``dayfirst=True``, it will guess "01/12/2011" to be December 1st. With
``dayfirst=False`` (default) it will guess "01/12/2011" to be January 12th.

-If you try to parse a column of date strings, pandas will attempt to guess the format
-from the first non-NaN element, and will then parse the rest of the column with that
-format. If pandas fails to guess the format (for example if your first string is
-``'01 December US/Pacific 2000'``), then a warning will be raised and each
-row will be parsed individually by ``dateutil.parser.parse``. The safest
+If you try to parse a column of date strings, pandas will attempt to find the format
+that works best on a sample of non-NaN elements, and will then parse the rest of the
+column with that format. If pandas fails to guess the format, then a warning will be
+raised and each row will be parsed individually by ``dateutil.parser.parse``. The safest
way to parse dates is to explicitly set ``format=``.

.. ipython:: python
@@ -994,7 +993,9 @@ way to parse dates is to explicitly set ``format=``.
df

In the case that you have mixed datetime formats within the same column, you can
-pass ``format='mixed'``
+pass ``format='mixed'``. pandas will parse the rows with the best format found (the one
+that matches the most rows), and then iteratively parse the remaining rows with the
+remaining formats.

.. ipython:: python

1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.19.0.rst
@@ -765,6 +765,7 @@ Previously if ``.to_datetime()`` encountered mixed integers/floats and strings,
This will now convert integers/floats with the default unit of ``ns``.

.. ipython:: python
   :okwarning:

   pd.to_datetime([1, "foo"], errors="coerce")
Comment on lines 767 to 770
Member

why is this needed?

Author

The warning "UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please..." is now raised whenever any value is a string and no format could be inferred. Previously, it was raised only when the first non-null value was a string, so it was not raised in this example, but it was raised for `pd.to_datetime(["foo", 1], errors="coerce")`, for instance.
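
To make the difference between the two triggers concrete, here is a minimal stdlib-only sketch. The function names `warns_before` and `warns_after` are hypothetical and are not pandas internals. Both assume that no format could be inferred, which is the only case in which the warning is relevant.

```python
def warns_before(values):
    """Old trigger: only the first non-null value is inspected."""
    first = next((v for v in values if v is not None), None)
    return isinstance(first, str)


def warns_after(values):
    """Trigger described in this PR: any string value is enough."""
    return any(isinstance(v, str) for v in values)


# Assumption: no datetime format could be inferred for either input.
order_a = [1, "foo"]   # the whatsnew example
order_b = ["foo", 1]   # same values, string first

print(warns_before(order_a), warns_after(order_a))
print(warns_before(order_b), warns_after(order_b))
```

Under the old check the result depends on element order; under the new one it does not, which is why the `v0.19.0` example needs `:okwarning:`.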


49 changes: 47 additions & 2 deletions doc/source/whatsnew/v2.1.0.rst
@@ -16,8 +16,53 @@ Enhancements

.. _whatsnew_210.enhancements.enhancement1:

enhancement1
^^^^^^^^^^^^
``pd.to_datetime`` now tries to infer the datetime format of a column of strings by
considering a sample of non-null elements (instead of only the first non-null element),
and picks the format that works for the most strings. If several formats work equally
well, the one that matches the ``dayfirst`` parameter is returned. With
``format="mixed"``, pandas does the same thing, then tries the second-best format on the
strings that failed to parse with the best format, and so on (:issue:`52508`).
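
The selection rule described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the pandas implementation. The fixed `CANDIDATES` list, the helper names, and the `sample_size` default are all hypothetical, and the real code derives candidate formats from the strings themselves.

```python
from datetime import datetime

# Hypothetical candidate formats, used only for this sketch.
CANDIDATES = ["%m-%d-%Y", "%d-%m-%Y", "%Y-%m-%d"]


def matches(value, fmt):
    """Return True if ``value`` parses with ``fmt``."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


def evenly_spaced_sample(values, size):
    """Pick up to ``size`` non-null values at evenly spaced positions."""
    non_null = [v for v in values if v is not None]
    if len(non_null) <= size:
        return non_null
    step = len(non_null) / size
    return [non_null[int(i * step)] for i in range(size)]


def guess_format(values, dayfirst=False, sample_size=10):
    """Return the candidate that parses the most sampled strings, or None."""
    sample = evenly_spaced_sample(values, sample_size)
    counts = {fmt: sum(matches(v, fmt) for v in sample) for fmt in CANDIDATES}
    best = max(counts.values(), default=0)
    if best == 0:
        return None
    tied = [fmt for fmt, n in counts.items() if n == best]
    # Tie-break: prefer the day/month order that agrees with ``dayfirst``.
    preferred = "%d-%m" if dayfirst else "%m-%d"
    for fmt in tied:
        if preferred in fmt:
            return fmt
    return tied[0]


print(guess_format(["01-02-2012", "01-03-2012", "30-01-2012"]))
print(guess_format(["01-02-2012", "03-04-2012"], dayfirst=True))
```

In the first call only ``%d-%m-%Y`` parses all three strings, so it wins even though ``dayfirst`` is ``False``. In the second call both orders parse every string, so ``dayfirst`` decides.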

*Previous behavior*:

.. code-block:: ipython

    In [1]: pd.to_datetime(["01-02-2012", "01-03-2012", "30-01-2012"])
    Out[1]:
    ValueError: time data "30-01-2012" doesn't match format "%m-%d-%Y", at position 2. You might want to try:
        - passing `format` if your strings have a consistent format;
        - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
        - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

    In [2]: pd.to_datetime(["01-02-2012", "01-03-2012", "30-01-2012"], errors="coerce")
    Out[2]:
    DatetimeIndex(['2012-01-02', '2012-01-03', 'NaT'], dtype='datetime64[ns]', freq=None)

    In [3]: pd.to_datetime(["01-02-2012", "01-03-2012", "30-01-2012"], format="mixed")
    Out[3]:
    DatetimeIndex(['2012-01-02', '2012-01-03', '2012-01-30'], dtype='datetime64[ns]', freq=None)


*New behavior*:

.. code-block:: ipython

    In [1]: pd.to_datetime(["01-02-2012", "01-03-2012", "30-01-2012"])
    Out[1]:
    UserWarning: Parsing dates in %d-%m-%Y format when dayfirst=False was specified. Pass `dayfirst=True` or specify a format to silence this warning.
    DatetimeIndex(['2012-02-01', '2012-03-01', '2012-01-30'], dtype='datetime64[ns]', freq=None)

    In [2]: pd.to_datetime(["01-02-2012", "01-03-2012", "30-01-2012"], errors="coerce")
    Out[2]:
    UserWarning: Parsing dates in %d-%m-%Y format when dayfirst=False was specified. Pass `dayfirst=True` or specify a format to silence this warning.
    DatetimeIndex(['2012-02-01', '2012-03-01', '2012-01-30'], dtype='datetime64[ns]', freq=None)

    In [3]: pd.to_datetime(["01-02-2012", "01-03-2012", "30-01-2012"], format="mixed")
    Out[3]:
    DatetimeIndex(['2012-02-01', '2012-03-01', '2012-01-30'], dtype='datetime64[ns]', freq=None)
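
The ``format="mixed"`` fallback described in this entry (best format first, then the next-best format on whatever failed) could look roughly like the stdlib-only sketch below. It is not the pandas code. ``CANDIDATES``, ``try_parse`` and ``parse_mixed`` are hypothetical names, and the fourth input string is an invented addition to the example above so that the fallback step is actually exercised.

```python
from datetime import datetime

# Hypothetical candidate formats, used only for this sketch.
CANDIDATES = ["%m-%d-%Y", "%d-%m-%Y", "%Y-%m-%d"]


def try_parse(value, fmt):
    """Return a datetime if ``value`` parses with ``fmt``, else None."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


def parse_mixed(values):
    """Parse with the best-matching format first, then fall back in rank order."""
    # Rank formats by how many strings each one parses (ties keep list order).
    ranked = sorted(
        CANDIDATES,
        key=lambda fmt: sum(try_parse(v, fmt) is not None for v in values),
        reverse=True,
    )
    results = [None] * len(values)
    for fmt in ranked:
        for i, value in enumerate(values):
            # Only retry the strings that every earlier format failed on.
            if results[i] is None:
                results[i] = try_parse(value, fmt)
    return results


# The last string is an assumption added for illustration.
dates = ["01-02-2012", "01-03-2012", "30-01-2012", "2012-01-31"]
for parsed in parse_mixed(dates):
    print(parsed.date())
```

Here ``%d-%m-%Y`` parses three of the four strings, so it is applied first and the first three results agree with the new-behavior output above. The remaining string is then picked up by ``%Y-%m-%d``.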


.. _whatsnew_210.enhancements.enhancement2:
