BUG: to_datetime with decimal number doesn't fail for %Y%m%d #50054

MarcoGorelli · 2022-12-04T14:32:44Z

closes BUG: to_datetime with decimal number doesn't fail for %Y%m%d #50051
closes PERF: to_datime fastpath for %Y%m%d is slower #17410
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Previously, %Y%m%d was excluded from the ISO8601 fast path, and went down its own special path.

This:

introduced complexity
introduced bugs (like the one linked)

%Y%m%d is listed as an example of an ISO8601 date on Wikipedia (https://en.wikipedia.org/wiki/ISO_8601#Calendar_dates):

the standard allows both the "YYYY-MM-DD" and YYYYMMDD formats for complete calendar date representations

So, I'd suggest just getting rid of the special %Y%m%d path, and just going down the ISO8601 path
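To illustrate that both spellings denote the same calendar date, here is a minimal sketch using only the standard library's `datetime.strptime`. It says nothing about which code path pandas takes internally, and the sample date is an arbitrary choice.

```python
from datetime import datetime

# The same calendar date written in the two ISO 8601 forms quoted above.
extended = datetime.strptime("2012-12-31", "%Y-%m-%d")  # YYYY-MM-DD
basic = datetime.strptime("20121231", "%Y%m%d")  # YYYYMMDD

# Both spellings parse to the same datetime value.
print(extended == basic)
```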

Performance degrades (roughly 3x slower in the timings below) and is now on par with other non-ISO8601 formats. I think this is fine - if the code is buggy, it doesn't matter how fast it is.

On upstream/main:

In [1]: dates = pd.date_range('1700', '2200').strftime('%Y%m%d')

In [2]: %%timeit
   ...: to_datetime(dates, format='%Y%m%d')
   ...: 
   ...: 
68.6 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

On this branch:

In [1]: dates = pd.date_range('1700', '2200').strftime('%Y%m%d')

In [2]: %%timeit
   ...: to_datetime(dates, format='%Y%m%d')
   ...: 
   ...: 
212 ms ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that this is comparable performance with other non-ISO8601 formats:

In [4]: %%timeit
   ...: to_datetime(dates, format='%Y foo %m bar %d')
   ...: 
   ...: 
235 ms ± 7.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The fastest thing in this example would actually be to not specify a format, but then that would fail on 6-digit %Y%m%d dates (see #17410 (comment))

MarcoGorelli · 2022-12-04T14:33:58Z

pandas/_libs/tslibs/parsing.pyi

-def try_parse_year_month_day(
-    years: npt.NDArray[np.object_],  # object[:]
-    months: npt.NDArray[np.object_],  # object[:]
-    days: npt.NDArray[np.object_],  # object[:]
-) -> npt.NDArray[np.object_]: ...


Removing this function altogether: special-casing %Y%m%d just introduces bugs.

The non-ISO8601 path can handle this format just fine, so let's use that.

Does this impact perf? I'm assuming this was introduced as a fastpath.

MarcoGorelli · 2022-12-04T14:44:33Z

pandas/tests/tools/test_to_datetime.py

    def test_to_datetime_format_YYYYMMDD_ignore(self, cache):
        # coercion
        # GH 7930
        ser = Series([20121231, 20141231, 99991231])
+        expected = Series([20121231, 20141231, 99991231], dtype=object)
        result = to_datetime(ser, format="%Y%m%d", errors="ignore", cache=cache)
-        expected = Series(
-            [datetime(2012, 12, 31), datetime(2014, 12, 31), datetime(9999, 12, 31)],
-            dtype=object,
-        )
        tm.assert_series_equal(result, expected)


This test looks like it wasn't right to begin with. If the input is invalid, errors='ignore' should return the input

that's what we do for literally any other format, e.g.

In [46]: to_datetime(Series([201212, 201412, 999912]), errors='ignore', format='%Y%m')
Out[46]:
0    201212
1    201412
2    999912
dtype: object
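The rule argued for here (an unparseable value under errors='ignore' hands the input back untouched) can be sketched in plain Python. This is a toy model, not pandas' implementation: `parse_all`, `LOWER` and `UPPER` are hypothetical names, the bounds only approximate the range of a nanosecond-resolution timestamp, and `None` stands in for NaT.

```python
from datetime import datetime

# Approximate nanosecond-resolution timestamp bounds (assumption for this sketch).
LOWER = datetime(1677, 9, 22)
UPPER = datetime(2262, 4, 11)


def parse_all(values, fmt, errors="raise"):
    """Toy model of the three error modes; hypothetical helper, not pandas code."""
    parsed = []
    for value in values:
        try:
            dt = datetime.strptime(str(value), fmt)
            if not (LOWER <= dt <= UPPER):
                raise ValueError(f"{value!r} is out of bounds")
        except ValueError:
            if errors == "raise":
                raise
            if errors == "ignore":
                return list(values)  # hand the input back untouched
            parsed.append(None)  # errors="coerce": stand-in for NaT
        else:
            parsed.append(dt)
    return parsed


values = [20121231, 20141231, 99991231]
print(parse_all(values, "%Y%m%d", errors="ignore"))
print(parse_all(values, "%Y%m%d", errors="coerce"))
```

In this model a single out-of-bounds entry (99991231) is enough for errors="ignore" to return all of the original integers, which is the behaviour the updated test expects.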

MarcoGorelli · 2022-12-04T14:55:07Z

pandas/tests/tools/test_to_datetime.py

-        result = to_datetime(ser, format="%Y%m%d", cache=cache)
-        tm.assert_series_equal(result, expected)
+        with pytest.raises(ValueError, match=None):
+            to_datetime(ser, format="%Y%m%d", cache=cache)


this should've always raised - 19801222.0 doesn't match the format "%Y%m%d"
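A quick way to see why the decimal form does not match is to try the same strings against the standard library's strict parser. This is a minimal sketch: `matches_format` is a hypothetical helper, and `datetime.strptime` is used only as a reference for strict format matching, not as a description of pandas internals.

```python
from datetime import datetime


def matches_format(text, fmt="%Y%m%d"):
    """Return True if `text` is consumed exactly by `fmt` (hypothetical helper)."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True


print(matches_format("19801222"))    # True
print(matches_format("19801222.0"))  # False: the trailing '.0' is left unconverted
```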

MarcoGorelli · 2022-12-04T18:28:19Z

pandas/core/tools/datetimes.py

    >>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
-    datetime.datetime(1300, 1, 1, 0, 0)
+    '13000101'


this one also looks like it was wrong to begin with - we don't return datetime.datetime with errors='ignore' for any other format, e.g.

In [62]: to_datetime('1300', format='%Y', errors='ignore')
Out[62]: '1300'

Are there many other cases where we get pydatetime objects instead of Timestamps? If not, then I think it makes sense to tighten this and say "you always get Timestamps or ignore/raise". If so, then I think returning the pydatetime objects is better here than the string (and probably also for '1300' with '%Y').

I've opened #50072, let's move this discussion there

Dr-Irv · 2022-12-05T17:40:52Z

pandas/tests/tools/test_to_datetime.py

-        result = to_datetime(ser2, format="%Y%m%d", cache=cache)
-        tm.assert_series_equal(result, expected)
+        with pytest.raises(ValueError, match=None):
+            to_datetime(ser2, format="%Y%m%d", cache=cache)


Might be worth adding a test where the NaN value is pd.NA, as the following should work:

ser3 = ser.astype("Int64")
ser3[2] = pd.NA
to_datetime(ser3, format="%Y%m%d")

Dr-Irv · 2022-12-05T17:41:49Z

pandas/tests/tools/test_to_datetime.py

@@ -125,24 +126,21 @@ def test_to_datetime_format_YYYYMMDD_with_nat(self, cache):
        expected[2] = np.nan


There's no need for expected if the call raises. That said, it raises because np.nan changes the dtype to float, so this could be a loss of functionality for people who have missing data.

Thanks - I've marked this as draft, as there might be a better solution out there which doesn't degrade performance. Just working on it.
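The dtype concern in this thread can be reproduced without pandas. In the sketch below, `parse_yyyymmdd` is a hypothetical helper and `None` stands in for NaT. The integer list with an explicit missing marker is only loosely analogous to an Int64 Series holding pd.NA, and the float list mimics what happens once np.nan has forced the values to float.

```python
import math
from datetime import datetime


def parse_yyyymmdd(values):
    """Parse YYYYMMDD values strictly; None and NaN pass through as None."""
    parsed = []
    for value in values:
        is_nan = isinstance(value, float) and math.isnan(value)
        if value is None or is_nan:
            parsed.append(None)  # stand-in for NaT
            continue
        # str(19801222) is '19801222', but str(19801222.0) is '19801222.0',
        # which the strict format rejects with ValueError.
        parsed.append(datetime.strptime(str(value), "%Y%m%d"))
    return parsed


# Integers with an explicit missing marker parse fine.
print(parse_yyyymmdd([19801222, 19801223, None]))

# Once NaN has forced the values to float, the valid entries no longer match.
try:
    parse_yyyymmdd([19801222.0, 19801223.0, float("nan")])
except ValueError as exc:
    print("raised:", exc)
```

Under these assumptions the missing value itself is not the problem: the failure comes from the remaining entries being rendered with a trailing ".0".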

jbrockmendel · 2022-12-05T19:54:54Z

pandas/core/tools/datetimes.py

@@ -1241,60 +1198,6 @@ def coerce(values):
    return values


-def _attempt_YYYYMMDD(arg: npt.NDArray[np.object_], errors: str) -> np.ndarray | None:


I think there's an issue about this fastpath not actually being fast. If this is removed, the issue is probably closeable.

Unfortunately, not quite

I've left a comment there, but removing it wouldn't fix the issue because 6-digit %Y%m%d dates wouldn't get parsed

MarcoGorelli · 2022-12-14T08:39:58Z

closing, probably superseded by #50242

remove ymd special-path

2004fb7

MarcoGorelli force-pushed the remove-fallback branch from b24ec6e to 65ee89d December 4, 2022 14:35

MarcoGorelli force-pushed the remove-fallback branch from 65ee89d to 2004fb7 December 4, 2022 14:37

MarcoGorelli added the Datetime (Datetime data dtype) and Bug labels Dec 4, 2022

MarcoGorelli and others added 2 commits December 4, 2022 17:28

Merge branch 'main' into remove-fallback

65149de

fix doctest

83a4ec6

MarcoGorelli requested review from Dr-Irv, jbrockmendel, WillAyd and mroeschke December 4, 2022 18:29

MarcoGorelli mentioned this pull request Dec 5, 2022

PERF: to_datime fastpath for %Y%m%d is slower #17410

Closed

MarcoGorelli added the Performance (Memory or execution speed performance) label Dec 5, 2022

MarcoGorelli requested a review from jreback December 5, 2022 09:21

MarcoGorelli added 2 commits December 5, 2022 09:28

Merge remote-tracking branch 'upstream/main' into remove-fallback

754fdfe

update whatsnew

69a487a

MarcoGorelli marked this pull request as draft December 5, 2022 12:16

MarcoGorelli added 5 commits December 5, 2022 12:53

use slow path

a004790

fixup post-merge

ad8a56a

Merge remote-tracking branch 'upstream/main' into remove-fallback

22ca184

simplify

6e8280b

clean up

55dc073

MarcoGorelli marked this pull request as ready for review December 5, 2022 14:26

MarcoGorelli marked this pull request as draft December 5, 2022 15:49

MarcoGorelli marked this pull request as ready for review December 5, 2022 16:09

MarcoGorelli marked this pull request as draft December 5, 2022 16:12

Dr-Irv requested changes Dec 5, 2022

jbrockmendel reviewed Dec 5, 2022

MarcoGorelli closed this Dec 14, 2022
