Skip to content

BUG: to_datetime with decimal number doesn't fail for %Y%m%d #50054

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Dec 4, 2022

Previously, %Y%m%d was excluded from the ISO8601 fast path, and went down its own special path.

This:

  • introduced complexity
  • introduced bugs (like the one linked)

%Y%m%d is listed as an example of ISO8601 date on Wiki https://en.wikipedia.org/wiki/ISO_8601#Calendar_dates:

the standard allows both the "YYYY-MM-DD" and YYYYMMDD formats for complete calendar date representations

So, I'd suggest just getting rid of the special %Y%m%d path, and just going down the ISO8601 path


Performance degrades, and is now on-par with other non-ISO8601 formats. I think this is fine - if the code is buggy, it doesn't matter how fast it is.

On upstream/main:

In [1]: dates = pd.date_range('1700', '2200').strftime('%Y%m%d')

In [2]: %%timeit
   ...: to_datetime(dates, format='%Y%m%d')
   ...: 
   ...: 
68.6 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

On this branch:

In [1]: dates = pd.date_range('1700', '2200').strftime('%Y%m%d')

In [2]: %%timeit
   ...: to_datetime(dates, format='%Y%m%d')
   ...: 
   ...: 
212 ms ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that this is comparable performance with other non-ISO8601 formats:

In [4]: %%timeit
   ...: to_datetime(dates, format='%Y foo %m bar %d')
   ...: 
   ...: 
235 ms ± 7.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The fastest thing in this example would actually be to not specify a format, but then that would fail on 6-digit %Y%m%d dates (see #17410 (comment))

Comment on lines -30 to -34
def try_parse_year_month_day(
years: npt.NDArray[np.object_], # object[:]
months: npt.NDArray[np.object_], # object[:]
days: npt.NDArray[np.object_], # object[:]
) -> npt.NDArray[np.object_]: ...
Copy link
Member Author

@MarcoGorelli MarcoGorelli Dec 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing this function altogether, special-casing %Y%m%d just introduces bugs

The non-ISO8601 path can handle this format just fine, let's use that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this impact perf? im assuming this was introduced as a fastpath

Comment on lines 137 to 143
def test_to_datetime_format_YYYYMMDD_ignore(self, cache):
# coercion
# GH 7930
ser = Series([20121231, 20141231, 99991231])
expected = Series([20121231, 20141231, 99991231], dtype=object)
result = to_datetime(ser, format="%Y%m%d", errors="ignore", cache=cache)
expected = Series(
[datetime(2012, 12, 31), datetime(2014, 12, 31), datetime(9999, 12, 31)],
dtype=object,
)
tm.assert_series_equal(result, expected)
Copy link
Member Author

@MarcoGorelli MarcoGorelli Dec 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test looks like it wasn't right to begin with. If the input is invalid, errors='ignore' should return the input

that's what we do for literally any other format, e.g.

In [46]: to_datetime(Series([201212, 201412, 999912]), errors='ignore', format='%Y%m')
Out[46]:
0    201212
1    201412
2    999912
dtype: object

@MarcoGorelli MarcoGorelli added Datetime Datetime data dtype Bug labels Dec 4, 2022
Comment on lines -128 to +129
result = to_datetime(ser, format="%Y%m%d", cache=cache)
tm.assert_series_equal(result, expected)
with pytest.raises(ValueError, match=None):
to_datetime(ser, format="%Y%m%d", cache=cache)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should've always raised - 19801222.0 doesn't match the format "%Y%m%d"

Comment on lines 935 to +936
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
datetime.datetime(1300, 1, 1, 0, 0)
'13000101'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one also looks like it was wrong to begin with - we don't return datetime.datetime with errors='ignore' for any other format, e.g.

In [62]: to_datetime('1300', format='%Y', errors='ignore')
Out[62]: '1300'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there many other cases where we get pydatetime objects instead of Timestamps? If no, then i think it makes sense to tighten and say "you always get Timestamps or ignore/raise". If yes, then i think returning the pydatetime objects is better here than the string (and probably also for '1300' with '%Y')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened #50072, let's move this discussion there

@MarcoGorelli MarcoGorelli added the Performance Memory or execution speed performance label Dec 5, 2022
@MarcoGorelli MarcoGorelli requested a review from jreback December 5, 2022 09:21
@MarcoGorelli MarcoGorelli marked this pull request as draft December 5, 2022 12:16
@MarcoGorelli MarcoGorelli marked this pull request as ready for review December 5, 2022 14:26
@MarcoGorelli MarcoGorelli marked this pull request as draft December 5, 2022 15:49
@MarcoGorelli MarcoGorelli marked this pull request as ready for review December 5, 2022 16:09
@MarcoGorelli MarcoGorelli marked this pull request as draft December 5, 2022 16:12
result = to_datetime(ser2, format="%Y%m%d", cache=cache)
tm.assert_series_equal(result, expected)
with pytest.raises(ValueError, match=None):
to_datetime(ser2, format="%Y%m%d", cache=cache)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding a test where the NaN value is pd.NA, as the following should work:

ser3 = ser.astype("Int64")
ser3[2] = pd.NA
to_datetime(ser3, format="%Y%m%d")

@@ -125,24 +126,21 @@ def test_to_datetime_format_YYYYMMDD_with_nat(self, cache):
expected[2] = np.nan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need to have expected if it raises. Although the reason it is raising is because np.nan changes the dtype to float. So this could be a loss of functionality when people have missing data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks - I've marked as draft as there might be a better solution out there which doesn't degrade performance, just working on it

@@ -1241,60 +1198,6 @@ def coerce(values):
return values


def _attempt_YYYYMMDD(arg: npt.NDArray[np.object_], errors: str) -> np.ndarray | None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think there's an issue about this fastpath not actually being fast. if this is removed, the issue is probably closeable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, not quite

I've left a comment there, but removing it wouldn't fix the issue because 6-digit %Y%m%d dates wouldn't get parsed

@MarcoGorelli
Copy link
Member Author

closing, probably superseded by #50242

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: to_datetime with decimal number doesn't fail for %Y%m%d PERF: to_datime fastpath for %Y%m%d is slower
3 participants