-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: to_datetime with decimal number doesn't fail for %Y%m%d #50054
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b24ec6e
to
65ee89d
Compare
def try_parse_year_month_day( | ||
years: npt.NDArray[np.object_], # object[:] | ||
months: npt.NDArray[np.object_], # object[:] | ||
days: npt.NDArray[np.object_], # object[:] | ||
) -> npt.NDArray[np.object_]: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing this function altogether, special-casing %Y%m%d
just introduces bugs
The non-ISO8601 path can handle this format just fine, let's use that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this impact perf? im assuming this was introduced as a fastpath
65ee89d
to
2004fb7
Compare
def test_to_datetime_format_YYYYMMDD_ignore(self, cache): | ||
# coercion | ||
# GH 7930 | ||
ser = Series([20121231, 20141231, 99991231]) | ||
expected = Series([20121231, 20141231, 99991231], dtype=object) | ||
result = to_datetime(ser, format="%Y%m%d", errors="ignore", cache=cache) | ||
expected = Series( | ||
[datetime(2012, 12, 31), datetime(2014, 12, 31), datetime(9999, 12, 31)], | ||
dtype=object, | ||
) | ||
tm.assert_series_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test looks like it wasn't right to begin with. If the input is invalid, errors='ignore' should return the input
that's what we do for literally any other format, e.g.
In [46]: to_datetime(Series([201212, 201412, 999912]), errors='ignore', format='%Y%m')
Out[46]:
0 201212
1 201412
2 999912
dtype: object
result = to_datetime(ser, format="%Y%m%d", cache=cache) | ||
tm.assert_series_equal(result, expected) | ||
with pytest.raises(ValueError, match=None): | ||
to_datetime(ser, format="%Y%m%d", cache=cache) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should've always raised - 19801222.0
doesn't match the format "%Y%m%d"
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore') | ||
datetime.datetime(1300, 1, 1, 0, 0) | ||
'13000101' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one also looks like it was wrong to begin with - we don't return datetime.datetime
with errors='ignore'
for any other format, e.g.
In [62]: to_datetime('1300', format='%Y', errors='ignore')
Out[62]: '1300'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there many other cases where we get pydatetime objects instead of Timestamps? If no, then i think it makes sense to tighten and say "you always get Timestamps or ignore/raise". If yes, then i think returning the pydatetime objects is better here than the string (and probably also for '1300' with '%Y')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened #50072, let's move this discussion there
result = to_datetime(ser2, format="%Y%m%d", cache=cache) | ||
tm.assert_series_equal(result, expected) | ||
with pytest.raises(ValueError, match=None): | ||
to_datetime(ser2, format="%Y%m%d", cache=cache) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth adding a test where the NaN
value is pd.NA
, as the following should work:
ser3 = ser.astype("Int64")
ser3[2] = pd.NA
to_datetime(ser3, format="%Y%m%d")
@@ -125,24 +126,21 @@ def test_to_datetime_format_YYYYMMDD_with_nat(self, cache): | |||
expected[2] = np.nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need to have expected
if it raises. Although the reason it is raising is because np.nan
changes the dtype
to float. So this could be a loss of functionality when people have missing data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks - I've marked as draft as there might be a better solution out there which doesn't degrade performance, just working on it
@@ -1241,60 +1198,6 @@ def coerce(values): | |||
return values | |||
|
|||
|
|||
def _attempt_YYYYMMDD(arg: npt.NDArray[np.object_], errors: str) -> np.ndarray | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think there's an issue about this fastpath not actually being fast. if this is removed, the issue is probably closeable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, not quite
I've left a comment there, but removing it wouldn't fix the issue because 6-digit %Y%m%d dates wouldn't get parsed
closing, probably superseded by #50242 |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Previously, %Y%m%d was excluded from the ISO8601 fast path, and went down its own special path.
This:
%Y%m%d is listed as an example of ISO8601 date on Wiki https://en.wikipedia.org/wiki/ISO_8601#Calendar_dates:
So, I'd suggest just getting rid of the special %Y%m%d path, and just going down the ISO8601 path
Performance degrades, and is now on-par with other non-ISO8601 formats. I think this is fine - if the code is buggy, it doesn't matter how fast it is.
On upstream/main:
On this branch:
Note that this is comparable performance with other non-ISO8601 formats:
The fastest thing in this example would actually be to not specify a
format
, but then that would fail on 6-digit%Y%m%d
dates (see #17410 (comment))