-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Improve to_datetime bounds checking #50183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve to_datetime bounds checking #50183
Conversation
result2=to_datetime(should_succeed, unit="Y", errors='coerce') | ||
result3=to_datetime(should_succeed, unit="Y", errors='ignore') | ||
tm.assert_series_equal(result1, result2) | ||
tm.assert_series_equal(result1, result3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A check that compared to a non to_datetime correct value would be better, but there don't seem to be any "allclose" checks in this file, and using exact equality risks rounding error failing it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. (Possibly the reason there weren't any allclose assert_almost_equal here was that that doesn't work on datetime arrays: pandas/_testing/asserters.py:731 calls assert_numpy_array_equal, which doesn't have an atol/rtol.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reported that as #50191
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I've understood correctly:
Before:
values
would get cast toi8
(thus losing some precision, e.g.292.32702463
->292
)- the
i8
cast would then get cast back tof8
, but once this has been done, the result could be smaller than what you started with, because of the loss of precision - the bounds check might (erroneously) not pass
Whereas now:
values
are multiplied directly withmult
- as there's no loss of precision, the out-of-bounds check no longer erroneously passes
That's one change. The other change is to use values != values
in the mask, as was done in #48705. Perhaps let's merge that one first, then rebase this one?
I've just left a question, as one change looks good but orthogonal (it's fine, just checking)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#48705 is in - could you rebase please and we can get this in too?
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
I'll add a commit here when I get a chance (hopefully early next week), I think we should ship this, and @rebecca-palmer did some seriously good work here |
Fixed conflicts (which was quite a large change, so wait for the CI before merging - as well as #48705 having done part of this, #50263 moved the code being patched to a non-Cython context, and #50301 forbids the case I was using as a test). Unrelated to my changes, but it looks like #50263 may have accidentally undone the effect of #50301 for some input types - @jbrockmendel ? |
That's entirely plausible. Would be happy to see a PR fixing that. |
That CI fail looks like the same one main then had - retrying. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me - only minor comments would be to add a whatsnew note in the 2.0.0 bug fixes section, and ideally to have a link to the github issue (or to the pull request if there's no issue) in the test (check some other tests for examples)
(tsmax_in_days happens to be close to an integer, and this is a test of rounding)
Fixed, but also removed the unnecessary copy, so wait for the CI to check that it really is unnecessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me pending green, thanks @rebecca-palmer !
Does the Windows CI timing out suggest there's a hang bug in this, or does that just happen sometimes? (Numpy Dev also fails on main, for reasons that are nothing to do with pandas: numpy/numpy#23033) |
Thanks @rebecca-palmer |
* add test for float to_datetime near overflow bounds * fix float to_datetime near overflow bounds * fix typo and formatting * fix formatting * fix test to not fail on rounding differences * don't use approximate comparison on datetimes, it doesn't work * also can't convert datetime to float * match dtypes * TST: don't try to use non-integer years (see pandas-dev#50301) * TST: don't cross an integer (tsmax_in_days happens to be close to an integer, and this is a test of rounding) * PERF: remove unnecessary copy * add whatsnew
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?) Don't round to int for the bounds check when we don't for the real conversion (wrong near the bounds, and maybe also a waste of time) Author: Rebecca N. Palmer <[email protected]> Forwarded: pandas-dev/pandas#50183 Gbp-Pq: Name float_to_datetime.patch
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?) Don't round to int for the bounds check when we don't for the real conversion (wrong near the bounds, and maybe also a waste of time) Author: Rebecca N. Palmer <[email protected]> Forwarded: pandas-dev/pandas#50183 Gbp-Pq: Name float_to_datetime.patch
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?) Don't round to int for the bounds check when we don't for the real conversion (wrong near the bounds, and maybe also a waste of time) Author: Rebecca N. Palmer <[email protected]> Forwarded: pandas-dev/pandas#50183 Gbp-Pq: Name float_to_datetime.patch
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?) Don't round to int for the bounds check when we don't for the real conversion (wrong near the bounds, and maybe also a waste of time) Author: Rebecca N. Palmer <[email protected]> Forwarded: pandas-dev/pandas#50183 Gbp-Pq: Name float_to_datetime.patch
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?) Don't round to int for the bounds check when we don't for the real conversion (wrong near the bounds, and maybe also a waste of time) Author: Rebecca N. Palmer <[email protected]> Forwarded: pandas-dev/pandas#50183 Gbp-Pq: Name float_to_datetime.patch
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?) Don't round to int for the bounds check when we don't for the real conversion (wrong near the bounds, and maybe also a waste of time) Author: Rebecca N. Palmer <[email protected]> Forwarded: pandas-dev/pandas#50183 Gbp-Pq: Name float_to_datetime.patch
Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?) Don't round to int for the bounds check when we don't for the real conversion (wrong near the bounds, and maybe also a waste of time) Author: Rebecca N. Palmer <[email protected]> Forwarded: pandas-dev/pandas#50183 Gbp-Pq: Name float_to_datetime.patch
to_datetime(errors='raise') is supposed to raise an exception on out-of-bounds input. However, it uses rounding to integer for this bounds checking but not the actual conversion, and hence, for input just outside the bounds it may instead return NaT (on x86) or output clipped to the bounds (on arm).
It also assumes that converting NaN to int gives NaT, which may not be true on unusual hardware. (This was how I originally noticed the problem, as Debian tries to build packages on a wide range of hardware.)
This patch fixes both issues. (In the Debian logs linked here, search for 'near-limits test', the first run is unpatched 1.3.5, the second run is patched 1.5.x.)
Its effect on speed is unclear: it is sometimes faster and sometimes slower, possibly because other load on the build machines varies.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.