Improve to_datetime bounds checking #50183

rebecca-palmer · 2022-12-11T15:50:37Z

to_datetime(errors='raise') is supposed to raise an exception on out-of-bounds input. However, it uses rounding to integer for this bounds checking but not the actual conversion, and hence, for input just outside the bounds it may instead return NaT (on x86) or output clipped to the bounds (on arm).

It also assumes that converting NaN to int gives NaT, which may not be true on unusual hardware. (This was how I originally noticed the problem, as Debian tries to build packages on a wide range of hardware.)

This patch fixes both issues. (In the Debian logs linked here, search for 'near-limits test', the first run is unpatched 1.3.5, the second run is patched 1.5.x.)

Its effect on speed is unclear: it is sometimes faster and sometimes slower, possibly because other load on the build machines varies.

[this bug does not appear to be previously reported ] closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
[N/A ] Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

pandas/_libs/tslib.pyx

rebecca-palmer · 2022-12-11T15:58:32Z

pandas/tests/tools/test_to_datetime.py

+        result2=to_datetime(should_succeed, unit="Y", errors='coerce')
+        result3=to_datetime(should_succeed, unit="Y", errors='ignore')
+        tm.assert_series_equal(result1, result2)
+        tm.assert_series_equal(result1, result3)


A check that compared to a non to_datetime correct value would be better, but there don't seem to be any "allclose" checks in this file, and using exact equality risks rounding error failing it

Done. (Possibly the reason there weren't any ~~allclose~~ assert_almost_equal here was that that doesn't work on datetime arrays: pandas/_testing/asserters.py:731 calls assert_numpy_array_equal, which doesn't have an atol/rtol.)

Reported that as #50191

MarcoGorelli

If I've understood correctly:

Before:

values would get cast to i8 (thus losing some precision, e.g. 292.32702463 -> 292)
the i8 cast would then get cast back to f8, but once this has been done, the result could be smaller than what you started with, because of the loss of precision
the bounds check might (erroneously) not pass

Whereas now:

values are multiplied directly with mult
as there's no loss of precision, the out-of-bounds check no longer erroneously passes

That's one change. The other change is to use values != values in the mask, as was done in #48705. Perhaps let's merge that one first, then rebase this one?

I've just left a question, as one change looks good but orthogonal (it's fine, just checking)

pandas/_libs/tslib.pyx

MarcoGorelli

#48705 is in - could you rebase please and we can get this in too?

github-actions · 2023-01-13T02:36:44Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2023-01-18T20:25:09Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

MarcoGorelli · 2023-01-18T20:56:32Z

I'll add a commit here when I get a chance (hopefully early next week), I think we should ship this, and @rebecca-palmer did some seriously good work here

rebecca-palmer · 2023-01-18T22:30:04Z

Fixed conflicts (which was quite a large change, so wait for the CI before merging - as well as #48705 having done part of this, #50263 moved the code being patched to a non-Cython context, and #50301 forbids the case I was using as a test).

Unrelated to my changes, but it looks like #50263 may have accidentally undone the effect of #50301 for some input types - @jbrockmendel ?

jbrockmendel · 2023-01-19T00:01:09Z

Unrelated to my changes, but it looks like #50263 may have accidentally undone the effect of #50301 for some input types - @jbrockmendel ?

That's entirely plausible. Would be happy to see a PR fixing that.

rebecca-palmer · 2023-01-19T07:54:23Z

That CI fail looks like the same one main then had - retrying.

MarcoGorelli

Looks good to me - only minor comments would be to add a whatsnew note in the 2.0.0 bug fixes section, and ideally to have a link to the github issue (or to the pull request if there's no issue) in the test (check some other tests for examples)

(tsmax_in_days happens to be close to an integer, and this is a test of rounding)

rebecca-palmer · 2023-01-19T21:29:28Z

Fixed, but also removed the unnecessary copy, so wait for the CI to check that it really is unnecessary.

MarcoGorelli

Looks good to me pending green, thanks @rebecca-palmer !

rebecca-palmer · 2023-01-20T07:36:08Z

Does the Windows CI timing out suggest there's a hang bug in this, or does that just happen sometimes?

(Numpy Dev also fails on main, for reasons that are nothing to do with pandas: numpy/numpy#23033)

mroeschke · 2023-01-20T17:25:55Z

Thanks @rebecca-palmer

* add test for float to_datetime near overflow bounds * fix float to_datetime near overflow bounds * fix typo and formatting * fix formatting * fix test to not fail on rounding differences * don't use approximate comparison on datetimes, it doesn't work * also can't convert datetime to float * match dtypes * TST: don't try to use non-integer years (see pandas-dev#50301) * TST: don't cross an integer (tsmax_in_days happens to be close to an integer, and this is a test of rounding) * PERF: remove unnecessary copy * add whatsnew

Avoid assuming that NaN casts to NaT (= fails on riscv64/hppa ?) Don't round to int for the bounds check when we don't for the real conversion (wrong near the bounds, and maybe also a waste of time) Author: Rebecca N. Palmer <[email protected]> Forwarded: pandas-dev/pandas#50183 Gbp-Pq: Name float_to_datetime.patch

rebecca-palmer added 2 commits December 11, 2022 14:53

add test for float to_datetime near overflow bounds

f3d0d15

fix float to_datetime near overflow bounds

dff5a3e

rebecca-palmer commented Dec 11, 2022

View reviewed changes

pandas/_libs/tslib.pyx Outdated Show resolved Hide resolved

rebecca-palmer commented Dec 11, 2022

View reviewed changes

pandas/_libs/tslib.pyx Outdated Show resolved Hide resolved

rebecca-palmer commented Dec 11, 2022

View reviewed changes

pandas/_libs/tslib.pyx Outdated Show resolved Hide resolved

rebecca-palmer commented Dec 11, 2022

View reviewed changes

rebecca-palmer added 2 commits December 11, 2022 16:01

fix typo and formatting

9fe5091

fix formatting

bac5f68

MarcoGorelli self-requested a review December 11, 2022 17:11

rebecca-palmer added 5 commits December 11, 2022 17:30

fix test to not fail on rounding differences

535a4ec

Merge branch 'main' into datetimenearbounds

e0b9a8f

don't use approximate comparison on datetimes, it doesn't work

a5012a9

also can't convert datetime to float

088eda5

match dtypes

8b9c050

rebecca-palmer marked this pull request as ready for review December 12, 2022 07:28

mroeschke added the Datetime Datetime data dtype label Dec 13, 2022

MarcoGorelli reviewed Dec 13, 2022

View reviewed changes

pandas/_libs/tslib.pyx Outdated Show resolved Hide resolved

MarcoGorelli suggested changes Dec 13, 2022

View reviewed changes

This was referenced Dec 15, 2022

REF: to_datetime handle non-object cases outside cython #50263

Merged

BUG: to_datetime with M or Y unit and non-round float #50301

Merged

github-actions bot added the Stale label Jan 13, 2023

mroeschke closed this Jan 18, 2023

mroeschke reopened this Jan 18, 2023

rebecca-palmer added 2 commits January 18, 2023 22:08

Merge branch 'main' into datetimenearbounds

400bb22

TST: don't try to use non-integer years (see pandas-dev#50301)

8ebc910

Merge branch 'main' into datetimenearbounds

5ccb96a

MarcoGorelli reviewed Jan 19, 2023

View reviewed changes

rebecca-palmer added 4 commits January 19, 2023 21:01

TST: don't cross an integer

c17883e

(tsmax_in_days happens to be close to an integer, and this is a test of rounding)

PERF: remove unnecessary copy

4bd1e2a

add whatsnew

bf5542e

Merge branch 'main' into datetimenearbounds

2795783

MarcoGorelli approved these changes Jan 19, 2023

View reviewed changes

mroeschke removed the Stale label Jan 20, 2023

mroeschke approved these changes Jan 20, 2023

View reviewed changes

mroeschke added this to the 2.0 milestone Jan 20, 2023

mroeschke merged commit 0b04174 into pandas-dev:main Jan 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve to_datetime bounds checking #50183

Improve to_datetime bounds checking #50183

rebecca-palmer commented Dec 11, 2022 •

edited

Loading

rebecca-palmer Dec 11, 2022

rebecca-palmer Dec 11, 2022

rebecca-palmer Dec 11, 2022

MarcoGorelli left a comment

MarcoGorelli left a comment

github-actions bot commented Jan 13, 2023

mroeschke commented Jan 18, 2023

MarcoGorelli commented Jan 18, 2023

rebecca-palmer commented Jan 18, 2023

jbrockmendel commented Jan 19, 2023

rebecca-palmer commented Jan 19, 2023

MarcoGorelli left a comment

rebecca-palmer commented Jan 19, 2023

MarcoGorelli left a comment

rebecca-palmer commented Jan 20, 2023

mroeschke commented Jan 20, 2023

Improve to_datetime bounds checking #50183

Improve to_datetime bounds checking #50183

Conversation

rebecca-palmer commented Dec 11, 2022 • edited Loading

rebecca-palmer Dec 11, 2022

Choose a reason for hiding this comment

rebecca-palmer Dec 11, 2022

Choose a reason for hiding this comment

rebecca-palmer Dec 11, 2022

Choose a reason for hiding this comment

MarcoGorelli left a comment

Choose a reason for hiding this comment

MarcoGorelli left a comment

Choose a reason for hiding this comment

github-actions bot commented Jan 13, 2023

mroeschke commented Jan 18, 2023

MarcoGorelli commented Jan 18, 2023

rebecca-palmer commented Jan 18, 2023

jbrockmendel commented Jan 19, 2023

rebecca-palmer commented Jan 19, 2023

MarcoGorelli left a comment

Choose a reason for hiding this comment

rebecca-palmer commented Jan 19, 2023

MarcoGorelli left a comment

Choose a reason for hiding this comment

rebecca-palmer commented Jan 20, 2023

mroeschke commented Jan 20, 2023

rebecca-palmer commented Dec 11, 2022 •

edited

Loading