-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Remove undefined behavior from npy_datetimestruct_to_datetime #55151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The current implementation is vendored from numpy, which recently expose this for us in their C API so we can de-duplicate it once our minimum version catches up. |
I think we can upstream this again once we have the UB cleaned up. It's a few years out and numpy doesn't have the amount of tests hitting this as we do right? |
@jbrockmendel I took this a bit further so you can see where this would head. gcc/clang/msvc all provide builtin functions to check for overflow, which can even be dispatched down at the hardware level. By leveraging those functions, we can avoid introducing any undefined behavior. The next evolution of this could get rid of the |
Hmm so going through this it turns out that our current bounds checking isn't quite correct. >>> pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145224193') This gets stored as a struct with the following members / values: {year = 1677, month = 9, day = 21, hour = 0, min = 12, sec = 43, us = 145224, ps = 193000, as = 0} On main the switch case to convert this to a nanosecond resolution is: case NPY_FR_ps:
ret = ((((days * 24 + dts->hour) * 60 + dts->min) * 60 +
dts->sec) *
1000000 +
dts->us) *
1000000 +
dts->ps; Converting this to a Python function with the appropriate values: >>> days = -106752
>>> hour = 0
>>> min = 12
>>> sec = 43
>>> us = 145224
>>> ((((days * 24 + hour) * 60 + min) * 60 + sec) * 1000000 + us) * 1000
-9223372036854776000 Which exceeds the limits of an int64. Since the numpy datetimestruct does not store ns but we instead need to multiply us * 1000 to get that value, I think we need to adjust our Timestamp.min / Timestamp.max to be +1000/-1000 ns respectively to avoid overflow |
pandas/_libs/tslibs/conversion.pyx
Outdated
check_dts_bounds(dts, creso) | ||
try: | ||
result = pydatetime_to_dt64(val, dts, reso=creso) | ||
except OverflowError as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick e->err. i dislike 1-letter names
pandas/_libs/tslib.pyx
Outdated
iresult[i] = pydate_to_dt64(val, &dts) | ||
check_dts_bounds(&dts) | ||
try: | ||
iresult[i] = pydate_to_dt64(val, &dts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just have pydate_to_dt64 raise OutOfBoundsDatetime to avoid having to re-raise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea giving the error handling a better pass as we speak. Thanks for the idea
we dont use FR_ps anywhere. is this reachable? |
big picture im pretty happy to defer to @WillAyd on this, would like someone with stronger c-fu than me to review the c edits. |
Yea sorry typo on my end. But the problem exists in the NPY_FR_ns block as well. Essentially multiplying |
I'll prioritize taking another look this afternoon |
pandas/_libs/tslibs/timestamps.pyx
Outdated
try: | ||
ts.value = npy_datetimestruct_to_datetime(self._creso, &dts) | ||
except OverflowError as err: | ||
# TODO: create shared function to create this format from dts struct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be worth doing this as a preliminary. id be much more comfortable merging that as a standalone without the C stuff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I just took care of that here because its pretty small, but can split out if a deal breaker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a deal breaker, just easier for me to merge promptly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Any objections on this one @jbrockmendel or good to merge? |
taking another look now |
pandas/_libs/tslibs/np_datetime.pyx
Outdated
@@ -194,6 +194,11 @@ cdef get_implementation_bounds( | |||
raise NotImplementedError(reso) | |||
|
|||
|
|||
cdef object dts_to_iso_string(npy_datetimestruct *dts): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the return here be str?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason - didn't think much of it. Makes sense though - just pushed this up
One nitpick, no objections. Haven't looked at the C code that closely |
Thanks @WillAyd |
I just started seeing a bunch of
can/should we do something about this? |
Yea we should. What platform are you on? |
Mac (pre-M1) |
Ok thanks. The pre-processor condition on line 43 has to be to blame. Will see what I can find for macOS (hoping it's in CI logs too). In your case if you deleted that #if directive and kept the true branch I think it would resolve your issues |
#55977 should be a more permanent fix |
Two other follow-up requests that are not time-sensitive but i want to make sure don't fall through the cracks: 1) try to upstream this to numpy so our versions dont diverge from theirs, 2) deduplicate the "catch OverflowError and re-raise as OutOfBoundsDatetime" update: would it be difficult to define OutOfBoundsDatetime in the C code instead of in the cython? |
Yea I agree moving that exception down to the extension would be great - just need to spend more time on that |
This helps the readability of things as is, but also helps us better implement bounds checking in a later PR to avoid overflow