Skip to content

Remove undefined behavior from npy_datetimestruct_to_datetime #55151

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 46 commits into from
Nov 15, 2023

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Sep 15, 2023

This helps the readability of things as is, but also helps us better implement bounds checking in a later PR to avoid overflow

@jbrockmendel
Copy link
Member

The current implementation is vendored from numpy, which recently expose this for us in their C API so we can de-duplicate it once our minimum version catches up.

@WillAyd
Copy link
Member Author

WillAyd commented Sep 16, 2023

I think we can upstream this again once we have the UB cleaned up. It's a few years out and numpy doesn't have the amount of tests hitting this as we do right?

@mroeschke mroeschke added the Internals Related to non-user accessible pandas implementation label Sep 18, 2023
@WillAyd WillAyd requested a review from MarcoGorelli as a code owner October 17, 2023 21:10
@WillAyd
Copy link
Member Author

WillAyd commented Oct 17, 2023

@jbrockmendel I took this a bit further so you can see where this would head. gcc/clang/msvc all provide builtin functions to check for overflow, which can even be dispatched down at the hardware level. By leveraging those functions, we can avoid introducing any undefined behavior.

The next evolution of this could get rid of the check_dts_bounds function and just rely on the libraries / standard Python error handling, which should help rip out a lot of code and potentially avoid some errors in the checks

@simonjayhawkins simonjayhawkins added the Refactor Internal refactoring of code label Oct 18, 2023
@WillAyd WillAyd changed the title Refactor npy_datetimestruct_to_datetime Remove undefined behavior from npy_datetimestruct_to_datetime Oct 19, 2023
@WillAyd
Copy link
Member Author

WillAyd commented Oct 20, 2023

Hmm so going through this it turns out that our current bounds checking isn't quite correct.

>>> pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145224193')

This gets stored as a struct with the following members / values:

{year = 1677, month = 9, day = 21, hour = 0, min = 12, sec = 43, us = 145224, ps = 193000, as = 0}

On main the switch case to convert this to a nanosecond resolution is:

            case NPY_FR_ps:
                ret = ((((days * 24 + dts->hour) * 60 + dts->min) * 60 +
                        dts->sec) *
                           1000000 +
                       dts->us) *
                          1000000 +
                      dts->ps;

Converting this to a Python function with the appropriate values:

>>> days = -106752
>>> hour = 0
>>> min = 12
>>> sec = 43
>>> us = 145224
>>> ((((days * 24 + hour) * 60 + min) * 60 + sec) * 1000000 + us) * 1000
-9223372036854776000

Which exceeds the limits of an int64.

Since the numpy datetimestruct does not store ns but we instead need to multiply us * 1000 to get that value, I think we need to adjust our Timestamp.min / Timestamp.max to be +1000/-1000 ns respectively to avoid overflow

check_dts_bounds(dts, creso)
try:
result = pydatetime_to_dt64(val, dts, reso=creso)
except OverflowError as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick e->err. i dislike 1-letter names

iresult[i] = pydate_to_dt64(val, &dts)
check_dts_bounds(&dts)
try:
iresult[i] = pydate_to_dt64(val, &dts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just have pydate_to_dt64 raise OutOfBoundsDatetime to avoid having to re-raise?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea giving the error handling a better pass as we speak. Thanks for the idea

@jbrockmendel
Copy link
Member

On main the switch case to convert this to a nanosecond resolution is: [...] case NPY_FR_ps

we dont use FR_ps anywhere. is this reachable?

@jbrockmendel
Copy link
Member

big picture im pretty happy to defer to @WillAyd on this, would like someone with stronger c-fu than me to review the c edits.

@WillAyd
Copy link
Member Author

WillAyd commented Oct 20, 2023

we dont use FR_ps anywhere. is this reachable?

Yea sorry typo on my end. But the problem exists in the NPY_FR_ns block as well. Essentially multiplying us * 1000 when you are within 1 microsecond of the theoretical storage limit will overflow

@jbrockmendel
Copy link
Member

I'll prioritize taking another look this afternoon

try:
ts.value = npy_datetimestruct_to_datetime(self._creso, &dts)
except OverflowError as err:
# TODO: create shared function to create this format from dts struct
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be worth doing this as a preliminary. id be much more comfortable merging that as a standalone without the C stuff

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I just took care of that here because its pretty small, but can split out if a deal breaker

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a deal breaker, just easier for me to merge promptly

@mroeschke mroeschke added this to the 2.2 milestone Nov 14, 2023
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@WillAyd
Copy link
Member Author

WillAyd commented Nov 14, 2023

Any objections on this one @jbrockmendel or good to merge?

@jbrockmendel
Copy link
Member

taking another look now

@@ -194,6 +194,11 @@ cdef get_implementation_bounds(
raise NotImplementedError(reso)


cdef object dts_to_iso_string(npy_datetimestruct *dts):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the return here be str?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason - didn't think much of it. Makes sense though - just pushed this up

@jbrockmendel
Copy link
Member

One nitpick, no objections. Haven't looked at the C code that closely

@mroeschke mroeschke merged commit e477eee into pandas-dev:main Nov 15, 2023
@mroeschke
Copy link
Member

Thanks @WillAyd

@jbrockmendel
Copy link
Member

I just started seeing a bunch of

  ../../pandas/_libs/src/vendored/numpy/datetime/np_datetime.c:355:40: warning: incompatible pointer types passing 'int64_t *' (aka 'long long *') to parameter of type 'long *' [-Wincompatible-pointer-types]
    return checked_int64_sub(year, 1970, result);
                                         ^~~~~~
  ../../pandas/_libs/src/vendored/numpy/datetime/np_datetime.c:45:69: note: expanded from macro 'checked_int64_sub'
  #define checked_int64_sub(a, b, res) __builtin_ssubl_overflow(a, b, res)
                                                                      ^~~
  ../../pandas/_libs/src/vendored/numpy/datetime/np_datetime.c:359:39: warning: incompatible pointer types passing 'int64_t *' (aka 'long long *') to parameter of type 'long *' [-Wincompatible-pointer-types]
    return checked_int64_mul(years, 12, result);
                                        ^~~~~~
  ../../pandas/_libs/src/vendored/numpy/datetime/np_datetime.c:46:69: note: expanded from macro 'checked_int64_mul'
  #define checked_int64_mul(a, b, res) __builtin_smull_overflow(a, b, res)

[truncated]

can/should we do something about this?

@WillAyd
Copy link
Member Author

WillAyd commented Nov 15, 2023

Yea we should. What platform are you on?

@jbrockmendel
Copy link
Member

Mac (pre-M1)

@WillAyd
Copy link
Member Author

WillAyd commented Nov 15, 2023

Ok thanks. The pre-processor condition on line 43 has to be to blame. Will see what I can find for macOS (hoping it's in CI logs too).

In your case if you deleted that #if directive and kept the true branch I think it would resolve your issues

@WillAyd WillAyd deleted the refactor-np-datetime branch November 15, 2023 22:04
@WillAyd
Copy link
Member Author

WillAyd commented Nov 15, 2023

#55977 should be a more permanent fix

@jbrockmendel
Copy link
Member

jbrockmendel commented Nov 15, 2023

Two other follow-up requests that are not time-sensitive but i want to make sure don't fall through the cracks: 1) try to upstream this to numpy so our versions dont diverge from theirs, 2) deduplicate the "catch OverflowError and re-raise as OutOfBoundsDatetime"

update: would it be difficult to define OutOfBoundsDatetime in the C code instead of in the cython?

@WillAyd
Copy link
Member Author

WillAyd commented Nov 15, 2023

Yea I agree moving that exception down to the extension would be great - just need to spend more time on that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Internals Related to non-user accessible pandas implementation Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants