Skip to content

Handle NaN in array_with_unit_to_datetime #48705

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 13, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions pandas/_libs/tslib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -300,6 +300,10 @@ def array_with_unit_to_datetime(
iresult = values.astype("i8", copy=False)
# fill missing values by comparing to NPY_NAT
mask = iresult == NPY_NAT
# Trying to Convert NaN to integer results in undefined
# behaviour, so handle it explicitly (see GH #48705)
if values.dtype.kind == "f":
mask |= values != values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the relevant hardware, what do you get for iresult when values is [np.nan]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ python3
Python 3.10.7 (main, Sep 11 2022, 08:41:56) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.

import numpy as np
np.array([np.nan]).astype("i8")
array([9223372036854775807])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seberg is getting 9223372036854775807 instead of 9223372036854775808 weird or is it just me?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

42 would also be a valid result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumPy has no guarantees for NaN -> int64 casts currently! (Although they are better implemented, in a sense). So what you get is completely off-bets. For example on arm, you may get 0!
And that breaks mask = iresult == NPY_NAT? There should be tests failing here I would think?

However, NumPy does have guarantees for NaN -> datetime/timedelta casts, so going via i8 seems dubious in this case? The cast triggers undefined behavior in C.

I am not sure what you mean with:

42 would also be a valid result.

@andreas-schwab?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sure why you ask, you already gave the answer above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misread the code. In any case, NumPy handles the cast correctly for datetime64, even if it uses a slower/weirder code path (currently, this is very much fixable!). So the question is wether it doesn't make more sense to cast to datetime64.

Further, the above may cause floating-point warnings in newer NumPy versions, which is something you probably don't want to see!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreas-schwab i'd be OK with either a) investigating seberg's suggestions for a more in-depth "right way" to handle this, or b) adding a comment pointing back to this thread and being done with it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine. can you add a comment pointing back to this PR and a few words about why we're doing this

iresult[mask] = 0
fvalues = iresult.astype("f8") * mult
need_to_iterate = False
Expand Down