Skip to content

Handle NaN in array_with_unit_to_datetime #48705

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 13, 2022
Merged

Handle NaN in array_with_unit_to_datetime #48705

merged 3 commits into from
Dec 13, 2022

Conversation

andreas-schwab
Copy link
Contributor

@andreas-schwab andreas-schwab commented Sep 22, 2022

closes #48610

Missing values in arrays can either be NaT (for integer types) or NaN (for
floating types). Make sure to also handle the latter. This fixes all
failing tests.

@mroeschke
Copy link
Member

Does this fix an existing Github issue? Would need tests as well

@mroeschke mroeschke added Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Sep 22, 2022
@andreas-schwab
Copy link
Contributor Author

andreas-schwab commented Sep 22, 2022 via email

@mroeschke
Copy link
Member

You mentioned

This fixes all failing tests

What tests were failing?

@andreas-schwab
Copy link
Contributor Author

andreas-schwab commented Sep 22, 2022 via email

@mroeschke
Copy link
Member

So this is related to #48610?

@andreas-schwab
Copy link
Contributor Author

andreas-schwab commented Sep 22, 2022 via email

@mroeschke mroeschke added this to the 1.6 milestone Sep 23, 2022
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay. cc @jbrockmendel if any further comments.

@@ -301,6 +301,9 @@ def array_with_unit_to_datetime(
iresult = values.astype("i8", copy=False)
# fill missing values by comparing to NPY_NAT
mask = iresult == NPY_NAT
# also handle NaN
if values.dtype.kind == "f":
mask |= values != values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the relevant hardware, what do you get for iresult when values is [np.nan]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ python3
Python 3.10.7 (main, Sep 11 2022, 08:41:56) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.

import numpy as np
np.array([np.nan]).astype("i8")
array([9223372036854775807])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seberg is getting 9223372036854775807 instead of 9223372036854775808 weird or is it just me?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

42 would also be a valid result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumPy has no guarantees for NaN -> int64 casts currently! (Although they are better implemented, in a sense). So what you get is completely off-bets. For example on arm, you may get 0!
And that breaks mask = iresult == NPY_NAT? There should be tests failing here I would think?

However, NumPy does have guarantees for NaN -> datetime/timedelta casts, so going via i8 seems dubious in this case? The cast triggers undefined behavior in C.

I am not sure what you mean with:

42 would also be a valid result.

@andreas-schwab?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sure why you ask, you already gave the answer above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misread the code. In any case, NumPy handles the cast correctly for datetime64, even if it uses a slower/weirder code path (currently, this is very much fixable!). So the question is wether it doesn't make more sense to cast to datetime64.

Further, the above may cause floating-point warnings in newer NumPy versions, which is something you probably don't want to see!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreas-schwab i'd be OK with either a) investigating seberg's suggestions for a more in-depth "right way" to handle this, or b) adding a comment pointing back to this thread and being done with it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine. can you add a comment pointing back to this PR and a few words about why we're doing this

@andreas-schwab
Copy link
Contributor Author

andreas-schwab commented Oct 11, 2022 via email

@mroeschke mroeschke modified the milestones: 1.6, 2.0 Oct 13, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Dec 3, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Missing values in arrays can either be NaT (for integer types) or NaN (for
floating types).  Make sure to also handle the latter.  This fixes all
failing tests.
@MarcoGorelli
Copy link
Member

Would https://github.com/pandas-dev/pandas/pull/50183/files also address the issue?

@andreas-schwab
Copy link
Contributor Author

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me, and I think the comment's clear enough now. I've just merged with upstream/main to get CI green

@MarcoGorelli MarcoGorelli merged commit 6598797 into pandas-dev:main Dec 13, 2022
@MarcoGorelli
Copy link
Member

thanks @andreas-schwab

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pandas.to_datetime fails to handle numpy.nan on riscv64 due to dependency on undefined behaviour
5 participants