-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Handle NaN in array_with_unit_to_datetime #48705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Does this fix an existing Github issue? Would need tests as well |
The existing tests already cover it perfectly.
|
You mentioned
What tests were failing? |
https://build.opensuse.org/package/live_build_log/openSUSE:Factory:RISCV/python-pandas:test-py310/standard/riscv64
=========================== short test summary info ============================
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_convert_dates_infer[trade_time]
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_convert_dates_infer[date]
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_convert_dates_infer[datetime]
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_convert_dates_infer[sold_at]
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_convert_dates_infer[modified]
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_convert_dates_infer[timestamp]
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_convert_dates_infer[timestamps]
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_date_unit[s]
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_date_unit[ms]
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_date_unit[us]
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_to_json_from_json_columns_dtypes[split]
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_to_json_from_json_columns_dtypes[records]
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_to_json_from_json_columns_dtypes[index]
FAILED pandas/tests/io/json/test_pandas.py::TestPandasContainer::test_to_json_from_json_columns_dtypes[columns]
FAILED pandas/tests/tools/test_to_datetime.py::TestToDatetimeUnit::test_to_datetime_unit_with_nulls[nan]
= 15 failed, 143030 passed, 9020 skipped, 1047 xfailed, 6 xpassed, 18 warnings in 16656.29s (4:37:36) =
|
So this is related to #48610? |
Yes.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks okay. cc @jbrockmendel if any further comments.
@@ -301,6 +301,9 @@ def array_with_unit_to_datetime( | |||
iresult = values.astype("i8", copy=False) | |||
# fill missing values by comparing to NPY_NAT | |||
mask = iresult == NPY_NAT | |||
# also handle NaN | |||
if values.dtype.kind == "f": | |||
mask |= values != values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the relevant hardware, what do you get for iresult when values is [np.nan]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ python3
Python 3.10.7 (main, Sep 11 2022, 08:41:56) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
import numpy as np
np.array([np.nan]).astype("i8")
array([9223372036854775807])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seberg is getting 9223372036854775807 instead of 9223372036854775808 weird or is it just me?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
42 would also be a valid result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NumPy has no guarantees for NaN -> int64
casts currently! (Although they are better implemented, in a sense). So what you get is completely off-bets. For example on arm, you may get 0!
And that breaks mask = iresult == NPY_NAT
? There should be tests failing here I would think?
However, NumPy does have guarantees for NaN -> datetime/timedelta
casts, so going via i8
seems dubious in this case? The cast triggers undefined behavior in C.
I am not sure what you mean with:
42 would also be a valid result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No sure why you ask, you already gave the answer above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Misread the code. In any case, NumPy handles the cast correctly for datetime64, even if it uses a slower/weirder code path (currently, this is very much fixable!). So the question is wether it doesn't make more sense to cast to datetime64.
Further, the above may cause floating-point warnings in newer NumPy versions, which is something you probably don't want to see!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreas-schwab i'd be OK with either a) investigating seberg's suggestions for a more in-depth "right way" to handle this, or b) adding a comment pointing back to this thread and being done with it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is fine. can you add a comment pointing back to this PR and a few words about why we're doing this
$ python3
Python 3.10.7 (main, Sep 11 2022, 08:41:56) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
>> import numpy as np
>> np.array([np.nan]).astype("i8")
array([9223372036854775807])
|
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Missing values in arrays can either be NaT (for integer types) or NaN (for floating types). Make sure to also handle the latter. This fixes all failing tests.
Would https://github.com/pandas-dev/pandas/pull/50183/files also address the issue? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me, and I think the comment's clear enough now. I've just merged with upstream/main to get CI green
thanks @andreas-schwab |
closes #48610
Missing values in arrays can either be NaT (for integer types) or NaN (for
floating types). Make sure to also handle the latter. This fixes all
failing tests.