CI/TST: Don't require length for construct_1d_arraylike_from_scalar cast to float64 #47393

Merged (11 commits) on Jun 22, 2022
5 changes: 4 additions & 1 deletion pandas/core/construction.py
Expand Up @@ -556,7 +556,10 @@ def sanitize_array(
if dtype is not None and is_float_dtype(data.dtype) and is_integer_dtype(dtype):
# possibility of nan -> garbage
try:
subarr = _try_cast(data, dtype, copy, True)
# GH 47391 numpy > 1.24 will raise a RuntimeError for nan -> int
# casting aligning with IntCastingNaNError below
with np.errstate(invalid="ignore"):
subarr = _try_cast(data, dtype, copy, True)
except IntCastingNaNError:
warnings.warn(
"In a future version, passing float-dtype values containing NaN "
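The errstate guard in the hunk above can be sketched in isolation. This is a minimal, hypothetical example (not pandas' actual _try_cast) showing how np.errstate(invalid="ignore") silences the invalid-value signal that newer numpy versions emit when NaN is cast to an integer dtype:

```python
import numpy as np

# Minimal sketch, not pandas' _try_cast: newer numpy versions flag
# NaN -> int casts through the floating-point error state, and
# np.errstate(invalid="ignore") suppresses that signal around the cast.
data = np.array([1.0, np.nan])
with np.errstate(invalid="ignore"):
    out = data.astype(np.int64)  # the value in the NaN slot is undefined

# dtype and shape are preserved even though one element was NaN
```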
5 changes: 4 additions & 1 deletion pandas/core/dtypes/cast.py
Expand Up @@ -1696,7 +1696,10 @@ def construct_1d_arraylike_from_scalar(

else:

if length and is_integer_dtype(dtype) and isna(value):
if is_integer_dtype(dtype) and isna(value):
Member

I think we still need the length here, as this part of the code is the logic that determines the dtype.

Member

i.e. revert to original.

if not length:
# GH 47391: numpy > 1.24 will raise filling np.nan into int dtypes
return np.array([], dtype=dtype)
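A simplified, hypothetical sketch of what this guard accomplishes (names and structure assumed for illustration; not the actual construct_1d_arraylike_from_scalar):

```python
import numpy as np

def fill_from_scalar(value, length, dtype):
    # Hypothetical simplified sketch of the guarded path discussed in
    # this PR, not the real pandas implementation.
    if np.issubdtype(dtype, np.integer) and isinstance(value, float) and np.isnan(value):
        if not length:
            # Zero-length: return an empty array of the requested integer
            # dtype directly, never reaching the NaN -> int fill that
            # newer numpy rejects.
            return np.array([], dtype=dtype)
        # NaN must actually be stored, so coerce to float64.
        dtype = np.dtype("float64")
    out = np.empty(length, dtype=dtype)
    out.fill(value)
    return out
```

With length=0 the requested integer dtype is preserved; with a nonzero length and a NaN value the result is coerced to float64.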
Member

this seems like a weird case. how does it happen? is it clear that we'd want to prioritize the dtype as being "right" instead of the value?

Member Author

I posted an example of the state that's reached in numpy/numpy#21784

Namely, before this change, length=0 would pass all the if checks down to np.empty(0, dtype=integer).fill(np.nan), which in numpy < 1.24 would just return np.array([], dtype=integer) but in numpy >= 1.24 will raise.

Contributor

I think the question (also for me from NumPy!) is which pandas code runs into this path. For pandas, the question is what the actual result should be (in the future). For me the question is how bad it will be if that code path breaks. Because especially if it is bad, we may want to make sure it doesn't break yet (from within NumPy).

(At this point I suspect that at least the NaN case may need a work-around in NumPy as well.)

Member Author

So an example test where this is hit is

        s = Series([], index=pd.date_range(start="2018-01-01", periods=0), dtype=int)
        result = s.apply(lambda x: x)
        tm.assert_series_equal(result, s)

so I think when operating over these empty Series/DataFrames, the value representation is np.nan when construct_1d_arraylike_from_scalar(np.nan, length=0, dtype=dtype) is called.

construct_1d_arraylike_from_scalar has an if length and is_integer_dtype(dtype) and isna(value) condition to ensure that integer dtypes are not coerced to float64 for these empty Series/DataFrames (because we relied on np.empty(0, dtype=integer).fill(np.nan) behaving like np.array([], dtype=integer)).
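That scenario can be exercised standalone against an installed pandas (the resulting dtype depends on the pandas/numpy versions, which is the point of this PR):

```python
import pandas as pd

# Applying over an empty integer-dtype Series hits
# construct_1d_arraylike_from_scalar(np.nan, length=0, dtype=...) as
# described above; with this fix the empty result keeps its dtype.
s = pd.Series([], index=pd.date_range(start="2018-01-01", periods=0), dtype=int)
result = s.apply(lambda x: x)
print(result.dtype, len(result))
```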

Member Author

So for these empty-like cases in pandas, it appears pandas was relying on np.empty(0, dtype=integer).fill(np.nan) to preserve integer dtypes. Having pandas explicitly preserve the dtype for these empty cases, e.g. np.array([], dtype=integer), is an okay change to make on our end IMO.

Member

sgtm. (but why not keep this bit unchanged and just put the subarr.fill(value) on L1715 inside an if length: block, to skip it for all zero-length arrays?)

Member Author

Ah sure I can make the change there instead

# coerce if we have nan for an integer dtype
dtype = np.dtype("float64")
elif isinstance(dtype, np.dtype) and dtype.kind in ("U", "S"):