
CI/TST: Don't require length for construct_1d_arraylike_from_scalar cast to float64 #47393


Merged
merged 11 commits into from
Jun 22, 2022
5 changes: 4 additions & 1 deletion pandas/core/construction.py
@@ -556,7 +556,10 @@ def sanitize_array(
if dtype is not None and is_float_dtype(data.dtype) and is_integer_dtype(dtype):
# possibility of nan -> garbage
try:
subarr = _try_cast(data, dtype, copy, True)
# GH 47391 numpy > 1.24 will raise a RuntimeError for nan -> int
# casting aligning with IntCastingNaNError below
with np.errstate(invalid="ignore"):
subarr = _try_cast(data, dtype, copy, True)
except IntCastingNaNError:
warnings.warn(
"In a future version, passing float-dtype values containing NaN "
5 changes: 4 additions & 1 deletion pandas/core/dtypes/cast.py
@@ -1696,7 +1696,10 @@ def construct_1d_arraylike_from_scalar(

else:

if length and is_integer_dtype(dtype) and isna(value):
if is_integer_dtype(dtype) and isna(value):
Member:

I think we still need the length here, as this part of the code contains the logic that determines the dtype.

Member:

i.e. revert to original.

if not length:
# GH 47391: numpy > 1.24 will raise filling np.nan into int dtypes
return np.array([], dtype=dtype)
Member:

this seems like a weird case. how does it happen? is it clear that we'd want to prioritize the dtype as being "right" instead of the value?

Member Author:

I posted an example of the state that's reached in numpy/numpy#21784

Namely, before this change, length=0 would pass all the if checks down to np.empty(0, dtype=integer).fill(np.nan), which in numpy < 1.24 silently left the array as np.array([], dtype=integer), but in numpy >= 1.24 will raise.
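That pre-change path can be sketched in isolation (the variable names below are illustrative, not the pandas internals; the sketch tolerates both the old and the new NumPy outcome so it runs on either):

```python
import warnings

import numpy as np

# Sketch of the pre-change path for length=0. On numpy < 1.24 this fill
# was a silent no-op for an empty integer array; on newer numpy the
# NaN -> int scalar cast warns or raises (GH 47391), so both outcomes
# are tolerated here to keep the sketch runnable everywhere.
length = 0
dtype = np.dtype("int64")
subarr = np.empty(length, dtype=dtype)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        subarr.fill(np.nan)   # numpy < 1.24: silently a no-op for size 0
        outcome = "filled"
    except (ValueError, RuntimeError):
        outcome = "raised"    # numpy >= 1.24 behaviour per GH 47391
# Either way, the empty array keeps its integer dtype.
print(subarr.dtype, subarr.size, outcome)
```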

Contributor:

I think the question (also for me from NumPy!) is which pandas code runs into this path. For pandas, the question is what the actual result should be (in the future). For me the question is how bad it will be if that code path breaks. Because especially if it is bad, we may want to make sure it doesn't break yet (from within NumPy).

(At this point I suspect that at least the NaN case may need a work-around in NumPy as well.)

Member Author:

So an example test where this is hit is

        s = Series([], index=pd.date_range(start="2018-01-01", periods=0), dtype=int)
        result = s.apply(lambda x: x)
        tm.assert_series_equal(result, s)

so I think when operating over these empty Series/DataFrames, the value passed in is np.nan, i.e. construct_1d_arraylike_from_scalar(np.nan, length=0, dtype=dtype) is called.

construct_1d_arraylike_from_scalar has an if length and is_integer_dtype(dtype) and isna(value) condition to ensure that integer dtypes are not coerced to float64 for these empty Series/DataFrames (because we relied on np.empty(0, dtype=integer).fill(np.nan) behaving like np.array([], dtype=integer)).

Member Author:

So for these empty-like cases in pandas, it appears pandas was relying on np.empty(0, dtype=integer).fill(np.nan) to preserve integer dtypes. Having pandas explicitly preserve the dtype for these empty cases, e.g. np.array([], dtype=integer), is an okay change to make on our end IMO.
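The explicit preservation described here amounts to a one-line early return; a minimal standalone sketch (not the pandas function itself):

```python
import numpy as np

# Minimal standalone sketch (not the pandas function itself): for a
# zero-length request, build the empty array of the requested dtype
# directly instead of filling NaN into it, so integer dtypes survive.
def empty_preserving_dtype(dtype):
    return np.array([], dtype=dtype)

out = empty_preserving_dtype(np.int64)
print(out.dtype, out.size)
```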

Member:

sgtm. (but why not keep this bit unchanged and just put the subarr.fill(value) on L1715 inside an if length: to skip the fill for all zero-length arrays?)
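That suggestion amounts to something like the following (a hypothetical sketch of the alternative, not the merged code):

```python
import numpy as np

# Hypothetical sketch of the suggested alternative (not the merged
# code): keep the dtype logic unchanged and only guard the fill, so a
# zero-length array is never filled with a value it cannot hold.
def construct_filled(value, length, dtype):
    subarr = np.empty(length, dtype=dtype)
    if length:
        subarr.fill(value)
    return subarr

ints = construct_filled(3, 2, np.int64)        # filled normally
empty = construct_filled(np.nan, 0, np.int64)  # fill skipped, dtype kept
print(ints.tolist(), empty.dtype)
```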

Member Author:

Ah sure I can make the change there instead

# coerce if we have nan for an integer dtype
dtype = np.dtype("float64")
elif isinstance(dtype, np.dtype) and dtype.kind in ("U", "S"):
32 changes: 18 additions & 14 deletions pandas/core/reshape/merge.py
@@ -1200,23 +1200,27 @@ def _maybe_coerce_merge_keys(self) -> None:

# check whether ints and floats
elif is_integer_dtype(rk.dtype) and is_float_dtype(lk.dtype):
if not (lk == lk.astype(rk.dtype))[~np.isnan(lk)].all():
warnings.warn(
"You are merging on int and float "
"columns where the float values "
"are not equal to their int representation.",
UserWarning,
)
# GH 47391 numpy > 1.24 will raise a RuntimeError for nan -> int
Contributor:

for 1.5 we ought to actually remove the nans first

Member Author:

So in 1.5, add a deprecation noting that nans will be dropped?

Contributor:

no, I mean I think you can remove the nans before comparing to avoid the warning (this is all internal anyhow)

Member Author:

Ah gotcha. Yeah can clean this for 1.5 in a separate PR
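The guarded merge-key comparison can be reproduced in isolation (the lk values and rk dtype below are made up for illustration):

```python
import numpy as np

# Standalone reproduction of the guarded merge-key check. Casting a
# float key containing NaN to the int key's dtype can emit an "invalid
# value" floating-point warning on newer numpy; np.errstate suppresses
# it, and the comparison then only looks at the non-NaN positions.
lk = np.array([1.0, 2.5, np.nan])
rk_dtype = np.dtype("int64")
with np.errstate(invalid="ignore"):
    casted = lk.astype(rk_dtype)
all_equal = bool((lk == casted)[~np.isnan(lk)].all())
print(all_equal)  # False: 2.5 != 2, so the merge would warn
```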

with np.errstate(invalid="ignore"):
if not (lk == lk.astype(rk.dtype))[~np.isnan(lk)].all():
warnings.warn(
"You are merging on int and float "
"columns where the float values "
"are not equal to their int representation.",
UserWarning,
)
continue

elif is_float_dtype(rk.dtype) and is_integer_dtype(lk.dtype):
if not (rk == rk.astype(lk.dtype))[~np.isnan(rk)].all():
warnings.warn(
"You are merging on int and float "
"columns where the float values "
"are not equal to their int representation.",
UserWarning,
)
# GH 47391 numpy > 1.24 will raise a RuntimeError for nan -> int
with np.errstate(invalid="ignore"):
if not (rk == rk.astype(lk.dtype))[~np.isnan(rk)].all():
warnings.warn(
"You are merging on int and float "
"columns where the float values "
"are not equal to their int representation.",
UserWarning,
)
continue

# let's infer and see if we are ok