Skip to content

REF: mask values in loc.__setitem__ with bool indexer #45501

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jan 31, 2022
58 changes: 56 additions & 2 deletions pandas/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@
is_empty_indexer,
is_exact_shape_match,
is_list_like_indexer,
is_scalar_indexer,
length_of_indexer,
)
from pandas.core.indexes.api import (
Expand Down Expand Up @@ -669,6 +670,56 @@ def _get_setitem_indexer(self, key):

return self._convert_to_indexer(key, axis=0)

@final
def _maybe_mask_setitem_value(self, indexer, value):
"""
If we have obj.iloc[mask] = arraylike and arraylike has the same
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean loc here?

Copy link
Member

@phofl phofl Jan 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think arraylike is not correct, we do this only when the rhs has an index. We align not apply the mask too

df = pd.DataFrame({"a": [1, 2, 3], "b": 1})

df.loc[[False, False, True], "b"] = np.array([10, 11, 12])

Raises, which is what I would have expected

I think this should continue to raise

length as obj, we treat this as obj.iloc[mask] = arraylike[mask],
similar to Series.__setitem__.
"""
# TODO: why do we only do this for loc, not iloc?
# ser = pd.Series(range(5))
# mask = np.array([True, False, True, False, True])
# ser[mask] = ser * 2 # <- works
# ser.loc[mask] = ser * 2 # <- works
# ser.iloc[mask] = ser * 2 # <- raises
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loc and setitem align, while iloc does not. Hence this should not work for iloc. This was changed for 1.2 I think and was intended


if isinstance(indexer, tuple) and len(indexer) == 2:
pi, icols = indexer
ndim = getattr(value, "ndim", 0)
if ndim == 0:
pass
elif com.is_bool_indexer(pi) and len(value) == len(pi):
newkey = pi.nonzero()[0]

if is_scalar_indexer(icols, self.ndim - 1) and ndim == 1:
# e.g. test_loc_setitem_boolean_mask_allfalse
value = value[pi]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as for the iloc case below

df = pd.DataFrame({"a": [1, 2, 3], "b": 1})

df.loc[[False, False, True], "a"] = pd.Series([10, 12, 13], index=[2, 1, 0])

Should also set 10

indexer = (newkey, icols)

elif (
isinstance(icols, np.ndarray)
and icols.dtype.kind == "i"
and len(icols) == 1
):
if ndim == 1:
value = value[pi]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here as with iloc below.

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

df.loc[[False, False, True], ["a"]] = pd.Series([10, 11, 12], index=[2, 1, 0])

This should set 10 instead of NaN

indexer = (newkey, icols)
# TODO: is it sketchy that icols is an ndarray and value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ok, something like

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


df.loc[[False, True, True], ["a"]] = pd.Series([10, 20, 30])

is valid and runs through there

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess. still seems weird as the analogous behavior with ndarray doesn't work

arr = np.random.randn(3, 2)
values = np.array([1, 2, 3])

arr[:, 0] = values  # <- works
arr[:, [0]] = values  # <- raises

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is odd yes

# is 1D? i.e. should we require value to be 2D with
# value.shape[1] == 1?

elif ndim == 2 and value.shape[1] == 1:
if isinstance(value, ABCDataFrame):
value = value.iloc[newkey]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes bugs:

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

df.loc[[False, False, True], ["a"]] = pd.DataFrame({"a": [10, 20, 30]}, index=[2, 1, 0])

This returns

    a  b
0   1  4
1   2  5
2  10  6

on main, which is correct and

     a    b
0  1.0  4.0
1  2.0  5.0
2  NaN  6.0

on this branch, which is wrong

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to be aligned before applying iloc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, will update+test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so using align_series/align_frame fixes the examples you've provided, but breaks test_loc_setitem_boolean_mask_allfalse, test_setitem_loc_only_false_indexer_dtype_changed, test_loc_setitem_all_false_boolean_two_blocks, all cases where mask.sum() == 0. ATM i can get the tests passing by putting in a check specific to that case, but its definitely a kludge

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, it looks like the align functions can not deal with all False boolean indexers if I understand this correctly. This looks like a bug in there...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you set the indexer to something like indexer = (np.array([False, False, False]), 0) you get the identity.

else:
value = value[pi]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

´This case is for arrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

indexer = (newkey, icols)
elif com.is_bool_indexer(indexer):
indexer = indexer.nonzero()[0]

return indexer, value

@final
def _tupleize_axis_indexer(self, key) -> tuple:
"""
Expand Down Expand Up @@ -727,6 +778,10 @@ def __setitem__(self, key, value):
indexer = self._get_setitem_indexer(key)
self._has_valid_setitem_indexer(key)

if self.name == "loc":
# TODO: should we do this for iloc too?
indexer, value = self._maybe_mask_setitem_value(indexer, value)

iloc = self if self.name == "iloc" else self.obj.iloc
iloc._setitem_with_indexer(indexer, value, self.name)

Expand Down Expand Up @@ -1299,8 +1354,7 @@ def _convert_to_indexer(self, key, axis: int):

if com.is_bool_indexer(key):
key = check_bool_indexer(labels, key)
(inds,) = key.nonzero()
return inds
return key
else:
return self._get_listlike_indexer(key, axis)[1]
else:
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/internals/blocks.py
Original file line number Diff line number Diff line change
Expand Up @@ -931,7 +931,7 @@ def setitem(self, indexer, value):

if is_empty_indexer(indexer):
# GH#8669 empty indexers, test_loc_setitem_boolean_mask_allfalse
pass
values[indexer] = value

elif is_scalar_indexer(indexer, self.ndim):
# setting a single element for each dim and with a rhs that could
Expand Down