Skip to content

REF: mask values in loc.__setitem__ with bool indexer #45501

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jan 31, 2022

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

The motivation here is the 1-line change in Block.setitem, which will enable us to share Block.setitem with EABackedBlock.setitem.

cc @phofl thoughts on how to make this prettier or more robust?

# mask = np.array([True, False, True, False, True])
# ser[mask] = ser * 2 # <- works
# ser.loc[mask] = ser * 2 # <- works
# ser.iloc[mask] = ser * 2 # <- raises
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loc and setitem align, while iloc does not. Hence this should not work for iloc. This was changed for 1.2 I think and was intended

if ndim == 1:
value = value[pi]
indexer = (newkey, icols)
# TODO: is it sketchy that icols is an ndarray and value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ok, something like

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


df.loc[[False, True, True], ["a"]] = pd.Series([10, 20, 30])

is valid and runs through there

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess. still seems weird as the analogous behavior with ndarray doesn't work

arr = np.random.randn(3, 2)
values = np.array([1, 2, 3])

arr[:, 0] = values  # <- works
arr[:, [0]] = values  # <- raises

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is odd yes


elif ndim == 2 and value.shape[1] == 1:
if isinstance(value, ABCDataFrame):
value = value.iloc[newkey]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes bugs:

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

df.loc[[False, False, True], ["a"]] = pd.DataFrame({"a": [10, 20, 30]}, index=[2, 1, 0])

This returns

    a  b
0   1  4
1   2  5
2  10  6

on main, which is correct and

     a    b
0  1.0  4.0
1  2.0  5.0
2  NaN  6.0

on this branch, which is wrong

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to be aligned before applying iloc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, will update+test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so using align_series/align_frame fixes the examples you've provided, but breaks test_loc_setitem_boolean_mask_allfalse, test_setitem_loc_only_false_indexer_dtype_changed, test_loc_setitem_all_false_boolean_two_blocks, all cases where mask.sum() == 0. ATM i can get the tests passing by putting in a check specific to that case, but its definitely a kludge

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, it looks like the align functions can not deal with all False boolean indexers if I understand this correctly. This looks like a bug in there...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you set the indexer to something like indexer = (np.array([False, False, False]), 0) you get the identity.

if isinstance(value, ABCDataFrame):
value = value.iloc[newkey]
else:
value = value[pi]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

´This case is for arrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

and len(icols) == 1
):
if ndim == 1:
value = value[pi]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here as with iloc below.

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

df.loc[[False, False, True], ["a"]] = pd.Series([10, 11, 12], index=[2, 1, 0])

This should set 10 instead of NaN

@final
def _maybe_mask_setitem_value(self, indexer, value):
"""
If we have obj.iloc[mask] = arraylike and arraylike has the same
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean loc here?

Copy link
Member

@phofl phofl Jan 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think arraylike is not correct, we do this only when the rhs has an index. We align not apply the mask too

df = pd.DataFrame({"a": [1, 2, 3], "b": 1})

df.loc[[False, False, True], "b"] = np.array([10, 11, 12])

Raises, which is what I would have expected

I think this should continue to raise


if is_scalar_indexer(icols, self.ndim - 1) and ndim == 1:
# e.g. test_loc_setitem_boolean_mask_allfalse
value = value[pi]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as for the iloc case below

df = pd.DataFrame({"a": [1, 2, 3], "b": 1})

df.loc[[False, False, True], "a"] = pd.Series([10, 12, 13], index=[2, 1, 0])

Should also set 10

@phofl
Copy link
Member

phofl commented Jan 20, 2022

This is tricky, in case of Series/DataFrame I think reindexing the value should work. In case of objects without indexes (e.g. np.array) this should raise I think

@jbrockmendel
Copy link
Member Author

This is tricky, in case of Series/DataFrame I think reindexing the value should work. In case of objects without indexes (e.g. np.array) this should raise I think

I like this idea, will update to let non-Series/DataFrame fall through.

@jbrockmendel
Copy link
Member Author

Updated to address comments and add tests. special-cases all-False cases with FIXME comments to punt on #45501 (comment).

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Refactor Internal refactoring of code labels Jan 23, 2022
@jreback
Copy link
Contributor

jreback commented Jan 23, 2022

@jreback jreback added this to the 1.5 milestone Jan 25, 2022
@jreback
Copy link
Contributor

jreback commented Jan 31, 2022

status here?

@jbrockmendel
Copy link
Member Author

status here?

ready to go. looking forward to some nice de-duplication this will allow.

@jreback jreback merged commit c64fbce into pandas-dev:main Jan 31, 2022
@jbrockmendel jbrockmendel deleted the ref-bool_indexer branch January 31, 2022 00:21
phofl pushed a commit to phofl/pandas that referenced this pull request Feb 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants