-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Fix bug in loc setitem changing the dtype when condition is False #37672
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
� Conflicts: � pandas/tests/frame/indexing/test_setitem.py
pandas/core/internals/blocks.py
Outdated
else: | ||
# e.g. we are bool dtype and value is nan | ||
# TODO: watch out for case with listlike value and scalar/empty indexer | ||
if is_list_like(value) and is_empty_indexer(indexer, np.array(value)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put this as another elif case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/internals/blocks.py
Outdated
@@ -920,10 +920,11 @@ def setitem(self, indexer, value): | |||
|
|||
elif lib.is_scalar(value) and not isna(value): | |||
dtype, _ = infer_dtype_from_scalar(value, pandas_dtype=True) | |||
|
|||
elif is_list_like(value) and is_empty_indexer(indexer, np.array(value)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any way to avoid the np.array call? i think that makes a copy if we have eg a list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, is_empty_indexer
requires an ndarray, while value may be a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the test is for loc.__setitem__
. is that (or i guess iloc.__setitem__
) the only way we can get here? it may make sense to do this in setitem_with_indexer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that would be possible I think.
Sport would be in
pandas/pandas/core/indexing.py
Line 1567 in 524fc9c
for i, idx in enumerate(indexer): |
Ran the tests in indexing, frame/indexing and series/indexing locally, this seems to work. Let me know, if we should move this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use np.asarray, also can we check len of the indexer here first to short-cut?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx used asarray. Indexer is a tuple containing an empty list, len does not work here unfortunately.
@@ -298,6 +298,13 @@ def test_iloc_setitem_bool_indexer(self, klass): | |||
expected = DataFrame({"flag": ["x", "y", "z"], "value": [2, 3, 4]}) | |||
tm.assert_frame_equal(df, expected) | |||
|
|||
def test_setitem_only_false_indexer_dtype_changed(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put this below in the class TestDataFrameSetItemBooleanMask?
which part(s) of the line df.loc[[False], ["b"]] = 10 - df["c"]
are relevant to the bugfix here? e.g. does it matter that the right hand side is a Series? Would this be testing the right thing if it were series.loc[[False]]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df.loc[[False], ["b"]] = 10
did not trigger the error. We need a Series on the rhs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the test
� Conflicts: � doc/source/whatsnew/v1.2.0.rst
� Conflicts: � doc/source/whatsnew/v1.2.0.rst
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -469,6 +469,7 @@ Indexing | |||
- Bug in :meth:`Index.where` incorrectly casting numeric values to strings (:issue:`37591`) | |||
- Bug in :meth:`Series.loc` and :meth:`DataFrame.loc` raises when numeric label was given for object :class:`Index` although label was in :class:`Index` (:issue:`26491`) | |||
- Bug in :meth:`DataFrame.loc` returned requested key plus missing values when ``loc`` was applied to single level from :class:`MultiIndex` (:issue:`27104`) | |||
- Bug in :meth:`DataFrame.loc.__setitem__` changed dtype when indexer was completely ``False`` (:issue:`37550`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed -> changing
these should usually be participles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, will follow this in the future
@@ -339,3 +339,13 @@ def test_setitem_boolean_mask(self, mask_type, float_frame): | |||
expected = df.copy() | |||
expected.values[np.array(mask)] = np.nan | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
def test_setitem_only_false_indexer_dtype_changed(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you comment and/or edit the test name to make clear
- is iloc affected?
- does it matter that you have
["b"]
instead of, say,"b"
or ":"?
comment that the bug is for the value being set being a Series obj
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is Series affected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Series is not affected. Goes only wrong with ["b"]
Iloc
is actually also affected, but has to be fixed differently probably. In case of loc
we get an empty indexer. In case of iloc
we get an indexer like ([False],)
, which is not empty per is_empty_indexer
. We could adjust is_empty_indexer
or we have to go a different way here
� Conflicts: � doc/source/whatsnew/v1.2.0.rst
pandas/core/internals/blocks.py
Outdated
@@ -920,10 +920,11 @@ def setitem(self, indexer, value): | |||
|
|||
elif lib.is_scalar(value) and not isna(value): | |||
dtype, _ = infer_dtype_from_scalar(value, pandas_dtype=True) | |||
|
|||
elif is_list_like(value) and is_empty_indexer(indexer, np.array(value)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use np.asarray, also can we check len of the indexer here first to short-cut?
expected = DataFrame({"a": ["a"], "b": [1], "c": [1]}) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
df.loc[[False], ["b"]] = 10 - 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can simplify to 9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx
# GH#37550 | ||
# Dtype is only changed when value to set is a Series | ||
df = DataFrame({"a": ["a"], "b": [1], "c": [1]}) | ||
df.loc[[False], ["b"]] = 10 - df["c"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make clear either in the test name and/or comment that/whether the type of indexer for ["b"]
is relevant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment and parametrized test
pandas/core/internals/blocks.py
Outdated
@@ -928,10 +928,11 @@ def setitem(self, indexer, value): | |||
|
|||
elif lib.is_scalar(value) and not isna(value): | |||
dtype, _ = infer_dtype_from_scalar(value, pandas_dtype=True) | |||
|
|||
elif is_list_like(value) and is_empty_indexer(indexer, np.asarray(value)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im not sure this is the right place to catch this, since i think we are specifically interested in all-False boolean indexers coming through loc. does this get here via setitem_with_indexer, and if so, can you point out specifically which branch(es)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Through _setitem_with_indexer_split_path
-> _setitem_with_indexer_2d_value
-> _setitem_single_column
as mentioned in #37672 (comment) we could do that in setitem_with_indexer
if this is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In _setitem_with_indexer_split_path L 1686-1688 we have a no-op block that is for all-false boolean masks. could move it up to before the 2D check. would that do the trick?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possible solution is moving up the lplane_indexer==0
condition and removing the value part. Probably the shortest and does not seem to break tests. Don't know if this kills behavior not tested, but if indexer has len=0, we probably do not want to set values anyway
� Conflicts: � doc/source/whatsnew/v1.2.0.rst � pandas/core/indexing.py
expected = DataFrame({"a": ["a"], "b": [1], "c": [1]}) | ||
tm.assert_frame_equal(df, expected) | ||
|
||
df.loc[[False], ["b"]] = 9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should [False]
be func(value)
as above? if not, this might belong as a separate test
@phofl can you rebase and will look again |
merged master |
@jbrockmendel ok here? |
@phofl can you update |
Looks like the |
ok let's move off the milestone for rn, can always do after the rc |
@phofl status of this (merge master) |
@phofl worth fixing up? |
Yeah definitely. i hope that I can get back to this friday or friday next week |
# Conflicts: # doc/source/whatsnew/v1.3.0.rst # pandas/core/indexing.py
I moved the fix for the all False case. This fixes an inconsistency in _align_series and covers the all False indexer case. I could not figure the empty list thing out. The values look exactly the same as in the all False case. We probably will have to fix this someplace else, but not sure right now. I woild propose open an issue about that and go from there. |
lgtm @jbrockmendel if any comments |
will take a look this afternoon. (got booster yesterday, moving slow today) |
@@ -2058,6 +2058,8 @@ def ravel(i): | |||
# we have a frame, with multiple indexers on both axes; and a | |||
# series, so need to broadcast (see GH5206) | |||
if sum_aligners == self.ndim and all(is_sequence(_) for _ in indexer): | |||
if is_empty_indexer(indexer[0], ser._values): | |||
return ser._values.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
evidently it works, but this seems like a weird place to handle this. is my intuition wrong here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ensures, that an indexer like (np.array([]), np.array([1]))
is handled the same as (np.array([]), 1)
, hence I handled it here.
Does this relate to #43632? |
I don't think so. The all False case is an empty not a boolean indexer when jumping into the setitem logic |
def test_setitem_loc_empty_indexer_raises_with_non_empty_value(self, box): | ||
# GH#37672 | ||
df = DataFrame({"a": ["a"], "b": [1], "c": [1]}) | ||
if box == Series: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick "box is Series", i can change this in my next CLN branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
thanks @phofl |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This fixes the bug, but I am not sure, if the solution is good enough....