-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: assignment to multiple columns when some column do not exist #29334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
6e4fe48
to
26bd7cd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests look good and conceptually ideas are find, but need to add the code in reasonable places w/o dealing with complexity of core/indexing.py
some suggestions in-line
pandas/core/frame.py
Outdated
@@ -3007,6 +3007,12 @@ def _setitem_array(self, key, value): | |||
for k1, k2 in zip(key, value.columns): | |||
self[k1] = value[k2] | |||
else: | |||
if all(is_hashable(k) for k in key): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so can you add this in a routine on self.loc
call it
_ensure_listlike_indexers(key, axis=1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently this code always assume that axis
= 1. Do we need to allow adding multiple rows via assignment too?
pandas/core/indexing.py
Outdated
@@ -165,7 +166,28 @@ def _get_loc(self, key: int, axis: int): | |||
def _slice(self, obj, axis: int, kind=None): | |||
return self.obj._slice(obj, axis=axis, kind=kind) | |||
|
|||
def _ensure_listlike_indexer(self, key, axis): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls type these arguments & add a doc-string. make sure to indicate that this is a mutating operation.
pandas/core/indexing.py
Outdated
if ( | ||
self.name == "loc" # column is indexed by name | ||
and isinstance(key, tuple) | ||
and len(key) >= 2 # key is at least 2-dimensional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you have a lot of condictions here, when are you trying to ensure this is being called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.name == "loc"
ensures that this indexer is is indexed by name (yes for loc
, no for ix
).
isinstance(key, tuple) and len(key) >= 2
filter out indexers with only 1 dimension.
is_list_like_indexer(key[1])
ensures that the indexer is indexed by multiple columns.
not com.is_bool_indexer(key[1])
ensures that the indexer is not a boolean indexer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then do this in the _LocIndexer (and call super); we don't want to dispatch on self.name like this (and .ix is going away shortly anyways).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bigger questions, is why is this needed if you are already doing this for frames above.
Exactly which test hits this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then do this in the _LocIndexer (and call super); we don't want to dispatch on self.name like this (and .ix is going away shortly anyways).
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bigger questions, is why is this needed if you are already doing this for frames above.
Exactly which test hits this?
See below.
) | ||
def test_setitem_list_missing_columns(self, float_frame, columns, box): | ||
# GH 26534 | ||
result = float_frame.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather you not construct the expected frame like this, rather include it as another argument in the parameterization; essentially hard code it.
pandas/core/indexing.py
Outdated
@@ -165,7 +166,37 @@ def _get_loc(self, key: int, axis: int): | |||
def _slice(self, obj, axis: int, kind=None): | |||
return self.obj._slice(obj, axis=axis, kind=kind) | |||
|
|||
def _ensure_listlike_indexer(self, key, axis: int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type key (ok for a followon)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how to type key as it can be any listlike type. I was just following the type of _get_listlike_indexer
which leaves key untyped too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, it would be really great to type the indexing keys
Indexable=Iterable
might be enough for now (you can add/import from pandas._typing)
can this be a scalar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be a scalar?
No.
pandas/core/indexing.py
Outdated
if ( | ||
self.name == "loc" # column is indexed by name | ||
and isinstance(key, tuple) | ||
and len(key) >= 2 # key is at least 2-dimensional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then do this in the _LocIndexer (and call super); we don't want to dispatch on self.name like this (and .ix is going away shortly anyways).
pandas/core/indexing.py
Outdated
if ( | ||
self.name == "loc" # column is indexed by name | ||
and isinstance(key, tuple) | ||
and len(key) >= 2 # key is at least 2-dimensional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bigger questions, is why is this needed if you are already doing this for frames above.
Exactly which test hits this?
@jreback any more changes required? |
pandas/core/indexing.py
Outdated
@@ -1733,6 +1734,37 @@ def _getitem_axis(self, key, axis: int): | |||
self._validate_key(key, axis) | |||
return self._get_label(key, axis=axis) | |||
|
|||
def _ensure_listlike_indexer(self, key: Iterable, axis: int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this by-definition should always be on axis=1, yes? i would remove that as an argument and/or be very explicit about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this by-definition should always be on axis=1, yes?
Yes, axis
is always 1.
i would remove that as an argument and/or be very explicit about this.
How to be explicit? Document it in docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@howsiwei thanks for sticking with it here. indexing is already very complicated, I am trying to ensure your change doesn't make it even more complicated. a few more comments.
pandas/core/indexing.py
Outdated
Whether key is a _LocIndexer key | ||
""" | ||
column_axis = 1 | ||
if is_indexer_key: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you really need is_indexer_key? this seems very odd, what exactly happens if you remove this keyword (specifically what about the below code is a problem)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well we need some way to differentiate whether _ensure_listlike_indexer
is called from DataFrame::_setitem_array
or _LocIndexer::_get_setitem_indexer
because key
s are in different formats depending on the caller. I have briefly explained this at #29334 (comment).
Notice that at if is_indexer_key
is true, then I assign
pandas/pandas/core/indexing.py
Line 1754 in 8ccd2b8
key = key[column_axis] |
is_indexer_key
is absolutely essential and the code below the line doesn't make sense if is_indexer_key
is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think you could remove it if you also check is_list_like(key).
i just find this having to separate the keywords very hard to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_list_like
won't work as it returns True
for tuple
and Series
too.
The problem is that _LocIndexer
accepts many possible types so it's hard to tell whether key
is from _LocIndexer
using the type of key
alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just exclude tuples
you can use self.name == 'loc' if you just want to handle that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Currently an error is caused by the difference between df
and df.loc
. Eg.
>>> df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["a", "b"])
>>> df
a b
a 1 3
b 2 4
>>> df[["a"]]
a
a 1
b 2
>>> df.loc[["a"]]
a b
a 1 3
It seems impossible to resolve this without explicitly indicating where key
comes from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback any suggestions?
pandas/core/indexing.py
Outdated
column_axis = 1 | ||
if is_indexer_key: | ||
if not ( | ||
isinstance(key, tuple) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put a blank line between conditions where you add a comment for each one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like this?
if not (
isinstance(key, tuple)
# key is at least 2-dimensional
and len(key) >= 2
# key indexes multiple columns
and is_list_like_indexer(key[column_axis])
and not com.is_bool_indexer(key[column_axis])
):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that is better
pandas/core/indexing.py
Outdated
self.obj[k] = np.nan | ||
|
||
def _get_setitem_indexer(self, key): | ||
self._ensure_listlike_indexer(key, is_indexer_key=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally can you move this to the base class (so don't need to override here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where should overriding happen then? Define _ensure_listlike_indexer
in DataFrame
class too and override _ensure_listlike_indexer
in _LocIndexer
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, I mean move to _NDFrameIndexer (as opposed to _LocIndexer), then you won't have to override, you can just always call it in _get_setitem_indexer; that is the goal here.
we want to avoid special casing in the sub-classes of indexers, and allow _ensure_listlike_indxer to be called always (it just won't do anything if its conditions are not met)
close on this, thanks for working on it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, it's a typo. I wanted to say _NDFrameIndexer
instead of DataFrame
.
How can we ensure that conditions are only met if indexer is a _LocIndexer
?
pandas/core/indexing.py
Outdated
column_axis = 1 | ||
if is_indexer_key: | ||
if not ( | ||
isinstance(key, tuple) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that is better
pandas/core/indexing.py
Outdated
): | ||
return | ||
key = key[column_axis] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment explaning this
pandas/core/indexing.py
Outdated
Whether key is a _LocIndexer key | ||
""" | ||
column_axis = 1 | ||
if is_indexer_key: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think you could remove it if you also check is_list_like(key).
i just find this having to separate the keywords very hard to follow.
pandas/core/indexing.py
Outdated
self.obj[k] = np.nan | ||
|
||
def _get_setitem_indexer(self, key): | ||
self._ensure_listlike_indexer(key, is_indexer_key=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, I mean move to _NDFrameIndexer (as opposed to _LocIndexer), then you won't have to override, you can just always call it in _get_setitem_indexer; that is the goal here.
we want to avoid special casing in the sub-classes of indexers, and allow _ensure_listlike_indxer to be called always (it just won't do anything if its conditions are not met)
close on this, thanks for working on it!
pandas/core/indexing.py
Outdated
Whether key is a _LocIndexer key | ||
""" | ||
column_axis = 1 | ||
if is_indexer_key: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just exclude tuples
you can use self.name == 'loc' if you just want to handle that case.
pls rebase as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm! ping on green.
pandas/core/indexing.py
Outdated
column_axis = 1 | ||
|
||
# check if self.obj is at least 2-dimensional | ||
if len(self.obj.shape) <= column_axis: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be
if self.obj.ndim != 2:
return
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i changed to
if self.ndim != 2
return
@howsiwei some CI failures now, if you have a chance to look. |
@TomAugspurger I'm not sure how to fix it. See #29334 (comment) |
Thanks for the update. I'm not sure either :/ |
@howsiwei is this still active? Can you fix merge conflicts and try to get CI green? Also want to move whatsnew to 1.1.0 at this point |
@WillAyd I'm not sure how to proceed due to #29334 (comment). I can get the CI green with my previous approach ( pandas/pandas/core/indexing.py Line 1731 in 8ccd2b8
.loc[] or [] but that was advised against by @jreback.
|
you can pass axis |
@howsiwei can you merge master to resolve conflicts |
@howsiwei can you fix up merge conflict? Otherwise should be good to merge here |
thanks @howsiwei lgtm. @jbrockmendel if any comments. |
|
||
Parameters | ||
---------- | ||
key : _LocIndexer key or list-like of column labels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think key being a _LocIndexer object makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? _ensure_listlike_indexer
is called at
pandas/pandas/core/indexing.py
Line 586 in d7184e7
self._ensure_listlike_indexer(key) |
---------- | ||
key : _LocIndexer key or list-like of column labels | ||
Target labels. | ||
axis : key axis if known |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like axis isnt used here, is it necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's used at
pandas/pandas/core/indexing.py
Line 642 in d7184e7
axis == column_axis |
For rationale of coding this way, see #29334 (comment).
@@ -2685,6 +2685,7 @@ def _setitem_array(self, key, value): | |||
for k1, k2 in zip(key, value.columns): | |||
self[k1] = value[k2] | |||
else: | |||
self.loc._ensure_listlike_indexer(key, axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like you're also calling this inside _get_setitem_indexer; do you need to also call it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_get_setitem_indexer
is not called in this code path
pandas/core/indexing.py
Outdated
|
||
if ( | ||
axis == column_axis | ||
and not isinstance(self.obj._get_axis(column_axis), ABCMultiIndex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.obj._get_axis(column_axis)
--> self.obj.columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, changed
and not isinstance(self.obj._get_axis(column_axis), ABCMultiIndex) | ||
and is_list_like_indexer(key) | ||
and not com.is_bool_indexer(key) | ||
and all(is_hashable(k) for k in key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the hashable check you can skip and the calls below will take care of it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If no hashable check is performed here, some test cases fail , eg.
pandas/pandas/tests/indexing/test_indexing.py
Line 112 in d7184e7
def test_setitem_ndarray_3d(self, index, obj, idxr, idxr_id): |
@howsiwei this looks good. can you merge master and ping on green (just to make sure) |
@jreback merged master. It's green now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all looks good, but need to move the whatsnew note. ping on green. sorry this has taken a while, we are at the finish line!
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -267,6 +267,35 @@ Indexing | |||
- Bug in :meth:`Series.loc` and :meth:`DataFrame.loc` when indexing with an integer key on a object-dtype :class:`Index` that is not all-integers (:issue:`31905`) | |||
- Bug in :meth:`DataFrame.iloc.__setitem__` on a :class:`DataFrame` with duplicate columns incorrectly setting values for all matching columns (:issue:`15686`, :issue:`22036`) | |||
|
|||
Assignment to multiple columns of a DataFrame when some columns do not exist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this to the section where we od api-breaking changes.
thanks @howsiwei very nice |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Previous PR: #26534
In particular, the following code now behaves correctly.
v0.22: error
master: column
'c'
is converted to index-1
, which causes last column to be overwritten.After this PR: