Skip to content

BUG: assignment to multiple columns when some column do not exist #29334

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Mar 14, 2020

Conversation

howsiwei
Copy link
Contributor

@howsiwei howsiwei commented Nov 2, 2019

Previous PR: #26534

In particular, the following code now behaves correctly.

import pandas as pd
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})
df[['a', 'c']] = 1
print(df)

v0.22: error

master: column 'c' is converted to index -1, which causes last column to be overwritten.

   a  b
0  1  1
1  1  1
2  1  1

After this PR:

   a  b  c
0  1  3  1
1  1  4  1
2  1  5  1

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests look good and conceptually ideas are find, but need to add the code in reasonable places w/o dealing with complexity of core/indexing.py

some suggestions in-line

@@ -3007,6 +3007,12 @@ def _setitem_array(self, key, value):
for k1, k2 in zip(key, value.columns):
self[k1] = value[k2]
else:
if all(is_hashable(k) for k in key):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so can you add this in a routine on self.loc

call it
_ensure_listlike_indexers(key, axis=1)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this code always assume that axis = 1. Do we need to allow adding multiple rows via assignment too?

@gfyoung gfyoung added the Indexing Related to indexing on series/frames, not to indexes themselves label Nov 5, 2019
@@ -165,7 +166,28 @@ def _get_loc(self, key: int, axis: int):
def _slice(self, obj, axis: int, kind=None):
return self.obj._slice(obj, axis=axis, kind=kind)

def _ensure_listlike_indexer(self, key, axis):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls type these arguments & add a doc-string. make sure to indicate that this is a mutating operation.

if (
self.name == "loc" # column is indexed by name
and isinstance(key, tuple)
and len(key) >= 2 # key is at least 2-dimensional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have a lot of condictions here, when are you trying to ensure this is being called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.name == "loc" ensures that this indexer is is indexed by name (yes for loc, no for ix).
isinstance(key, tuple) and len(key) >= 2 filter out indexers with only 1 dimension.
is_list_like_indexer(key[1]) ensures that the indexer is indexed by multiple columns.
not com.is_bool_indexer(key[1]) ensures that the indexer is not a boolean indexer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then do this in the _LocIndexer (and call super); we don't want to dispatch on self.name like this (and .ix is going away shortly anyways).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bigger questions, is why is this needed if you are already doing this for frames above.

Exactly which test hits this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then do this in the _LocIndexer (and call super); we don't want to dispatch on self.name like this (and .ix is going away shortly anyways).

Done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bigger questions, is why is this needed if you are already doing this for frames above.

Exactly which test hits this?

See below.

)
def test_setitem_list_missing_columns(self, float_frame, columns, box):
# GH 26534
result = float_frame.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather you not construct the expected frame like this, rather include it as another argument in the parameterization; essentially hard code it.

@@ -165,7 +166,37 @@ def _get_loc(self, key: int, axis: int):
def _slice(self, obj, axis: int, kind=None):
return self.obj._slice(obj, axis=axis, kind=kind)

def _ensure_listlike_indexer(self, key, axis: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type key (ok for a followon)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how to type key as it can be any listlike type. I was just following the type of _get_listlike_indexer which leaves key untyped too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, it would be really great to type the indexing keys

Indexable=Iterable might be enough for now (you can add/import from pandas._typing)

can this be a scalar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be a scalar?

No.

if (
self.name == "loc" # column is indexed by name
and isinstance(key, tuple)
and len(key) >= 2 # key is at least 2-dimensional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then do this in the _LocIndexer (and call super); we don't want to dispatch on self.name like this (and .ix is going away shortly anyways).

if (
self.name == "loc" # column is indexed by name
and isinstance(key, tuple)
and len(key) >= 2 # key is at least 2-dimensional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bigger questions, is why is this needed if you are already doing this for frames above.

Exactly which test hits this?

@howsiwei
Copy link
Contributor Author

@jreback any more changes required?

@@ -1733,6 +1734,37 @@ def _getitem_axis(self, key, axis: int):
self._validate_key(key, axis)
return self._get_label(key, axis=axis)

def _ensure_listlike_indexer(self, key: Iterable, axis: int):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this by-definition should always be on axis=1, yes? i would remove that as an argument and/or be very explicit about this.

Copy link
Contributor Author

@howsiwei howsiwei Dec 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this by-definition should always be on axis=1, yes?

Yes, axis is always 1.

i would remove that as an argument and/or be very explicit about this.

How to be explicit? Document it in docstring?

@jreback jreback added this to the 1.0 milestone Dec 27, 2019
@jreback jreback added the Bug label Dec 27, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@howsiwei thanks for sticking with it here. indexing is already very complicated, I am trying to ensure your change doesn't make it even more complicated. a few more comments.

Whether key is a _LocIndexer key
"""
column_axis = 1
if is_indexer_key:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you really need is_indexer_key? this seems very odd, what exactly happens if you remove this keyword (specifically what about the below code is a problem)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we need some way to differentiate whether _ensure_listlike_indexer is called from DataFrame::_setitem_array or _LocIndexer::_get_setitem_indexer because keys are in different formats depending on the caller. I have briefly explained this at #29334 (comment).

Notice that at if is_indexer_key is true, then I assign

key = key[column_axis]
after some checks. So is_indexer_key is absolutely essential and the code below the line doesn't make sense if is_indexer_key is removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you could remove it if you also check is_list_like(key).

i just find this having to separate the keywords very hard to follow.

Copy link
Contributor Author

@howsiwei howsiwei Jan 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_list_like won't work as it returns True for tuple and Series too.

The problem is that _LocIndexer accepts many possible types so it's hard to tell whether key is from _LocIndexer using the type of key alone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just exclude tuples

you can use self.name == 'loc' if you just want to handle that case.

Copy link
Contributor Author

@howsiwei howsiwei Jan 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Currently an error is caused by the difference between df and df.loc. Eg.

>>> df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["a", "b"])
>>> df
   a  b
a  1  3
b  2  4
>>> df[["a"]]
   a
a  1
b  2
>>> df.loc[["a"]]
   a  b
a  1  3

It seems impossible to resolve this without explicitly indicating where key comes from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback any suggestions?

column_axis = 1
if is_indexer_key:
if not (
isinstance(key, tuple)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put a blank line between conditions where you add a comment for each one

Copy link
Contributor Author

@howsiwei howsiwei Dec 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like this?

if not (
    isinstance(key, tuple)

    # key is at least 2-dimensional
    and len(key) >= 2

    # key indexes multiple columns
    and is_list_like_indexer(key[column_axis])

    and not com.is_bool_indexer(key[column_axis])
):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that is better

self.obj[k] = np.nan

def _get_setitem_indexer(self, key):
self._ensure_listlike_indexer(key, is_indexer_key=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally can you move this to the base class (so don't need to override here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where should overriding happen then? Define _ensure_listlike_indexer in DataFrame class too and override _ensure_listlike_indexer in _LocIndexer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I mean move to _NDFrameIndexer (as opposed to _LocIndexer), then you won't have to override, you can just always call it in _get_setitem_indexer; that is the goal here.

we want to avoid special casing in the sub-classes of indexers, and allow _ensure_listlike_indxer to be called always (it just won't do anything if its conditions are not met)

close on this, thanks for working on it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it's a typo. I wanted to say _NDFrameIndexer instead of DataFrame.

How can we ensure that conditions are only met if indexer is a _LocIndexer?

column_axis = 1
if is_indexer_key:
if not (
isinstance(key, tuple)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that is better

):
return
key = key[column_axis]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment explaning this

Whether key is a _LocIndexer key
"""
column_axis = 1
if is_indexer_key:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you could remove it if you also check is_list_like(key).

i just find this having to separate the keywords very hard to follow.

self.obj[k] = np.nan

def _get_setitem_indexer(self, key):
self._ensure_listlike_indexer(key, is_indexer_key=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, I mean move to _NDFrameIndexer (as opposed to _LocIndexer), then you won't have to override, you can just always call it in _get_setitem_indexer; that is the goal here.

we want to avoid special casing in the sub-classes of indexers, and allow _ensure_listlike_indxer to be called always (it just won't do anything if its conditions are not met)

close on this, thanks for working on it!

Whether key is a _LocIndexer key
"""
column_axis = 1
if is_indexer_key:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just exclude tuples

you can use self.name == 'loc' if you just want to handle that case.

@jreback
Copy link
Contributor

jreback commented Jan 6, 2020

pls rebase as well

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! ping on green.

column_axis = 1

# check if self.obj is at least 2-dimensional
if len(self.obj.shape) <= column_axis:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be

if self.obj.ndim != 2:
    return

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i changed to

if self.ndim != 2
    return

@TomAugspurger
Copy link
Contributor

@howsiwei some CI failures now, if you have a chance to look.

@howsiwei
Copy link
Contributor Author

howsiwei commented Jan 9, 2020

@TomAugspurger I'm not sure how to fix it. See #29334 (comment)

@TomAugspurger
Copy link
Contributor

Thanks for the update. I'm not sure either :/

@TomAugspurger TomAugspurger modified the milestones: 1.0, 1.1 Jan 9, 2020
@WillAyd
Copy link
Member

WillAyd commented Feb 12, 2020

@howsiwei is this still active? Can you fix merge conflicts and try to get CI green?

Also want to move whatsnew to 1.1.0 at this point

@howsiwei
Copy link
Contributor Author

@WillAyd I'm not sure how to proceed due to #29334 (comment). I can get the CI green with my previous approach (

def _ensure_listlike_indexer(self, key, is_indexer_key: bool):
) of having different checks depending on whether the DataFrame is set by .loc[] or [] but that was advised against by @jreback.

@jreback
Copy link
Contributor

jreback commented Feb 14, 2020

@WillAyd I'm not sure how to proceed due to #29334 (comment). I can get the CI green with my previous approach (

def _ensure_listlike_indexer(self, key, is_indexer_key: bool):

) of having different checks depending on whether the DataFrame is set by .loc[] or [] but that was advised against by @jreback.

you can pass axis

@simonjayhawkins
Copy link
Member

@howsiwei can you merge master to resolve conflicts

@WillAyd
Copy link
Member

WillAyd commented Feb 27, 2020

@howsiwei can you fix up merge conflict? Otherwise should be good to merge here

@jreback
Copy link
Contributor

jreback commented Mar 3, 2020

thanks @howsiwei lgtm.

@jbrockmendel if any comments.


Parameters
----------
key : _LocIndexer key or list-like of column labels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think key being a _LocIndexer object makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? _ensure_listlike_indexer is called at

self._ensure_listlike_indexer(key)

----------
key : _LocIndexer key or list-like of column labels
Target labels.
axis : key axis if known
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like axis isnt used here, is it necessary?

Copy link
Contributor Author

@howsiwei howsiwei Mar 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's used at

axis == column_axis

For rationale of coding this way, see #29334 (comment).

@@ -2685,6 +2685,7 @@ def _setitem_array(self, key, value):
for k1, k2 in zip(key, value.columns):
self[k1] = value[k2]
else:
self.loc._ensure_listlike_indexer(key, axis=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like you're also calling this inside _get_setitem_indexer; do you need to also call it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_get_setitem_indexer is not called in this code path


if (
axis == column_axis
and not isinstance(self.obj._get_axis(column_axis), ABCMultiIndex)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.obj._get_axis(column_axis) --> self.obj.columns

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, changed

and not isinstance(self.obj._get_axis(column_axis), ABCMultiIndex)
and is_list_like_indexer(key)
and not com.is_bool_indexer(key)
and all(is_hashable(k) for k in key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the hashable check you can skip and the calls below will take care of it

Copy link
Contributor Author

@howsiwei howsiwei Mar 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no hashable check is performed here, some test cases fail , eg.

def test_setitem_ndarray_3d(self, index, obj, idxr, idxr_id):
because different errors are raised compared to the original code when some keys are not hashable.

@jreback
Copy link
Contributor

jreback commented Mar 11, 2020

@howsiwei this looks good. can you merge master and ping on green (just to make sure)

@howsiwei
Copy link
Contributor Author

@jreback merged master. It's green now.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@howsiwei

all looks good, but need to move the whatsnew note. ping on green. sorry this has taken a while, we are at the finish line!

@@ -267,6 +267,35 @@ Indexing
- Bug in :meth:`Series.loc` and :meth:`DataFrame.loc` when indexing with an integer key on a object-dtype :class:`Index` that is not all-integers (:issue:`31905`)
- Bug in :meth:`DataFrame.iloc.__setitem__` on a :class:`DataFrame` with duplicate columns incorrectly setting values for all matching columns (:issue:`15686`, :issue:`22036`)

Assignment to multiple columns of a DataFrame when some columns do not exist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to the section where we od api-breaking changes.

@jreback jreback merged commit 810a4e5 into pandas-dev:master Mar 14, 2020
@jreback
Copy link
Contributor

jreback commented Mar 14, 2020

thanks @howsiwei very nice
thanks for sticking with it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Assignment to multiple columns only works if they existed before
7 participants