BUG: assignment to multiple columns when some column do not exist #29334

howsiwei · 2019-11-02T03:19:42Z

closes Assignment to multiple columns only works if they existed before #13658
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Previous PR: #26534

In particular, the following code now behaves correctly.

import pandas as pd
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})
df[['a', 'c']] = 1
print(df)

v0.22: error

master: column 'c' is converted to index -1, which causes the last column to be overwritten.

After this PR:
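A minimal sketch of the intended result, assuming a pandas release that includes this change (the whatsnew entry ended up in v1.1.0). The existing column `a` is overwritten, `b` is left alone, and the missing column `c` is created:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": [3, 4, 5]})

# "a" already exists and "c" does not; both receive the scalar value.
df[["a", "c"]] = 1

print(df)
```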

jreback

tests look good and conceptually the ideas are fine, but need to add the code in reasonable places w/o dealing with the complexity of core/indexing.py

some suggestions in-line

pandas/core/indexing.py

jreback · 2019-11-02T20:54:02Z

pandas/core/frame.py

@@ -3007,6 +3007,12 @@ def _setitem_array(self, key, value):
                for k1, k2 in zip(key, value.columns):
                    self[k1] = value[k2]
            else:
+                if all(is_hashable(k) for k in key):


so can you add this in a routine on self.loc

call it
_ensure_listlike_indexers(key, axis=1)

Currently this code always assumes that axis = 1. Do we need to allow adding multiple rows via assignment too?

jreback · 2019-12-01T23:05:07Z

pandas/core/indexing.py

@@ -165,7 +166,28 @@ def _get_loc(self, key: int, axis: int):
    def _slice(self, obj, axis: int, kind=None):
        return self.obj._slice(obj, axis=axis, kind=kind)

+    def _ensure_listlike_indexer(self, key, axis):


pls type these arguments & add a doc-string. make sure to indicate that this is a mutating operation.

jreback · 2019-12-01T23:05:55Z

pandas/core/indexing.py

+        if (
+            self.name == "loc"  # column is indexed by name
+            and isinstance(key, tuple)
+            and len(key) >= 2  # key is at least 2-dimensional


you have a lot of conditions here; when are you trying to ensure this is being called?

self.name == "loc" ensures that this indexer is indexed by name (yes for loc, no for ix).
isinstance(key, tuple) and len(key) >= 2 filter out indexers with only 1 dimension.
is_list_like_indexer(key[1]) ensures that the indexer is indexed by multiple columns.
not com.is_bool_indexer(key[1]) ensures that the indexer is not a boolean indexer.
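The key-shape checks listed above can be sketched as a standalone predicate. This is only an illustration, not the code in the PR: `is_multi_column_label_key` is a hypothetical name, it uses the public `pandas.api.types.is_list_like` instead of the internal helpers, and it leaves out the `self.name == "loc"` check because it has no indexer object to inspect.

```python
import numpy as np
from pandas.api.types import is_list_like


def is_multi_column_label_key(key) -> bool:
    """Hypothetical helper: True if key looks like .loc[rows, [col, ...]]."""
    # A key that is not a tuple of length >= 2 has no separate column part.
    if not (isinstance(key, tuple) and len(key) >= 2):
        return False
    columns = key[1]
    # A scalar label (or a slice) does not name several columns.
    if not is_list_like(columns):
        return False
    columns = list(columns)
    # A boolean mask picks existing columns; it never names new ones.
    if columns and all(isinstance(c, (bool, np.bool_)) for c in columns):
        return False
    return True


print(is_multi_column_label_key((slice(None), ["a", "c"])))
print(is_multi_column_label_key(["a", "c"]))
```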

then do this in the _LocIndexer (and call super); we don't want to dispatch on self.name like this (and .ix is going away shortly anyways).

bigger questions, is why is this needed if you are already doing this for frames above.

Exactly which test hits this?

then do this in the _LocIndexer (and call super); we don't want to dispatch on self.name like this (and .ix is going away shortly anyways).

Done.

bigger questions, is why is this needed if you are already doing this for frames above.

Exactly which test hits this?

See below.

jreback · 2019-12-01T23:07:28Z

pandas/tests/frame/indexing/test_indexing.py

+    )
+    def test_setitem_list_missing_columns(self, float_frame, columns, box):
+        # GH 26534
+        result = float_frame.copy()


I would rather you not construct the expected frame like this, rather include it as another argument in the parameterization; essentially hard code it.
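A rough sketch of what hard-coding the expected result could look like. The case table and the `check_setitem_missing_columns` name are made up for illustration; under pytest the same tuples would be passed to `pytest.mark.parametrize`. It assumes a pandas release that includes this change.

```python
import numpy as np
import pandas as pd

# Each case: (columns to assign, value, expected frame written out literally).
CASES = [
    (["a", "c"], 1, {"a": [1, 1], "b": [3, 4], "c": [1, 1]}),
    (
        ["b", "c"],
        np.array([[10, 20], [30, 40]]),
        {"a": [0, 1], "b": [10, 30], "c": [20, 40]},
    ),
]


def check_setitem_missing_columns(columns, value, expected):
    df = pd.DataFrame({"a": [0, 1], "b": [3, 4]})
    df[columns] = value
    # Compare values only; the dtype of a newly created column may differ
    # between pandas versions.
    assert df.to_dict(orient="list") == expected


for case in CASES:
    check_setitem_missing_columns(*case)
```

Writing the expected values out means the test no longer depends on the same indexing machinery it is supposed to verify.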

jreback · 2019-12-10T13:16:54Z

pandas/core/indexing.py

@@ -165,7 +166,37 @@ def _get_loc(self, key: int, axis: int):
    def _slice(self, obj, axis: int, kind=None):
        return self.obj._slice(obj, axis=axis, kind=kind)

+    def _ensure_listlike_indexer(self, key, axis: int):


type key (ok for a followon)

Not sure how to type key as it can be any listlike type. I was just following the type of _get_listlike_indexer which leaves key untyped too.

well, it would be really great to type the indexing keys

Indexable=Iterable might be enough for now (you can add/import from pandas._typing)

can this be a scalar?

can this be a scalar?

No.


pandas/tests/indexing/test_loc.py

howsiwei · 2019-12-17T10:59:38Z

@jreback any more changes required?

jreback · 2019-12-17T13:49:50Z

pandas/core/indexing.py

@@ -1733,6 +1734,37 @@ def _getitem_axis(self, key, axis: int):
        self._validate_key(key, axis)
        return self._get_label(key, axis=axis)

+    def _ensure_listlike_indexer(self, key: Iterable, axis: int):


this by-definition should always be on axis=1, yes? i would remove that as an argument and/or be very explicit about this.

this by-definition should always be on axis=1, yes?

Yes, axis is always 1.

i would remove that as an argument and/or be very explicit about this.

How to be explicit? Document it in docstring?

pandas/core/indexing.py

jreback

@howsiwei thanks for sticking with it here. indexing is already very complicated, I am trying to ensure your change doesn't make it even more complicated. a few more comments.

jreback · 2019-12-27T16:42:29Z

pandas/core/indexing.py

+            Whether key is a _LocIndexer key
+        """
+        column_axis = 1
+        if is_indexer_key:


do you really need is_indexer_key? this seems very odd, what exactly happens if you remove this keyword (specifically what about the below code is a problem)

Well we need some way to differentiate whether _ensure_listlike_indexer is called from DataFrame::_setitem_array or _LocIndexer::_get_setitem_indexer because keys are in different formats depending on the caller. I have briefly explained this at #29334 (comment).

Notice that if is_indexer_key is true, then I assign

key = key[column_axis]

(pandas/core/indexing.py, line 1754 in 8ccd2b8) after some checks. So is_indexer_key is absolutely essential, and the code below that line doesn't make sense if is_indexer_key is removed.

i think you could remove it if you also check is_list_like(key).

i just find this having to separate the keywords very hard to follow.

is_list_like won't work as it returns True for tuple and Series too.

The problem is that _LocIndexer accepts many possible types so it's hard to tell whether key is from _LocIndexer using the type of key alone.

just exclude tuples

you can use self.name == 'loc' if you just want to handle that case.

@jreback Currently an error is caused by the difference between df and df.loc. Eg.

>>> df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["a", "b"])
>>> df
   a  b
a  1  3
b  2  4
>>> df[["a"]]
   a
a  1
b  2
>>> df.loc[["a"]]
   a  b
a  1  3

It seems impossible to resolve this without explicitly indicating where key comes from?

@jreback any suggestions?

jreback · 2019-12-27T16:42:45Z

pandas/core/indexing.py

+        column_axis = 1
+        if is_indexer_key:
+            if not (
+                isinstance(key, tuple)


can you put a blank line between conditions where you add a comment for each one

Like this?

if not (
    isinstance(key, tuple)
    # key is at least 2-dimensional
    and len(key) >= 2
    # key indexes multiple columns
    and is_list_like_indexer(key[column_axis])
    and not com.is_bool_indexer(key[column_axis])
):

yes that is better

jreback · 2019-12-27T16:43:44Z

pandas/core/indexing.py

+                    self.obj[k] = np.nan
+
+    def _get_setitem_indexer(self, key):
+        self._ensure_listlike_indexer(key, is_indexer_key=True)


ideally can you move this to the base class (so don't need to override here)

Where should overriding happen then? Define _ensure_listlike_indexer in DataFrame class too and override _ensure_listlike_indexer in _LocIndexer?

no, I mean move to _NDFrameIndexer (as opposed to _LocIndexer), then you won't have to override, you can just always call it in _get_setitem_indexer; that is the goal here.

we want to avoid special casing in the sub-classes of indexers, and allow _ensure_listlike_indexer to be called always (it just won't do anything if its conditions are not met)

close on this, thanks for working on it!

Sorry, it's a typo. I wanted to say _NDFrameIndexer instead of DataFrame.

How can we ensure that conditions are only met if indexer is a _LocIndexer?

jreback · 2020-01-01T16:11:57Z

pandas/core/indexing.py

+            ):
+                return
+            key = key[column_axis]
+


add a comment explaining this

doc/source/whatsnew/v1.0.0.rst

jreback · 2020-01-06T13:31:37Z

pls rebase as well

jreback

lgtm! ping on green.

jreback · 2020-01-09T03:15:26Z

pandas/core/indexing.py

+        column_axis = 1
+
+        # check if self.obj is at least 2-dimensional
+        if len(self.obj.shape) <= column_axis:


shouldn't this be

if self.obj.ndim != 2: return

?

i changed to

if self.ndim != 2: return
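For the two object types involved here, both spellings of the check agree. A small sketch, assuming only Series and DataFrame need to be considered:

```python
import pandas as pd

ser = pd.Series([1, 2])
df = pd.DataFrame({"a": [1, 2]})

column_axis = 1
for obj in (ser, df):
    # Both expressions answer "does this object lack a column axis?"
    assert (len(obj.shape) <= column_axis) == (obj.ndim != 2)

print(ser.ndim, df.ndim)
```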

TomAugspurger · 2020-01-09T15:14:54Z

@howsiwei some CI failures now, if you have a chance to look.

howsiwei · 2020-01-09T15:23:00Z

@TomAugspurger I'm not sure how to fix it. See #29334 (comment)

TomAugspurger · 2020-01-09T19:24:53Z

Thanks for the update. I'm not sure either :/

WillAyd · 2020-02-12T15:59:55Z

@howsiwei is this still active? Can you fix merge conflicts and try to get CI green?

Also want to move whatsnew to 1.1.0 at this point

howsiwei · 2020-02-14T09:06:54Z

@WillAyd I'm not sure how to proceed due to #29334 (comment). I can get the CI green with my previous approach (pandas/core/indexing.py, line 1731 in 8ccd2b8: def _ensure_listlike_indexer(self, key, is_indexer_key: bool):) of having different checks depending on whether the DataFrame is set by .loc[] or [], but that was advised against by @jreback.

jreback · 2020-02-14T11:43:22Z

you can pass axis
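One way to read this suggestion, as a sketch rather than the merged implementation: a tuple key can only come from `.loc`, so its column part is unambiguous, while a bare list is treated as column labels only when the caller says so through `axis`. `ensure_columns_exist` and `COLUMN_AXIS` are hypothetical names, and the helper takes the frame as an argument instead of living on the indexer.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_hashable, is_list_like

COLUMN_AXIS = 1


def ensure_columns_exist(df: pd.DataFrame, key, axis=None) -> None:
    """Hypothetical stand-in: add missing columns named in key (mutates df)."""
    if isinstance(key, tuple):
        # .loc-style key such as (rows, columns): take the column part.
        if len(key) <= COLUMN_AXIS:
            return
        key, axis = key[COLUMN_AXIS], COLUMN_AXIS
    if axis != COLUMN_AXIS or not is_list_like(key):
        # A bare list without an axis could be row labels, so do nothing.
        return
    if isinstance(df.columns, pd.MultiIndex):
        return
    labels = list(key)
    if any(isinstance(k, (bool, np.bool_)) for k in labels):
        return  # boolean mask, not labels
    if not all(is_hashable(k) for k in labels):
        return
    for label in labels:
        if label not in df.columns:
            df[label] = np.nan


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["a", "b"])
ensure_columns_exist(df, ["a", "z"])            # axis unknown: no change
ensure_columns_exist(df, ["a", "c"], axis=1)    # df[["a", "c"]] = ... path
ensure_columns_exist(df, (slice(None), ["d"]))  # df.loc[:, ["d"]] = ... path
print(list(df.columns))
```

With this shape the `[]` caller passes `axis=1` explicitly and the `.loc` caller passes nothing, so no `is_indexer_key` flag is needed.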

simonjayhawkins · 2020-02-15T21:30:04Z

@howsiwei can you merge master to resolve conflicts

WillAyd · 2020-02-27T23:10:39Z

@howsiwei can you fix up merge conflict? Otherwise should be good to merge here

jreback · 2020-03-03T03:08:42Z

thanks @howsiwei lgtm.

@jbrockmendel if any comments.

jbrockmendel · 2020-03-04T03:27:32Z

pandas/core/indexing.py

+
+        Parameters
+        ----------
+        key : _LocIndexer key or list-like of column labels


i dont think key being a _LocIndexer object makes sense

Why? _ensure_listlike_indexer is called at pandas/core/indexing.py, line 586 in d7184e7:

self._ensure_listlike_indexer(key)

jbrockmendel · 2020-03-04T03:28:54Z

pandas/core/indexing.py

+        ----------
+        key : _LocIndexer key or list-like of column labels
+            Target labels.
+        axis : key axis if known


it looks like axis isnt used here, is it necessary?

Yes, it's used at pandas/core/indexing.py, line 642 in d7184e7:

axis == column_axis

For the rationale behind coding it this way, see #29334 (comment).

jbrockmendel · 2020-03-04T03:30:01Z

pandas/core/frame.py

@@ -2685,6 +2685,7 @@ def _setitem_array(self, key, value):
                for k1, k2 in zip(key, value.columns):
                    self[k1] = value[k2]
            else:
+                self.loc._ensure_listlike_indexer(key, axis=1)


it looks like you're also calling this inside _get_setitem_indexer; do you need to also call it here?

_get_setitem_indexer is not called in this code path

jbrockmendel · 2020-03-04T03:31:48Z

pandas/core/indexing.py

+
+        if (
+            axis == column_axis
+            and not isinstance(self.obj._get_axis(column_axis), ABCMultiIndex)


self.obj._get_axis(column_axis) --> self.obj.columns

Ok, changed

jbrockmendel · 2020-03-04T03:32:30Z

pandas/core/indexing.py

+            and not isinstance(self.obj._get_axis(column_axis), ABCMultiIndex)
+            and is_list_like_indexer(key)
+            and not com.is_bool_indexer(key)
+            and all(is_hashable(k) for k in key)


i think the hashable check you can skip and the calls below will take care of it

If no hashable check is performed here, some test cases fail, e.g. pandas/tests/indexing/test_indexing.py, line 112 in d7184e7:

def test_setitem_ndarray_3d(self, index, obj, idxr, idxr_id):

because different errors are raised compared to the original code when some keys are not hashable.
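A small illustration of why the guard matters for that kind of key, using the public `pandas.api.types.is_hashable`. The 3-D array here is a made-up stand-in for what the test passes: iterating over it yields 2-D arrays, which are not hashable and so cannot be column labels.

```python
import numpy as np
from pandas.api.types import is_hashable

label_key = ["a", "c"]
array_key = np.zeros((2, 2, 2))  # iterating yields 2-D arrays

# Only the first key passes the guard; the second is left for the
# existing code path to reject with its usual error.
print(all(is_hashable(k) for k in label_key))
print(all(is_hashable(k) for k in array_key))
```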

jreback · 2020-03-11T18:30:47Z

@howsiwei this looks good. can you merge master and ping on green (just to make sure)

howsiwei · 2020-03-14T11:31:13Z

@jreback merged master. It's green now.

jreback

@howsiwei

all looks good, but need to move the whatsnew note. ping on green. sorry this has taken a while, we are at the finish line!

jreback · 2020-03-14T15:47:19Z

doc/source/whatsnew/v1.1.0.rst

@@ -267,6 +267,35 @@ Indexing
 - Bug in :meth:`Series.loc` and :meth:`DataFrame.loc` when indexing with an integer key on a object-dtype :class:`Index` that is not all-integers (:issue:`31905`)
 - Bug in :meth:`DataFrame.iloc.__setitem__` on a :class:`DataFrame` with duplicate columns incorrectly setting values for all matching columns (:issue:`15686`, :issue:`22036`)

+Assignment to multiple columns of a DataFrame when some columns do not exist


move this to the section where we do api-breaking changes.

jreback · 2020-03-14T20:31:48Z

thanks @howsiwei very nice
thanks for sticking with it!

doc/source/whatsnew/v1.1.0.rst

howsiwei mentioned this pull request Nov 2, 2019

BUG: assignment to multiple columns when some column do not exist #26534

Closed

4 tasks

howsiwei force-pushed the assign branch 3 times, most recently from 6e4fe48 to 26bd7cd Compare November 2, 2019 06:02

jreback requested changes Nov 2, 2019

gfyoung added the Indexing Related to indexing on series/frames, not to indexes themselves label Nov 5, 2019

jreback requested changes Dec 1, 2019

jreback requested changes Dec 10, 2019

jreback requested changes Dec 17, 2019

jreback added this to the 1.0 milestone Dec 27, 2019

jreback added the Bug label Dec 27, 2019

jreback requested changes Dec 27, 2019

jreback requested changes Jan 1, 2020

jreback requested changes Jan 3, 2020

jreback approved these changes Jan 9, 2020

TomAugspurger modified the milestones: 1.0, 1.1 Jan 9, 2020

howsiwei force-pushed the assign branch from b126b75 to bb05060 Compare February 19, 2020 16:55

howsiwei added 2 commits February 29, 2020 15:33

Add tests for setting missing columns

25ad422

Fix assignment to missing columns

b8d3c48

howsiwei added 2 commits February 29, 2020 15:38

Add whatsnew

660d0f2

Pass axis to _ensure_listlike_indexer

d7184e7

howsiwei force-pushed the assign branch from bb05060 to d7184e7 Compare February 29, 2020 07:40

jbrockmendel reviewed Mar 4, 2020

Use DataFrame.columns

a509eb5

jreback mentioned this pull request Mar 11, 2020

BUG: Fix issue with datetime[ns, tz] input in Block.setitem GH32395 #32479

Merged

5 tasks

Merge remote-tracking branch 'upstream/master' into assign

e908282

jreback reviewed Mar 14, 2020

Update documentation

26ba2a9

jreback merged commit 810a4e5 into pandas-dev:master Mar 14, 2020

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020

BUG: assignment to multiple columns when some column do not exist (pa…

d392d4f

…ndas-dev#29334)

jbrockmendel reviewed Mar 24, 2020

doc/source/whatsnew/v1.1.0.rst

jorisvandenbossche mentioned this pull request Jul 8, 2020

Check API changes in whatsnew #34801

Closed

simonjayhawkins mentioned this pull request Nov 12, 2020

BUG: errors with .loc[(slice(...), ), ] when modifying a subset of rows in a pandas dataframe/series in 1.1.4 #37711

Closed
