BUG: assignment to multiple columns when some column do not exist #26534

howsiwei · 2019-05-27T05:21:08Z

closes Assignment to multiple columns only works if they existed before #13658
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

In particular, the following code now behaves correctly.

import pandas as pd
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})
df[['a', 'c']] = 1
print(df)

v0.22: error

master: column 'c' is converted to index -1, which causes the last column to be overwritten.
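For readers unfamiliar with the -1 convention, here is a minimal sketch of how a missing label can end up addressing the last column. It only illustrates the sentinel returned by `Index.get_indexer`; that the old setitem path went through exactly these calls is an assumption, not something stated in this PR.

```python
import pandas as pd

columns = pd.Index(["a", "b"])

# Index.get_indexer maps labels to positions and reports -1 for labels
# that are not present in the index.
positions = columns.get_indexer(["a", "c"])
print(positions.tolist())  # [0, -1]

# If that result is then used as plain positional indices, -1 is not an
# error: it silently selects the last element, here column 'b'.
selected = columns.to_numpy()[positions]
print(list(selected))
```

Any code that consumes such an indexer has to treat -1 as "missing" before using it positionally.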

After this PR:

codecov · 2019-05-27T05:59:23Z

Codecov Report

Merging #26534 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26534      +/-   ##
==========================================
- Coverage   91.76%   91.76%   -0.01%     
==========================================
  Files         174      174              
  Lines       50629    50632       +3     
==========================================
- Hits        46462    46461       -1     
- Misses       4167     4171       +4

Flag	Coverage Δ
#multiple	`90.3% <100%> (ø)`	⬆️
#single	`41.68% <0%> (-0.09%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.01% <100%> (-0.12%)`	⬇️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a516c1...7bc1856. Read the comment docs.

codecov · 2019-05-27T05:59:23Z

Codecov Report

❗ No coverage uploaded for pull request base (master@24bd67e). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #26534   +/-   ##
=========================================
  Coverage          ?   91.87%           
=========================================
  Files             ?      174           
  Lines             ?    50701           
  Branches          ?        0           
=========================================
  Hits              ?    46581           
  Misses            ?     4120           
  Partials          ?        0

Flag	Coverage Δ
#multiple	`90.41% <100%> (?)`
#single	`41.79% <57.69%> (?)`

Impacted Files	Coverage Δ
pandas/core/frame.py	`97% <100%> (ø)`
pandas/core/indexing.py	`93.6% <100%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24bd67e...3054b6d. Read the comment docs.

pep8speaks · 2019-05-27T09:55:37Z

Hello @howsiwei! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-18 09:49:14 UTC

howsiwei · 2019-05-27T10:09:49Z

@jreback any feedback?

simonjayhawkins

@howsiwei Thanks for the PR.

IIUC the issue relates to the scalar assignment to multiple columns of a DataFrame using setitem only.

test_loc_setitem_corner in pandas\tests\series\indexing\test_loc.py is currently failing. Changing the behavior of Series assignment is not covered in the scope of the issue.

pandas/tests/frame/test_indexing.py

jreback

this is adding quite a bit of code; please try to streamline and not special case

jreback · 2019-05-27T14:45:58Z

doc/source/whatsnew/v0.25.0.rst

@@ -425,6 +425,7 @@ Indexing
 - Bug in which :meth:`DataFrame.to_csv` caused a segfault for a reindexed data frame, when the indices were single-level :class:`MultiIndex` (:issue:`26303`).
 - Fixed bug where assigning a :class:`arrays.PandasArray` to a :class:`pandas.core.frame.DataFrame` would raise error (:issue:`26390`)
 - Allow keyword arguments for callable local reference used in the :method:`DataFrame.query` string (:issue:`26426`)
+- Bug in assignment to multiple columns of a `DataFrame` when some of the columns do not exist (:issue:`13658`)


use :class:`DataFrame` ; this is assignment with a scalar, yes?

This should allow the same set of values as when the columns are present, eg. scalar, 1d array, 2d array, dataframe.

use :class:`DataFrame`

can you update.
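To make the requested behaviour concrete, the sketch below assigns a scalar, a 2d array and a DataFrame to a column list that contains one existing and one missing column. It assumes a pandas release that already includes the fix; the helper name `assign` and the sample data are illustrative only.

```python
import numpy as np
import pandas as pd


def assign(value):
    """Assign `value` to one existing ('a') and one missing ('c') column."""
    df = pd.DataFrame({"a": [0, 1, 2], "b": [3, 4, 5]})
    df[["a", "c"]] = value
    return df


# Scalar: broadcast to every row of both columns.
scalar_result = assign(1)

# 2d array: one array column per target column, in order.
array_result = assign(np.array([[10, 11], [20, 21], [30, 31]]))

# DataFrame: columns are matched by position, not by name.
frame_result = assign(pd.DataFrame({"x": [7, 8, 9], "y": [1, 1, 1]}))

print(scalar_result)
```

In each case column 'b' should be left untouched and 'c' should be appended as a new column.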

howsiwei · 2019-05-28T01:35:02Z

Hi, thanks for the feedback. I have updated the PR with much cleaner code.

pandas/tests/frame/test_indexing.py

simonjayhawkins

tests are much easier to read. thanks.

a few minor comments on the tests

we are also going to need another test for scalar assignment with some missing columns to cover the other issue. unless we have one.

do we have a test to check these cases raise a warning on getting after this change?

I think we should also have a test to now check that the message is not raised when setting (or do we have a test for this?)

not looked at the actual code, too many tests are currently failing.

simonjayhawkins · 2019-06-01T17:14:54Z

doc/source/whatsnew/v0.25.0.rst

@@ -425,6 +425,7 @@ Indexing
 - Bug in which :meth:`DataFrame.to_csv` caused a segfault for a reindexed data frame, when the indices were single-level :class:`MultiIndex` (:issue:`26303`).
 - Fixed bug where assigning a :class:`arrays.PandasArray` to a :class:`pandas.core.frame.DataFrame` would raise error (:issue:`26390`)
 - Allow keyword arguments for callable local reference used in the :method:`DataFrame.query` string (:issue:`26426`)
+- Bug in assignment to multiple columns of a `DataFrame` when some of the columns do not exist (:issue:`13658`)


use :class:`DataFrame`

can you update.

pandas/tests/frame/test_indexing.py

jreback · 2019-06-03T11:55:19Z

pandas/core/indexing.py

-            labels = item_labels[info_idx]
+            if has_missing_columns:
+                labels = [idx['key'] if isinstance(idx, dict) else
+                          item_labels[idx]


@howsiwei you are adding WAY too much complexity here. You need to see the minimal change set to make this change work, it is likely way simpler.

trace other setter calls and see

@jreback mind suggesting where I should make the changes? The code for setting values is quite complicated and it's not obvious to me at all how to achieve the desired result with fewer changes.

@jreback since value can have multiple types (eg. scalar, 1d array, 2d array, DataFrame), it's not easy to split value into value for each column. The only way I see to not duplicate the value splitting code is to modify the helper function setter, where value has already been split. But I don't see how the changes can be made much simpler when modifying this way.

@jreback any updates?

@jreback thanks for the suggestion. I'm guessing what you mean is that we could simply insert the column before setting. There are 2 downsides to this approach though:

It may be less efficient.

More importantly, currently pandas would try to infer the fill values for a column. Consider the following code:

import pandas as pd
df = pd.DataFrame({'a': [0, 0]})
df.loc[0, 'b'] = pd.Timestamp('20120101')
df.loc[0, 'c'] = 1.0
print(df)

Output:

   a                    b    c
0  0  2012-01-01 00:00:00  1.0
1  0                  NaT  NaN

Note that column b and c are filled with NaT and NaN respectively.

To infer the fill values for each column, we need to split the values into separate columns first. But this splitting is only done after line 465.

we already do this for expansions

my point is you are introducing another code path here

so either remove and fix the existing one (non trivial) or integrate this change

the dtype changes are expected in the current implementation

we already do this for expansions

Do what?

the dtype changes are expected in the current implementation

What do you mean? Where does the dtype change?

@jreback Hmm after reading this again I think I understand what you said. Do you mean that using NaN as fill values for all columns is fine because this is already done elsewhere in pandas code? What are expansions though?

@jreback can you please confirm if I should just fix this issue by inserting columns filled with NaN first?

jreback · 2019-07-02T00:48:50Z

pandas/tests/frame/test_indexing.py

-                   r" \[columns\]")
-            with pytest.raises(KeyError, match=msg):
-                self.frame.ix[:, ['E']] = 1
+            # partial setting now allows this GH13658


don't comment out, simply remove

I commented out the test following the test immediately below on line 1136. Should it be removed too?

yes try that

yes try that

Yes for this question too?

@jreback can you please confirm if I should just fix this issue by inserting columns filled with NaN first?

jreback · 2019-07-02T00:54:42Z

pandas/core/indexing.py

-            labels = item_labels[info_idx]
+            if has_missing_columns:
+                labels = [idx['key'] if isinstance(idx, dict) else
+                          item_labels[idx]


so this is what I would do. I would touch no code here at all; rather add what you need into the loop above around 328. This handles the missing indexer case.

Then we can assert that we have no missing items at all, IOW what you have at line 307 (what you define as has_missing_columns), could be true there but MUST be false at line 465.

This entire setting routine is very complex. We want to make this simpler and easier to understand; that may take a bigger refactor, but let's start by not adding additional complexity as much as possible.

i know this is non-trivial, but I am holding the line here because this is already pretty impenetrable code.

love for you to make it better!
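The shape of that suggestion (handle missing columns up front, then assert that nothing is missing before the actual set) can be sketched roughly as follows. `ensure_columns_exist` is a hypothetical helper written for this illustration and is not the code in pandas/core/indexing.py; using NaN as the fill value is likewise an assumption.

```python
import numpy as np
import pandas as pd


def ensure_columns_exist(df, keys):
    """Hypothetical helper: append an all-NaN column for each missing key."""
    for key in keys:
        if key not in df.columns:
            df[key] = np.nan


df = pd.DataFrame({"a": [0, 1, 2], "b": [3, 4, 5]})
keys = ["a", "c"]

# Step 1: expand the frame so that every requested column exists.
ensure_columns_exist(df, keys)

# Step 2: by the time positions are computed, nothing may be missing,
# so the -1 sentinel must not appear.
positions = df.columns.get_indexer(keys)
assert (positions >= 0).all()

# Step 3: the ordinary positional setter can now run unchanged.
df.iloc[:, list(positions)] = 1
print(df)
```

The point of this ordering is that the setter itself needs no extra branch for missing columns.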

howsiwei · 2019-07-23T02:07:46Z

@jreback what do you think about the current changes? Also do you have any idea why it's failing on Linux py37_np_dev?

jreback · 2019-07-23T12:21:35Z

pandas/core/indexing.py

-                return self._get_listlike_indexer(obj, axis, **kwargs)[1]
+                try:
+                    # When setting, missing keys are not allowed, even with # .loc:
+                    kwargs = {"raise_missing": True if is_setter else raise_missing}


you don't want to do this in a try/except as it's extremely error prone. Instead, if you have an is_setter you can just make sure the columns exist (by doing the _convert_to_indexer).

I think if you do this first e.g. almost first in _setitem_with_indexer you won't need to make the code change you have above about taking the split path.

I force take_split_path = True to work around #27583. Otherwise I would fail tests such as test_setitem_list_all_missing_columns_scalar in pandas/tests/frame/test_indexing.py.

jreback · 2019-07-23T12:23:02Z

pandas/tests/frame/test_indexing.py

@@ -208,6 +208,43 @@ def test_setitem_list_of_tuples(self, float_frame):
        expected = Series(tuples, index=float_frame.index, name="tuples")
        assert_series_equal(result, expected)

+    def test_setitem_list_all_missing_columns_scalar(self, float_frame):


can you parameterize these, typically we add an arg called box to do this, eg.

@pytest.mark.parametrize('box', [lambda x: 1, lambda x: [1, 2], lambda x: x[['B', 'C']]]) ...

I'm not sure how to get the expected value for different right hand side. Furthermore, how do I parametrize the left-hand side?
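One possible answer to both questions, offered as a sketch rather than as the test that was eventually merged: stack a second `parametrize` for the left-hand side, and build the expected frame with single-column assignment, which already worked before this PR. The small literal frame stands in for the `float_frame` fixture, and `check_dtype=False` is an assumption made because dtype details may differ between pandas versions.

```python
import pandas as pd
import pytest


@pytest.mark.parametrize("columns", [["A", "E"], ["E", "F"]])
@pytest.mark.parametrize(
    "box",
    [
        lambda df: 1,               # scalar right-hand side
        lambda df: df[["B", "C"]],  # DataFrame right-hand side
    ],
)
def test_setitem_list_missing_columns(columns, box):
    frame = pd.DataFrame({"A": [0.0, 1.0], "B": [2.0, 3.0], "C": [4.0, 5.0]})
    value = box(frame)

    result = frame.copy()
    result[columns] = value

    # Expected frame: assign one column at a time, taking the matching
    # positional column when the value is a DataFrame.
    expected = frame.copy()
    for position, column in enumerate(columns):
        if isinstance(value, pd.DataFrame):
            expected[column] = value.iloc[:, position]
        else:
            expected[column] = value

    pd.testing.assert_frame_equal(result, expected, check_dtype=False)
```

Each `box` receives the frame so that right-hand sides derived from existing columns can be expressed in the same parameter list.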

jreback · 2019-07-23T12:23:12Z

pandas/tests/indexing/test_loc.py

@@ -808,6 +808,46 @@ def test_loc_setitem_with_scalar_index(self, indexer, value):

        assert is_scalar(result) and result == "Z"

+    def test_loc_setitem_missing_columns_scalar_index_list_value(self):


same as above

jreback · 2019-07-23T12:23:51Z

doc/source/whatsnew/v0.25.1.rst

@@ -85,7 +85,7 @@ Interval
 Indexing
 ^^^^^^^^

-
+- Bug in assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist (:issue:`13658`)


this will need a subsection to explain the change. move to 1.0 as this is a non-trivial change of a longstanding issue.

Subsection added.

howsiwei · 2019-08-18T07:21:52Z

@jreback I have moved my entire fix to an earlier stage of setting, as #27604 removes is_setter, which is required in my earlier approach. What do you think about it?

TomAugspurger

Jeff may be offline for a bit. We'll get to this before 1.0 though.

TomAugspurger · 2019-08-20T14:15:03Z

pandas/core/frame.py

@@ -3007,6 +3007,12 @@ def _setitem_array(self, key, value):
                for k1, k2 in zip(key, value.columns):
                    self[k1] = value[k2]
            else:
+                if all(is_hashable(k) for k in key):


Under what cases can elements of key not be hashable? These are our eventual column names, right? Which I think we require they be hashable.

I'm a bit concerned about the performance here, for the case when key is large (which is possible, right?)

The only reason that I check whether key is hashable is to produce the desired errors, which is tested in

pandas/pandas/tests/indexing/test_indexing.py

Lines 180 to 184 in e55b698

with pytest.raises(
    (ValueError, AttributeError, TypeError, pd.core.indexing.IndexingError),
    match=msg,
):
    idxr[nd3] = 0

If producing all these errors is not required then this check can be omitted.

@TomAugspurger what's your thought on this?

What happens if the check is removed?

The error produced is different.

What's the error?

Yea this seems strange to me. I feel like this should be hashable inherently so clarification here would be helpful

Without testing is_hashable,

pandas/pandas/tests/indexing/test_indexing.py

Line 184 in 3c0cf22

idxr[nd3] = 0

may throw "Buffer has wrong number of dimensions (expected 1, got 2)" or "Array conditional must be same shape as self" instead of the expected errors.
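As a rough illustration of what the guard distinguishes, the plain-Python stand-in below approximates the check. pandas ships its own `pandas.api.types.is_hashable`; the two functions here are simplified assumptions written for this example, not the code in the diff.

```python
def is_hashable(obj):
    """Simplified stand-in: True if hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


def all_keys_hashable(key):
    """True when every element of `key` could serve as a column label."""
    # all() short-circuits, so the scan stops at the first unhashable element.
    return all(is_hashable(k) for k in key)


# Ordinary column labels, including tuples, are hashable.
print(all_keys_hashable(["a", ("x", "y")]))   # True

# A nested, 2d-like key has list elements, which are not hashable.
print(all_keys_hashable([["a", "b"], ["c"]]))  # False
```

The cost is linear in the length of `key` in the all-hashable case, which is the performance concern raised above.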

WillAyd · 2019-09-05T17:12:06Z

doc/source/whatsnew/v1.0.0.rst

+Assignment to multiple columns of a DataFrame when some columns do not exist
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist would previously assign the values to the last column. Now, new columns would be constructed withe the right values. (:issue:`13658`)


Suggested change

Assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist would previously assign the values to the last column. Now, new columns would be constructed withe the right values. (:issue:`13658`)

Assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist would previously assign the values to the last column. Now, new columns would be constructed with the right values. (:issue:`13658`)

WillAyd · 2019-09-05T17:13:18Z

pandas/core/frame.py

@@ -3007,6 +3007,12 @@ def _setitem_array(self, key, value):
                for k1, k2 in zip(key, value.columns):
                    self[k1] = value[k2]
            else:
+                if all(is_hashable(k) for k in key):


Yea this seems strange to me. I feel like this should be hashable inherently so clarification here would be helpful

WillAyd · 2019-09-05T17:14:04Z

pandas/core/indexing.py

@@ -197,6 +198,19 @@ def _get_setitem_indexer(self, key):
    def __setitem__(self, key, value):
        if isinstance(key, tuple):
            key = tuple(com.apply_if_callable(x, self.obj) for x in key)
+            if (


Maybe this gets simplified if the hashable stuff can be removed, but can you add a comment here as to what is going on? Not immediately apparent what all of these conditions are

WillAyd · 2019-09-05T17:16:40Z

pandas/tests/frame/test_indexing.py

+    def test_setitem_list_missing_columns(self, float_frame, columns, box):
+        # GH 26534
+        result = float_frame.copy()
+        result[columns] = box(float_frame)


Hmm not really sure what the point of this parametrization is - can you construct test(s) that don't overwrite the existing column data before the assignment? Seems unrelated unless I am missing something

I parametrized the tests as #26534 (comment) suggests. What do you propose?

jreback · 2019-09-08T17:16:23Z

@howsiwei have been away for a while, can you merge master and i'll look again.

jreback · 2019-10-06T23:38:38Z

@howsiwei can you merge master and i'll have another look

WillAyd · 2019-10-22T02:04:40Z

closing as stale but @howsiwei ping if you'd like to pick this back up

howsiwei · 2019-11-01T09:47:00Z

@WillAyd I'll pick this back up in these few days.

WillAyd · 2019-11-01T14:53:22Z

Sounds good. I don't have the option to reopen though - did you delete the branch on GitHub? If so unfortunately might need to push a new PR

howsiwei · 2019-11-01T15:00:01Z

I suspect it's because I rebased the branch on latest master.

howsiwei · 2019-11-02T03:20:22Z

I have created a new PR #29334.

howsiwei force-pushed the assign branch 2 times, most recently from 06ee05b to b732bf9 Compare May 27, 2019 09:55

howsiwei force-pushed the assign branch from b732bf9 to bce48eb Compare May 27, 2019 10:04

howsiwei force-pushed the assign branch from bce48eb to 6cc0885 Compare May 27, 2019 10:11

simonjayhawkins reviewed May 27, 2019

simonjayhawkins added the Indexing Related to indexing on series/frames, not to indexes themselves label May 27, 2019

jreback requested changes May 27, 2019

howsiwei force-pushed the assign branch 10 times, most recently from c9ff25a to d3190d3 Compare May 28, 2019 12:27

howsiwei changed the title ~~Fix assignment to multiple columns when some column do not exist~~ BUG: Fix assignment to multiple columns when some column do not exist May 28, 2019

howsiwei changed the title ~~BUG: Fix assignment to multiple columns when some column do not exist~~ BUG: fix assignment to multiple columns when some column do not exist May 28, 2019

howsiwei changed the title ~~BUG: fix assignment to multiple columns when some column do not exist~~ BUG: assignment to multiple columns when some column do not exist May 28, 2019

howsiwei force-pushed the assign branch 3 times, most recently from 98de71c to eddd29b Compare May 28, 2019 13:41

simonjayhawkins reviewed May 29, 2019

simonjayhawkins requested changes Jun 1, 2019

jreback requested changes Jun 3, 2019

howsiwei force-pushed the assign branch from 9f00eea to ff9ac6a Compare June 5, 2019 04:19

jreback requested changes Jul 2, 2019

howsiwei force-pushed the assign branch from f038bc9 to 54107ac Compare July 23, 2019 09:44

jreback requested changes Jul 23, 2019

howsiwei force-pushed the assign branch 3 times, most recently from de154f1 to 7cbd30f Compare August 2, 2019 06:49

howsiwei force-pushed the assign branch 2 times, most recently from 692a290 to 41a560e Compare August 7, 2019 03:05

howsiwei force-pushed the assign branch 3 times, most recently from be1b5e3 to 414fe51 Compare August 14, 2019 12:50

Fix assignment to multiple columns when some column do not exist

5469912

howsiwei force-pushed the assign branch from 74d3a28 to 197db57 Compare August 18, 2019 07:26

Parametrize some tests

3622744

howsiwei force-pushed the assign branch from 197db57 to 3622744 Compare August 18, 2019 09:49

TomAugspurger reviewed Aug 20, 2019

WillAyd requested changes Sep 5, 2019

WillAyd closed this Oct 22, 2019

howsiwei mentioned this pull request Nov 2, 2019

BUG: assignment to multiple columns when some column do not exist #29334

Merged

5 tasks
