Skip to content

BUG: assignment to multiple columns when some column do not exist #26534

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

howsiwei
Copy link
Contributor

@howsiwei howsiwei commented May 27, 2019

In particular, the following code now behaves correctly.

import pandas as pd
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})
df[['a', 'c']] = 1
print(df)

v0.22: error

master: column 'c' is converted to index -1, which causes last column to be overwritten.

   a  b
0  1  1
1  1  1
2  1  1

After this PR:

   a  b  c
0  1  3  1
1  1  4  1
2  1  5  1

@codecov
Copy link

codecov bot commented May 27, 2019

Codecov Report

Merging #26534 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26534      +/-   ##
==========================================
- Coverage   91.76%   91.76%   -0.01%     
==========================================
  Files         174      174              
  Lines       50629    50632       +3     
==========================================
- Hits        46462    46461       -1     
- Misses       4167     4171       +4
Flag Coverage Δ
#multiple 90.3% <100%> (ø) ⬆️
#single 41.68% <0%> (-0.09%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.01% <100%> (-0.12%) ⬇️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a516c1...7bc1856. Read the comment docs.

@codecov
Copy link

codecov bot commented May 27, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@24bd67e). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #26534   +/-   ##
=========================================
  Coverage          ?   91.87%           
=========================================
  Files             ?      174           
  Lines             ?    50701           
  Branches          ?        0           
=========================================
  Hits              ?    46581           
  Misses            ?     4120           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.41% <100%> (?)
#single 41.79% <57.69%> (?)
Impacted Files Coverage Δ
pandas/core/frame.py 97% <100%> (ø)
pandas/core/indexing.py 93.6% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24bd67e...3054b6d. Read the comment docs.

@howsiwei howsiwei force-pushed the assign branch 2 times, most recently from 06ee05b to b732bf9 Compare May 27, 2019 09:55
@pep8speaks
Copy link

pep8speaks commented May 27, 2019

Hello @howsiwei! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-18 09:49:14 UTC

@howsiwei
Copy link
Contributor Author

@jreback any feedback?

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@howsiwei Thanks for the PR.

IIUC the issue relates to the scalar assignment to multiple columns of a DataFrame using setitem only.

test_loc_setitem_corner in pandas\tests\series\indexing\test_loc.py is currently failing. Changing the behavior of Series assignment is not covered in the scope of the issue.

@simonjayhawkins simonjayhawkins added the Indexing Related to indexing on series/frames, not to indexes themselves label May 27, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is adding quite a bit of code; please try to streamline and not special case

@@ -425,6 +425,7 @@ Indexing
- Bug in which :meth:`DataFrame.to_csv` caused a segfault for a reindexed data frame, when the indices were single-level :class:`MultiIndex` (:issue:`26303`).
- Fixed bug where assigning a :class:`arrays.PandasArray` to a :class:`pandas.core.frame.DataFrame` would raise error (:issue:`26390`)
- Allow keyword arguments for callable local reference used in the :method:`DataFrame.query` string (:issue:`26426`)
- Bug in assignment to multiple columns of a `DataFrame` when some of the columns do not exist (:issue:`13658`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use :class:`DataFrame` ; this is assignment with a scalar, yes?

Copy link
Contributor Author

@howsiwei howsiwei May 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should allow the same set of values as when the columns are present, eg. scalar, 1d array, 2d array, dataframe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use :class:`DataFrame`

can you update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@howsiwei
Copy link
Contributor Author

howsiwei commented May 28, 2019

Hi, thanks for the feedback. I have updated to a much cleaner code.

@howsiwei howsiwei force-pushed the assign branch 10 times, most recently from c9ff25a to d3190d3 Compare May 28, 2019 12:27
@howsiwei howsiwei changed the title Fix assignment to multiple columns when some column do not exist BUG: Fix assignment to multiple columns when some column do not exist May 28, 2019
@howsiwei howsiwei changed the title BUG: Fix assignment to multiple columns when some column do not exist BUG: fix assignment to multiple columns when some column do not exist May 28, 2019
@howsiwei howsiwei changed the title BUG: fix assignment to multiple columns when some column do not exist BUG: assignment to multiple columns when some column do not exist May 28, 2019
@howsiwei howsiwei force-pushed the assign branch 3 times, most recently from 98de71c to eddd29b Compare May 28, 2019 13:41
Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests are much easier to read. thanks.

a few minor comments on the tests

we are also going to need another test for scalar assignment with some missing columns to cover the other issue. unless we have one.

do we have a test to check these cases raise a warning on getting after this change?

I think we should also have a test to now check that the message is not raised when setting (or do we have a test for this?)

not looked at the actual code, too many tests are currently failing.

@@ -425,6 +425,7 @@ Indexing
- Bug in which :meth:`DataFrame.to_csv` caused a segfault for a reindexed data frame, when the indices were single-level :class:`MultiIndex` (:issue:`26303`).
- Fixed bug where assigning a :class:`arrays.PandasArray` to a :class:`pandas.core.frame.DataFrame` would raise error (:issue:`26390`)
- Allow keyword arguments for callable local reference used in the :method:`DataFrame.query` string (:issue:`26426`)
- Bug in assignment to multiple columns of a `DataFrame` when some of the columns do not exist (:issue:`13658`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use :class:`DataFrame`

can you update.

labels = item_labels[info_idx]
if has_missing_columns:
labels = [idx['key'] if isinstance(idx, dict) else
item_labels[idx]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@howsiwei you are adding WAY too much complexity here. You need to see the minimal change set to make this change work, it is likely way simpler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trace other setter calls and see

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback mind suggesting where I should make the changes? The code for setting values is quite complicated and it's not obvious to me at all how to achieve the desired result with fewer changes.

Copy link
Contributor Author

@howsiwei howsiwei Jun 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback since value can have multiple types (eg. scalar, 1d array, 2d array, DataFrame), it's not easy to split value into value for each column. The only way I see to not duplicate the value splitting code is to modify the helper function setter, where value has already been split. But I don't see how the changes can be made much simpler when modifying this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback any updates?

Copy link
Contributor Author

@howsiwei howsiwei Jul 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback thanks for the suggestion. I'm guessing what you mean is that we could simply insert the column before setting. There are 2 downsides to this approach though:

  • It may be less efficient.
  • More importantly, currently pandas would try to infer the fill values for a column. Consider the following code:
import pandas as pd

df = pd.DataFrame({'a': [0, 0]})
df.loc[0, 'b'] = pd.Timestamp('20120101')
df.loc[0, 'c'] = 1.0
print(df)

Output:

   a                    b    c
0  0  2012-01-01 00:00:00  1.0
1  0                  NaT  NaN

Note that column b and c are filled with NaT and NaN respectively.

To infer the fill values for each column, we need to split the values into separated columns first. But this splitting is only done after line 465.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already do this for expansions

my point is you are introducing another code path here

so either remove and fix the existing one (non trivial) or integrate this change

the dtype changes are expected in the current implementation

Copy link
Contributor Author

@howsiwei howsiwei Jul 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already do this for expansions

Do what?

the dtype changes are expected in the current implementation

What do you mean? Where does the dtype change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Hmm after reading this again I think I understand what you said. Do you mean that using NaN as fill values for all columns is fine because this is already done elsewhere in pandas code? What are expansions though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback can you please confirm if I should just fix this issue by inserting columns filled with NaN first?

r" \[columns\]")
with pytest.raises(KeyError, match=msg):
self.frame.ix[:, ['E']] = 1
# partial setting now allows this GH13658
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don;t comment out, simply remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented out the test following the test immediately below on line 1136. Should it be removed too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes try that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes try that

Yes for this question too?

@jreback can you please confirm if I should just fix this issue by inserting columns filled with NaN first?

labels = item_labels[info_idx]
if has_missing_columns:
labels = [idx['key'] if isinstance(idx, dict) else
item_labels[idx]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is what I would do. I would touch no code here at all; rather add what you need into the loop above around 328. This handles the missing indexer case.

Then we can assert that we have no missing items at all, IOW what you have at line 307 (what you define has_missing columns), could be true there but MUST be false at line 465.

This entire setting routine is very complex. We want to make this simpler and easier to understand; that may take a bigger refactor, but let's start by not adding additional complexity as much as possible.

i know this is non-trivial, but I am holding the line here because this is already pretty impenetrable code.

love for you to make it better!

@howsiwei
Copy link
Contributor Author

@jreback what do you think about the current changes? Also do you have any idea why It's failing on Linux py37_np_dev?

return self._get_listlike_indexer(obj, axis, **kwargs)[1]
try:
# When setting, missing keys are not allowed, even with # .loc:
kwargs = {"raise_missing": True if is_setter else raise_missing}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't want to do this in a try/except as its extremely error prone. Instead if you have a is_setter you can just make sure the columns exist (but doing the _convert_to_indexer).

I think if you do this first e.g. almost first in _setitem_with_indexer you won't need to make the code change you have above about taking the split path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I force take_split_path = True to work around #27583. Otherwise I would fail tests such as test_setitem_list_all_missing_columns_scalar in pandas/tests/frame/test_indexing.py.

@@ -208,6 +208,43 @@ def test_setitem_list_of_tuples(self, float_frame):
expected = Series(tuples, index=float_frame.index, name="tuples")
assert_series_equal(result, expected)

def test_setitem_list_all_missing_columns_scalar(self, float_frame):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize these, typicall we add an arg called box to do this, eg.

@pytest.mark.parameterize('box', [lambda x: 1, lambda x: [1, 2], lambda x: x[['B', 'C']]])
...

Copy link
Contributor Author

@howsiwei howsiwei Jul 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to get the expected value for different right hand side. Furthermore, how do I parametrize the left-hand side?

@@ -808,6 +808,46 @@ def test_loc_setitem_with_scalar_index(self, indexer, value):

assert is_scalar(result) and result == "Z"

def test_loc_setitem_missing_columns_scalar_index_list_value(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@@ -85,7 +85,7 @@ Interval
Indexing
^^^^^^^^

-
- Bug in assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist (:issue:`13658`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will need a subsection to explain the change. move to 1.0 as this is a non-trivial change of a longstanding issue.

Copy link
Contributor Author

@howsiwei howsiwei Jul 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subsection added.

@howsiwei howsiwei force-pushed the assign branch 3 times, most recently from de154f1 to 7cbd30f Compare August 2, 2019 06:49
@howsiwei howsiwei force-pushed the assign branch 2 times, most recently from 692a290 to 41a560e Compare August 7, 2019 03:05
@howsiwei howsiwei force-pushed the assign branch 3 times, most recently from be1b5e3 to 414fe51 Compare August 14, 2019 12:50
@howsiwei
Copy link
Contributor Author

@jreback I have moved my entire fix to earlier stage of setting as #27604 remove is_setter which is required in my earlier approach. What do you think about it?

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jeff may be offline for a bit. We'll get to this before 1.0 though.

@@ -3007,6 +3007,12 @@ def _setitem_array(self, key, value):
for k1, k2 in zip(key, value.columns):
self[k1] = value[k2]
else:
if all(is_hashable(k) for k in key):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what cases can elements of key not be hashable? These are our eventual column names, right? Which I think we require they be hashable.

I'm a bit concerned about the performance here, for the case when key is large (which is possible, right?)

Copy link
Contributor Author

@howsiwei howsiwei Aug 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason that I check whether key is hashable is to produce the desired errors, which is tested in

with pytest.raises(
(ValueError, AttributeError, TypeError, pd.core.indexing.IndexingError),
match=msg,
):
idxr[nd3] = 0

If producing all these errors are not required then this check can be omitted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger what's your thought on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the check is removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error produced is different.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this seems strange to me. I feel like this should be hashable inherently so clarification here would be helpful

Copy link
Contributor Author

@howsiwei howsiwei Nov 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without testing is_hashable,

may throw Buffer has wrong number of dimensions (expected 1, got 2) or Array conditional must be same shape as self instead of the expected errors.

Assignment to multiple columns of a DataFrame when some columns do not exist
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist would previously assign the values to the last column. Now, new columns would be constructed withe the right values. (:issue:`13658`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist would previously assign the values to the last column. Now, new columns would be constructed withe the right values. (:issue:`13658`)
Assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist would previously assign the values to the last column. Now, new columns would be constructed with the right values. (:issue:`13658`)

@@ -3007,6 +3007,12 @@ def _setitem_array(self, key, value):
for k1, k2 in zip(key, value.columns):
self[k1] = value[k2]
else:
if all(is_hashable(k) for k in key):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this seems strange to me. I feel like this should be hashable inherently so clarification here would be helpful

@@ -197,6 +198,19 @@ def _get_setitem_indexer(self, key):
def __setitem__(self, key, value):
if isinstance(key, tuple):
key = tuple(com.apply_if_callable(x, self.obj) for x in key)
if (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this gets simplified if the hashable stuff can be removed, but can you add a comment here as to what is going on? Not immediately apparent what all of these conditions are

def test_setitem_list_missing_columns(self, float_frame, columns, box):
# GH 26534
result = float_frame.copy()
result[columns] = box(float_frame)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not really sure what the point of this parametrization is - can you construct test(s) that don't overwrite the existing column data before the assignment? Seems unrelated unless I am missing something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I parametrize the tests as #26534 (comment) suggest. What do you propose?

@jreback
Copy link
Contributor

jreback commented Sep 8, 2019

@howsiwei have been away for a while, can you merge master and i'll look again.

@jreback
Copy link
Contributor

jreback commented Oct 6, 2019

@howsiwei can you merge master and i'll have another look

@WillAyd
Copy link
Member

WillAyd commented Oct 22, 2019

closing as stale but @howsiwei ping if you'd like to pick this backup

@WillAyd WillAyd closed this Oct 22, 2019
@howsiwei
Copy link
Contributor Author

howsiwei commented Nov 1, 2019

@WillAyd I'll pick this back up in these few days.

@WillAyd
Copy link
Member

WillAyd commented Nov 1, 2019

Sounds good. I don't have the option to reopen though - did you delete the branch on GitHub? If so unfortunately might need to push a new PR

@howsiwei
Copy link
Contributor Author

howsiwei commented Nov 1, 2019

I suspect it's because I rebase the branch on latest master.

@howsiwei
Copy link
Contributor Author

howsiwei commented Nov 2, 2019

I have created a new PR #29334.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Assignment to multiple columns only works if they existed before
6 participants