Skip to content

BUG: DataFrame.__setitem__ not raising ValueError when rhs is df and has wrong number of columns #39341

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 24, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -286,6 +286,7 @@ Indexing
- Bug in :meth:`DataFrame.__setitem__` raising ``ValueError`` with empty :class:`DataFrame` and specified columns for string indexer and non empty :class:`DataFrame` to set (:issue:`38831`)
- Bug in :meth:`DataFrame.loc.__setitem__` raising ValueError when expanding unique column for :class:`DataFrame` with duplicate columns (:issue:`38521`)
- Bug in :meth:`DataFrame.iloc.__setitem__` and :meth:`DataFrame.loc.__setitem__` with mixed dtypes when setting with a dictionary value (:issue:`38335`)
- Bug in :meth:`DataFrame.__setitem__` not raising ``ValueError`` when right hand side is a :class:`DataFrame` with wrong number of columns (:issue:`38604`)
- Bug in :meth:`DataFrame.loc` dropping levels of :class:`MultiIndex` when :class:`DataFrame` used as input has only one row (:issue:`10521`)
- Bug in setting ``timedelta64`` values into numeric :class:`Series` failing to cast to object dtype (:issue:`39086`)
- Bug in setting :class:`Interval` values into a :class:`Series` or :class:`DataFrame` with mismatched :class:`IntervalDtype` incorrectly casting the new values to the existing dtype (:issue:`39120`)
Expand Down
20 changes: 14 additions & 6 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -3223,7 +3223,7 @@ def _setitem_array(self, key, value):
self._check_setitem_copy()
self.iloc[indexer] = value
else:
if isinstance(value, DataFrame):
if isinstance(value, DataFrame) and self.columns.is_unique:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will there be a problem on L3230 if value.columns is not unique?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No but maybe at 3227

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends of the key references a dup column

Copy link
Member Author

@phofl phofl Jan 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me rephrase this:
We don't get there, becasue 3227 would raise. If we would change the condition to recognize duplicates, we would get a problem in 3229, because the lenght of both lists would not match.
If we also fix this, we have to check that in case of a key which is duplicated in the columns, we select two columns and not only one in 3230 on the rhs

Edit: Similar problem if rhs would have duplicates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yah the other case im now unsure of is not all(x in self.columns for x in key)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes, this has ugly side effects. We have to find a way to handle duplicates there instead of dispatching down :(

And improve the test coverage of these cases

if len(value.columns) != len(key):
raise ValueError("Columns must be same length as key")
for k1, k2 in zip(key, value.columns):
Expand Down Expand Up @@ -3256,12 +3256,20 @@ def _setitem_frame(self, key, value):
def _set_item_frame_value(self, key, value: DataFrame) -> None:
self._ensure_valid_index(value)

# align right-hand-side columns if self.columns
# is multi-index and self[key] is a sub-frame
if isinstance(self.columns, MultiIndex) and key in self.columns:
# align columns
if key in self.columns:
loc = self.columns.get_loc(key)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment: # align columns before L3259

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if isinstance(loc, (slice, Series, np.ndarray, Index)):
cols = maybe_droplevels(self.columns[loc], key)
cols = self.columns[loc]
len_cols = 1 if is_scalar(cols) else len(cols)
if len_cols != len(value.columns):
raise ValueError("Columns must be same length as key")

# align right-hand-side columns if self.columns
# is multi-index and self[key] is a sub-frame
if isinstance(self.columns, MultiIndex) and isinstance(
loc, (slice, Series, np.ndarray, Index)
):
cols = maybe_droplevels(cols, key)
if len(cols) and not cols.equals(value.columns):
value = value.reindex(cols, axis=1)

Expand Down
17 changes: 17 additions & 0 deletions pandas/tests/frame/indexing/test_setitem.py
Original file line number Diff line number Diff line change
Expand Up @@ -366,6 +366,23 @@ def test_setitem_frame_duplicate_columns(self):
)
tm.assert_frame_equal(df, expected)

@pytest.mark.parametrize("cols", [["a", "b", "c"], ["a", "a", "a"]])
def test_setitem_df_wrong_column_number(self, cols):
# GH#38604
df = DataFrame([[1, 2, 3]], columns=cols)
rhs = DataFrame([[10, 11]], columns=["d", "e"])
msg = "Columns must be same length as key"
with pytest.raises(ValueError, match=msg):
df["a"] = rhs

def test_setitem_listlike_indexer_duplicate_columns(self):
# GH#38604
df = DataFrame([[1, 2, 3]], columns=["a", "b", "b"])
rhs = DataFrame([[10, 11, 12]], columns=["d", "e", "c"])
df[["a", "b"]] = rhs
expected = DataFrame([[10, 11, 12]], columns=["a", "b", "b"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this obvious? is rhs.columns irrelevant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean with obvious?

Yes columns are irrelevant with setitem

tm.assert_frame_equal(df, expected)


class TestDataFrameSetItemWithExpansion:
def test_setitem_listlike_views(self):
Expand Down