Skip to content

BUG: assignment to multiple columns when some column do not exist #26534

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
31 changes: 31 additions & 0 deletions doc/source/whatsnew/v1.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,37 @@ Backwards incompatible API changes
- :class:`pandas.core.groupby.GroupBy.transform` now raises on invalid operation names (:issue:`27489`).
-

.. _whatsnew_1000.api_breaking.multicolumn_assignment:

Assignment to multiple columns of a DataFrame when some columns do not exist
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist would previously assign the values to the last column. Now, new columns would be constructed withe the right values. (:issue:`13658`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist would previously assign the values to the last column. Now, new columns would be constructed withe the right values. (:issue:`13658`)
Assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist would previously assign the values to the last column. Now, new columns would be constructed with the right values. (:issue:`13658`)


.. ipython:: python

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})
df

*Previous behavior*:

.. code-block:: ipython

In [3]: df[['a', 'c']] = 1
In [4]: df
Out[4]:
a b
0 1 1
1 1 1
2 1 1

*New behavior*:

.. ipython:: python

df[['a', 'c']] = 1
df

.. _whatsnew_1000.api.other:

Other API changes
Expand Down
6 changes: 6 additions & 0 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -3007,6 +3007,12 @@ def _setitem_array(self, key, value):
for k1, k2 in zip(key, value.columns):
self[k1] = value[k2]
else:
if all(is_hashable(k) for k in key):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what cases can elements of key not be hashable? These are our eventual column names, right? Which I think we require they be hashable.

I'm a bit concerned about the performance here, for the case when key is large (which is possible, right?)

Copy link
Contributor Author

@howsiwei howsiwei Aug 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason that I check whether key is hashable is to produce the desired errors, which is tested in

with pytest.raises(
(ValueError, AttributeError, TypeError, pd.core.indexing.IndexingError),
match=msg,
):
idxr[nd3] = 0

If producing all these errors are not required then this check can be omitted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger what's your thought on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the check is removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error produced is different.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this seems strange to me. I feel like this should be hashable inherently so clarification here would be helpful

Copy link
Contributor Author

@howsiwei howsiwei Nov 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without testing is_hashable,

may throw Buffer has wrong number of dimensions (expected 1, got 2) or Array conditional must be same shape as self instead of the expected errors.

for k in key:
try:
self[k]
except KeyError:
self[k] = np.nan
indexer = self.loc._get_listlike_indexer(
key, axis=1, raise_missing=False
)[1]
Expand Down
14 changes: 14 additions & 0 deletions pandas/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
from pandas.core.dtypes.common import (
ensure_platform_int,
is_float,
is_hashable,
is_integer,
is_integer_dtype,
is_iterator,
Expand Down Expand Up @@ -197,6 +198,19 @@ def _get_setitem_indexer(self, key):
def __setitem__(self, key, value):
if isinstance(key, tuple):
key = tuple(com.apply_if_callable(x, self.obj) for x in key)
if (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this gets simplified if the hashable stuff can be removed, but can you add a comment here as to what is going on? Not immediately apparent what all of these conditions are

self.name == "loc"
and len(key) > 1
and is_list_like_indexer(key[1])
and not isinstance(key[1], tuple)
and not com.is_bool_indexer(key[1])
and all(is_hashable(k) for k in key[1])
):
for k in key[1]:
try:
self.obj[k]
except KeyError:
self.obj[k] = np.nan
else:
key = com.apply_if_callable(key, self.obj)
indexer = self._get_setitem_indexer(key)
Expand Down
41 changes: 23 additions & 18 deletions pandas/tests/frame/test_indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -208,6 +208,29 @@ def test_setitem_list_of_tuples(self, float_frame):
expected = Series(tuples, index=float_frame.index, name="tuples")
assert_series_equal(result, expected)

@pytest.mark.parametrize("columns", [["A", "E"], ["E", "F"]])
@pytest.mark.parametrize(
"box",
[
lambda x: 1,
lambda x: [1, 2],
lambda x: np.array([1, 2]),
lambda x: x[["B", "C"]],
lambda x: x[["B", "A"]].values,
lambda x: x[["A", "C"]].values.tolist(),
],
)
def test_setitem_list_missing_columns(self, float_frame, columns, box):
# GH 26534
result = float_frame.copy()
result[columns] = box(float_frame)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm not really sure what the point of this parametrization is - can you construct test(s) that don't overwrite the existing column data before the assignment? Seems unrelated unless I am missing something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I parametrize the tests as #26534 (comment) suggest. What do you propose?

expected = float_frame.copy()
for col in columns:
if col not in expected.columns:
expected[col] = np.nan
expected[columns] = box(float_frame)
assert_frame_equal(result, expected)

def test_setitem_mulit_index(self):
# GH7655, test that assigning to a sub-frame of a frame
# with multi-index columns aligns both rows and columns
Expand Down Expand Up @@ -501,13 +524,6 @@ def test_setitem(self, float_frame):
float_frame["col6"] = series
tm.assert_series_equal(series, float_frame["col6"], check_names=False)

msg = (
r"\"None of \[Float64Index\(\[.*dtype='float64'\)\] are in the"
r" \[columns\]\""
)
with pytest.raises(KeyError, match=msg):
float_frame[np.random.randn(len(float_frame) + 1)] = 1

# set ndarray
arr = np.random.randn(len(float_frame))
float_frame["col9"] = arr
Expand Down Expand Up @@ -1143,17 +1159,6 @@ def test_fancy_index_int_labels_exceptions(self, float_frame):
)
with pytest.raises(KeyError, match=msg):
float_frame.ix[["foo", "bar", "baz"]] = 1
msg = (
r"None of \[Index\(\['E'\], dtype='object'\)\] are in the"
r" \[columns\]"
)
with pytest.raises(KeyError, match=msg):
float_frame.ix[:, ["E"]] = 1

# FIXME: don't leave commented-out
# partial setting now allows this GH2578
# pytest.raises(KeyError, float_frame.ix.__setitem__,
# (slice(None, None), 'E'), 1)

def test_setitem_fancy_mixed_2d(self, float_string_frame):

Expand Down
24 changes: 24 additions & 0 deletions pandas/tests/indexing/test_loc.py
Original file line number Diff line number Diff line change
Expand Up @@ -808,6 +808,30 @@ def test_loc_setitem_with_scalar_index(self, indexer, value):

assert is_scalar(result) and result == "Z"

@pytest.mark.parametrize(
"index,box",
[
((1, ["C", "D"]), [7, 8]),
(
(slice(None, None, None), ["A", "C"]),
pd.DataFrame([[7, 8], [9, 10], [11, 12]], columns=["A", "C"]),
),
(([0, 2], ["B", "C", "D"]), 9),
((slice(1, 3, None), ["B", "C", "D"]), [[7, 8, 9], [10, 11, 12]]),
],
)
def test_loc_setitem_missing_columns(self, index, box):
# GH 26534
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])
result = df.copy()
result.loc[index] = box
expected = df.copy()
for col in index[1]:
if col not in expected.columns:
expected[col] = np.nan
expected.loc[index] = box
tm.assert_frame_equal(result, expected)

def test_loc_coercion(self):

# 12411
Expand Down