-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: assignment to multiple columns when some column do not exist #26534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26534 +/- ##
==========================================
- Coverage 91.76% 91.76% -0.01%
==========================================
Files 174 174
Lines 50629 50632 +3
==========================================
- Hits 46462 46461 -1
- Misses 4167 4171 +4
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26534 +/- ##
=========================================
Coverage ? 91.87%
=========================================
Files ? 174
Lines ? 50701
Branches ? 0
=========================================
Hits ? 46581
Misses ? 4120
Partials ? 0
Continue to review full report at Codecov.
|
06ee05b
to
b732bf9
Compare
@jreback any feedback? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@howsiwei Thanks for the PR.
IIUC the issue relates to the scalar assignment to multiple columns of a DataFrame using setitem only.
test_loc_setitem_corner in pandas\tests\series\indexing\test_loc.py is currently failing. Changing the behavior of Series assignment is not covered in the scope of the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is adding quite a bit of code; please try to streamline and not special case
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -425,6 +425,7 @@ Indexing | |||
- Bug in which :meth:`DataFrame.to_csv` caused a segfault for a reindexed data frame, when the indices were single-level :class:`MultiIndex` (:issue:`26303`). | |||
- Fixed bug where assigning a :class:`arrays.PandasArray` to a :class:`pandas.core.frame.DataFrame` would raise error (:issue:`26390`) | |||
- Allow keyword arguments for callable local reference used in the :method:`DataFrame.query` string (:issue:`26426`) | |||
- Bug in assignment to multiple columns of a `DataFrame` when some of the columns do not exist (:issue:`13658`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use :class:`DataFrame`
; this is assignment with a scalar, yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should allow the same set of values as when the columns are present, eg. scalar, 1d array, 2d array, dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use
:class:`DataFrame`
can you update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
Hi, thanks for the feedback. I have updated to a much cleaner code. |
c9ff25a
to
d3190d3
Compare
98de71c
to
eddd29b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests are much easier to read. thanks.
a few minor comments on the tests
we are also going to need another test for scalar assignment with some missing columns to cover the other issue. unless we have one.
do we have a test to check these cases raise a warning on getting after this change?
I think we should also have a test to now check that the message is not raised when setting (or do we have a test for this?)
not looked at the actual code, too many tests are currently failing.
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -425,6 +425,7 @@ Indexing | |||
- Bug in which :meth:`DataFrame.to_csv` caused a segfault for a reindexed data frame, when the indices were single-level :class:`MultiIndex` (:issue:`26303`). | |||
- Fixed bug where assigning a :class:`arrays.PandasArray` to a :class:`pandas.core.frame.DataFrame` would raise error (:issue:`26390`) | |||
- Allow keyword arguments for callable local reference used in the :method:`DataFrame.query` string (:issue:`26426`) | |||
- Bug in assignment to multiple columns of a `DataFrame` when some of the columns do not exist (:issue:`13658`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use
:class:`DataFrame`
can you update.
pandas/core/indexing.py
Outdated
labels = item_labels[info_idx] | ||
if has_missing_columns: | ||
labels = [idx['key'] if isinstance(idx, dict) else | ||
item_labels[idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@howsiwei you are adding WAY too much complexity here. You need to see the minimal change set to make this change work, it is likely way simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trace other setter calls and see
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback mind suggesting where I should make the changes? The code for setting values is quite complicated and it's not obvious to me at all how to achieve the desired result with fewer changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback since value
can have multiple types (eg. scalar, 1d array, 2d array, DataFrame
), it's not easy to split value
into value for each column. The only way I see to not duplicate the value splitting code is to modify the helper function setter
, where value
has already been split. But I don't see how the changes can be made much simpler when modifying this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback any updates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback thanks for the suggestion. I'm guessing what you mean is that we could simply insert the column before setting. There are 2 downsides to this approach though:
- It may be less efficient.
- More importantly, currently
pandas
would try to infer the fill values for a column. Consider the following code:
import pandas as pd
df = pd.DataFrame({'a': [0, 0]})
df.loc[0, 'b'] = pd.Timestamp('20120101')
df.loc[0, 'c'] = 1.0
print(df)
Output:
a b c
0 0 2012-01-01 00:00:00 1.0
1 0 NaT NaN
Note that column b
and c
are filled with NaT
and NaN
respectively.
To infer the fill values for each column, we need to split the values into separated columns first. But this splitting is only done after line 465.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already do this for expansions
my point is you are introducing another code path here
so either remove and fix the existing one (non trivial) or integrate this change
the dtype changes are expected in the current implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already do this for expansions
Do what?
the dtype changes are expected in the current implementation
What do you mean? Where does the dtype change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Hmm after reading this again I think I understand what you said. Do you mean that using NaN
as fill values for all columns is fine because this is already done elsewhere in pandas code? What are expansions though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback can you please confirm if I should just fix this issue by inserting columns filled with NaN
first?
pandas/tests/frame/test_indexing.py
Outdated
r" \[columns\]") | ||
with pytest.raises(KeyError, match=msg): | ||
self.frame.ix[:, ['E']] = 1 | ||
# partial setting now allows this GH13658 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don;t comment out, simply remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented out the test following the test immediately below on line 1136. Should it be removed too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes try that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes try that
Yes for this question too?
@jreback can you please confirm if I should just fix this issue by inserting columns filled with
NaN
first?
pandas/core/indexing.py
Outdated
labels = item_labels[info_idx] | ||
if has_missing_columns: | ||
labels = [idx['key'] if isinstance(idx, dict) else | ||
item_labels[idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is what I would do. I would touch no code here at all; rather add what you need into the loop above around 328. This handles the missing indexer case.
Then we can assert that we have no missing items at all, IOW what you have at line 307 (what you define has_missing columns), could be true there but MUST be false at line 465.
This entire setting routine is very complex. We want to make this simpler and easier to understand; that may take a bigger refactor, but let's start by not adding additional complexity as much as possible.
i know this is non-trivial, but I am holding the line here because this is already pretty impenetrable code.
love for you to make it better!
@jreback what do you think about the current changes? Also do you have any idea why It's failing on Linux py37_np_dev? |
pandas/core/indexing.py
Outdated
return self._get_listlike_indexer(obj, axis, **kwargs)[1] | ||
try: | ||
# When setting, missing keys are not allowed, even with # .loc: | ||
kwargs = {"raise_missing": True if is_setter else raise_missing} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't want to do this in a try/except as its extremely error prone. Instead if you have a is_setter you can just make sure the columns exist (but doing the _convert_to_indexer).
I think if you do this first e.g. almost first in _setitem_with_indexer you won't need to make the code change you have above about taking the split path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I force take_split_path = True
to work around #27583. Otherwise I would fail tests such as test_setitem_list_all_missing_columns_scalar
in pandas/tests/frame/test_indexing.py
.
pandas/tests/frame/test_indexing.py
Outdated
@@ -208,6 +208,43 @@ def test_setitem_list_of_tuples(self, float_frame): | |||
expected = Series(tuples, index=float_frame.index, name="tuples") | |||
assert_series_equal(result, expected) | |||
|
|||
def test_setitem_list_all_missing_columns_scalar(self, float_frame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize these, typicall we add an arg called box to do this, eg.
@pytest.mark.parameterize('box', [lambda x: 1, lambda x: [1, 2], lambda x: x[['B', 'C']]])
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to get the expected value for different right hand side. Furthermore, how do I parametrize the left-hand side?
pandas/tests/indexing/test_loc.py
Outdated
@@ -808,6 +808,46 @@ def test_loc_setitem_with_scalar_index(self, indexer, value): | |||
|
|||
assert is_scalar(result) and result == "Z" | |||
|
|||
def test_loc_setitem_missing_columns_scalar_index_list_value(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
doc/source/whatsnew/v0.25.1.rst
Outdated
@@ -85,7 +85,7 @@ Interval | |||
Indexing | |||
^^^^^^^^ | |||
|
|||
- | |||
- Bug in assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist (:issue:`13658`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will need a subsection to explain the change. move to 1.0 as this is a non-trivial change of a longstanding issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Subsection added.
de154f1
to
7cbd30f
Compare
692a290
to
41a560e
Compare
be1b5e3
to
414fe51
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jeff may be offline for a bit. We'll get to this before 1.0 though.
@@ -3007,6 +3007,12 @@ def _setitem_array(self, key, value): | |||
for k1, k2 in zip(key, value.columns): | |||
self[k1] = value[k2] | |||
else: | |||
if all(is_hashable(k) for k in key): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what cases can elements of key
not be hashable? These are our eventual column names, right? Which I think we require they be hashable.
I'm a bit concerned about the performance here, for the case when key
is large (which is possible, right?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason that I check whether key
is hashable is to produce the desired errors, which is tested in
pandas/pandas/tests/indexing/test_indexing.py
Lines 180 to 184 in e55b698
with pytest.raises( | |
(ValueError, AttributeError, TypeError, pd.core.indexing.IndexingError), | |
match=msg, | |
): | |
idxr[nd3] = 0 |
If producing all these errors are not required then this check can be omitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger what's your thought on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the check is removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error produced is different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea this seems strange to me. I feel like this should be hashable inherently so clarification here would be helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without testing is_hashable
,
pandas/pandas/tests/indexing/test_indexing.py
Line 184 in 3c0cf22
idxr[nd3] = 0 |
Buffer has wrong number of dimensions (expected 1, got 2)
or Array conditional must be same shape as self
instead of the expected errors.
Assignment to multiple columns of a DataFrame when some columns do not exist | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist would previously assign the values to the last column. Now, new columns would be constructed withe the right values. (:issue:`13658`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist would previously assign the values to the last column. Now, new columns would be constructed withe the right values. (:issue:`13658`) | |
Assignment to multiple columns of a :class:`DataFrame` when some of the columns do not exist would previously assign the values to the last column. Now, new columns would be constructed with the right values. (:issue:`13658`) |
@@ -3007,6 +3007,12 @@ def _setitem_array(self, key, value): | |||
for k1, k2 in zip(key, value.columns): | |||
self[k1] = value[k2] | |||
else: | |||
if all(is_hashable(k) for k in key): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea this seems strange to me. I feel like this should be hashable inherently so clarification here would be helpful
@@ -197,6 +198,19 @@ def _get_setitem_indexer(self, key): | |||
def __setitem__(self, key, value): | |||
if isinstance(key, tuple): | |||
key = tuple(com.apply_if_callable(x, self.obj) for x in key) | |||
if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this gets simplified if the hashable
stuff can be removed, but can you add a comment here as to what is going on? Not immediately apparent what all of these conditions are
def test_setitem_list_missing_columns(self, float_frame, columns, box): | ||
# GH 26534 | ||
result = float_frame.copy() | ||
result[columns] = box(float_frame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not really sure what the point of this parametrization is - can you construct test(s) that don't overwrite the existing column data before the assignment? Seems unrelated unless I am missing something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I parametrize the tests as #26534 (comment) suggest. What do you propose?
@howsiwei have been away for a while, can you merge master and i'll look again. |
@howsiwei can you merge master and i'll have another look |
closing as stale but @howsiwei ping if you'd like to pick this backup |
@WillAyd I'll pick this back up in these few days. |
Sounds good. I don't have the option to reopen though - did you delete the branch on GitHub? If so unfortunately might need to push a new PR |
I suspect it's because I rebase the branch on latest master. |
I have created a new PR #29334. |
git diff upstream/master -u -- "*.py" | flake8 --diff
In particular, the following code now behaves correctly.
v0.22: error
master: column
'c'
is converted to index-1
, which causes last column to be overwritten.After this PR: