The case being considered here is when setting multiple columns into a DataFrame (using __setitem__, df[[..]] = ..), using a DataFrame right-hand-side value. So a simple, unambiguous example is:
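A minimal sketch of such a case, assuming two small hypothetical frames `df1` and `df2` with unique column names (the values are illustrative and not taken from the original report). With the behaviour discussed in this issue, the right-hand-side columns are assigned in positional order, so the names on the value are not matched against the key:

```python
import numpy as np
import pandas as pd

# Hypothetical frames, chosen only for illustration.
df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "c"])
df2 = pd.DataFrame(np.arange(6).reshape(3, 2) * 10, columns=["a", "b"])

# The value has its columns in the order ['b', 'a'].
value = df2[["b", "a"]]

# Plain setitem with a list key and a DataFrame value.
df1[["a", "b"]] = value

# If column names are ignored, df1['a'] now holds the values of df2['b']
# (the first column of the value) and df1['b'] those of df2['a'].
print(df1)
```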
I think this is "expected" behaviour, meaning that it seems to be intentional and long-standing. I personally find it surprising, though, especially because using loc instead of plain setitem, i.e. df1.loc[:, ['a', 'b']] = df2[['b', 'a']], does align the column names:
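For comparison, a small sketch of the loc variant, again assuming hypothetical frames `df1` and `df2` with illustrative values. Here the columns of the value are expected to be matched to the key by name rather than by position:

```python
import numpy as np
import pandas as pd

# Hypothetical frames, chosen only for illustration.
df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "c"])
df2 = pd.DataFrame(np.arange(6).reshape(3, 2) * 10, columns=["a", "b"])

# Same right-hand side as a plain setitem would receive, columns ordered ['b', 'a'].
df1.loc[:, ["a", "b"]] = df2[["b", "a"]]

# With alignment on column names, df1['a'] receives df2['a']
# and df1['b'] receives df2['b'], regardless of the order on the right.
print(df1)
```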
On the other hand, if I change the column names in df2 to also have duplicate columns, but in a different order, depending on the exact order you get an error or a "working" example:
>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'b'])
>>> df1[['a', 'b']] = df2
...
ValueError: Columns must be same length as key
>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'a'])
>>> df1[['a', 'b']] = df2
>>> df1
    a   b   b
0   0   2   4
1   6   8  10
2  12  14  16
And if the order of the column names matches exactly, the columns are set "correctly" as well:
So in general, in those examples, the column names do matter.
General questions:
Are we OK with __setitem__ (df[key] = value) with a DataFrame value ignoring the value's column names (i.e. not aligning key and value.columns)? And are we OK with this being different from .loc[]?
If we keep the current behaviour, should we set those columns by position instead of by column name, so that you don't get such inconsistent results with duplicate column names either?
(But how do we change this? It's a breaking change; maybe we should deprecate/disallow such setitem with duplicate column names?)
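To make the "set by position" option concrete, here is a rough sketch of what position-based setting could look like. The helper `set_columns_by_position` is hypothetical and not part of pandas; it also assumes rows are matched by position, without aligning on the index. The frame `df1` with columns a, b, b is an assumption, since its definition is not shown above.

```python
import numpy as np
import pandas as pd


def set_columns_by_position(df, key, value):
    """Hypothetical helper (not a pandas API): assign the columns of
    `value`, in order, to the columns of `df` selected by `key`."""
    # Integer positions in df of every column matched by each label in key;
    # a duplicated label contributes one position per occurrence.
    positions = np.concatenate([np.flatnonzero(df.columns == k) for k in key])
    if len(positions) != value.shape[1]:
        raise ValueError("Columns must be same length as key")
    for i, pos in enumerate(positions):
        # Assumption: rows are taken by position, the value's index is ignored.
        df.iloc[:, pos] = value.iloc[:, i].to_numpy()
    return df


# Assumed target frame with a duplicated column name.
df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "b"])

results = {}
for names in (["b", "a", "b"], ["b", "a", "a"], ["a", "b", "b"]):
    df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=names)
    out = set_columns_by_position(df1.copy(), ["a", "b"], df2)
    results[tuple(names)] = out.to_numpy().tolist()

# With purely positional setting, all three orderings of df2's column
# names give the same result.
print(results)
```

Under this approach the names on the right-hand side no longer influence the outcome, which removes the inconsistency but would still be a behaviour change for existing code.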
Note that with plain setitem the multiple columns are set column-by-column in order, ignoring potentially misaligned column names.
I didn't directly find an issue about this, only a PR that touched the code handling this in the case of duplicate columns (#39403), and a comment at https://github.com/pandas-dev/pandas/pull/39341/files#r563895152 about column names being irrelevant for setitem (cc @phofl @jbrockmendel).
But because we ignore alignment of column names and then do the setting by name rather than by position (pandas/pandas/core/frame.py, lines 3747 to 3750 in dd6869f), you get inconsistent results with duplicate column names. For example, there is a case where the second column of df2 is set to both "b" columns of df1.