DataFrame.loc[] prioritizes columns when setting with missing label #19110

toobaz · 2018-01-06T20:47:43Z

Code Sample, a copy-pastable example if possible

In [2]: df = pd.DataFrame(np.arange(16).reshape(4, 4), index=pd.MultiIndex.from_product([[1, 2], ['a', 'b']]), columns=['a', 'b', 'c', 'd'])

In [3]: df.loc[2, 'a'] # select a row: good
Out[3]: 
a     8
b     9
c    10
d    11
Name: (2, a), dtype: int64

In [4]: df.loc[2, 'c'] # select a (part of) col: guessing game, but I understand it is a feature
Out[4]: 
a    10
b    14
Name: c, dtype: int64

In [5]: df.loc[2, 'e'] = -1 # now there is no column: add a row?

In [6]: df # ... nope, still adds a column
Out[6]: 
      a   b   c   d    e
1 a   0   1   2   3  NaN
  b   4   5   6   7  NaN
2 a   8   9  10  11 -1.0
  b  12  13  14  15 -1.0

In [7]: df.loc[3, 'f'] = -2 # what if the row label is entirely missing?

In [8]: df # still adds a row _and_ a col
Out[8]: 
        a     b     c     d    e    f
1 a   0.0   1.0   2.0   3.0  NaN  NaN
  b   4.0   5.0   6.0   7.0  NaN  NaN
2 a   8.0   9.0  10.0  11.0 -1.0  NaN
  b  12.0  13.0  14.0  15.0 -1.0  NaN
3     NaN   NaN   NaN   NaN  NaN -2.0

Problem description

In general, if df.index is a MultiIndex, pandas interprets the syntax df.loc[a, b] as df.loc[(a,b),:].

Out[4]: is debatable but understandable: in the absence of the requested row, and in the presence of a column with the same name, pandas interprets the call as df.loc[(a,), b].

However, there is no reason why In [5]: and In [7]: should add a column: since the index takes priority when the labels are present, it should also take priority when they are absent.
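For reference, the two readings described above can be spelled out so that no guessing is involved. This is a minimal sketch on the same frame as in the code sample; the variable names `row` and `col_part` are only illustrative, and `pd.IndexSlice` is used here as one way of writing the row slice explicitly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=pd.MultiIndex.from_product([[1, 2], ["a", "b"]]),
    columns=["a", "b", "c", "d"],
)

# Full row key as a tuple plus an explicit column slice:
# this is the row (2, 'a'), whatever the column names are.
row = df.loc[(2, "a"), :]

# Explicit row slice on the first level plus a column label:
# this is column 'c' restricted to the rows whose first level is 2.
col_part = df.loc[pd.IndexSlice[2, :], "c"]

print(row)
print(col_part)
```

Written this way, the result does not depend on whether a label happens to exist in the second index level or among the columns.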

Somewhat related to #17024 .

Expected Output

In [8]: df
Out[8]: 
        a     b     c     d
1 a   0.0   1.0   2.0   3.0
  b   4.0   5.0   6.0   7.0
2 a   8.0   9.0  10.0  11.0
  b  12.0  13.0  14.0  15.0
  e  -1.0  -1.0  -1.0  -1.0
3 f  -2.0  -2.0  -2.0  -2.0

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.23.0.dev0+42.g93033151a
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1


jreback · 2018-01-06T20:55:38Z

I think we should refuse to guess in these cases and raise an ambiguity error.

toobaz · 2018-01-06T21:06:12Z

You mean in all cases (In [3]: too)? That's gonna break a lot of code, and I think it's a bit of a waste to make the df.loc[2, 'a'] syntax entirely unusable when df.index is a MultiIndex.

If instead you mean only from In [4]: onwards, that might still break quite a bit of code, but I agree it's worth investigating. Although then I would find it more consistent (with In [3]:) to just interpret as df.loc[(2, 'a'), :], but I understand that, for backward compatibility, changing behavior is even worse than disabling it.

toobaz · 2018-01-06T21:07:44Z

(interpreting df.loc[2, 'a'] as df.loc[(2, 'a'), :] is the original sin, but I guess it's far too late)

jreback · 2018-01-06T21:10:09Z

Yes, [3] is probably OK, maybe [4] as well. But I am mainly talking about setting.

toobaz · 2018-01-08T11:47:12Z

OK, I guess the only feasible solution now is to change In [5]: and In [7]: so that they add a row, and leave the rest (In [3]:, In [4]:) unchanged.
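Until the behavior changes, a row can be requested explicitly by passing the full row key as a tuple together with a column slice. The following is a sketch of that workaround on the frame from the code sample; it assumes a pandas version in which setting with enlargement accepts a full-length MultiIndex key, which has not been checked against the development version reported in this issue.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=pd.MultiIndex.from_product([[1, 2], ["a", "b"]]),
    columns=["a", "b", "c", "d"],
)

# The tuple names a complete row key and the slice selects all columns,
# so the assignment enlarges the index rather than the columns.
df.loc[(2, "e"), :] = -1

print(df)
```

The new row is appended at the end of the frame, and no column 'e' is created.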

jreback added the API Design, Indexing, MultiIndex and Difficulty Advanced labels Jan 6, 2018

jreback added this to the Next Major Release milestone Jan 6, 2018

toobaz mentioned this issue Feb 2, 2018

DOC: improve docs to clarify MultiIndex indexing #19507

Merged

toobaz mentioned this issue May 4, 2018

Addressing multiindex raises TypeError if indices that are rightmost are not present #20951

Closed

toobaz mentioned this issue May 18, 2018

Proposal: Deprecating support of incomplete indexing on MultiIndexes #10574

Closed

toobaz mentioned this issue Jul 8, 2019

DataFrame.__setitem__ with MultiIndex fails when expanding with new key #27248

Open

jbrockmendel removed Effort Low labels Oct 21, 2019

mroeschke added Bug and removed API Design labels Jun 12, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
