DataFrame.loc[] prioritizes columns when setting with missing label #19110

Open
toobaz opened this issue Jan 6, 2018 · 5 comments
Labels
Bug, Indexing (related to indexing on series/frames, not to indexes themselves), MultiIndex

Comments

@toobaz
Member

toobaz commented Jan 6, 2018

Code Sample, a copy-pastable example if possible

In [1]: import numpy as np; import pandas as pd

In [2]: df = pd.DataFrame(np.arange(16).reshape(4, 4), index=pd.MultiIndex.from_product([[1, 2], ['a', 'b']]), columns=['a', 'b', 'c', 'd'])

In [3]: df.loc[2, 'a'] # select a row: good
Out[3]: 
a     8
b     9
c    10
d    11
Name: (2, a), dtype: int64

In [4]: df.loc[2, 'c'] # select a (part of) col: guessing game, but I understand it is a feature
Out[4]: 
a    10
b    14
Name: c, dtype: int64

In [5]: df.loc[2, 'e'] = -1 # now there is no column: add a row?

In [6]: df # ... nope, still adds a column
Out[6]: 
      a   b   c   d    e
1 a   0   1   2   3  NaN
  b   4   5   6   7  NaN
2 a   8   9  10  11 -1.0
  b  12  13  14  15 -1.0

In [7]: df.loc[3, 'f'] = -2 # what if the row label is entirely missing?

In [8]: df # still adds a column, and a row too
Out[8]: 
        a     b     c     d    e    f
1 a   0.0   1.0   2.0   3.0  NaN  NaN
  b   4.0   5.0   6.0   7.0  NaN  NaN
2 a   8.0   9.0  10.0  11.0 -1.0  NaN
  b  12.0  13.0  14.0  15.0 -1.0  NaN
3     NaN   NaN   NaN   NaN  NaN -2.0

Problem description

In general, if df.index is a MultiIndex, pandas interprets the syntax df.loc[a, b] as df.loc[(a,b),:].
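A minimal sketch of why this interpretation is forced on pandas in the first place: Python hands `df.loc[a, b]` and `df.loc[(a, b)]` to `__getitem__` as the same tuple, so the library has to decide what the pair means. `KeyEcho` is a hypothetical helper introduced only for this illustration; the frame is the one from the code sample above.

```python
import numpy as np
import pandas as pd


class KeyEcho:
    """Hypothetical helper: returns whatever key it is indexed with."""

    def __getitem__(self, key):
        return key


# Both spellings reach __getitem__ as one and the same tuple.
echo = KeyEcho()
same_key = echo[2, "a"] == echo[(2, "a")]

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=pd.MultiIndex.from_product([[1, 2], ["a", "b"]]),
    columns=["a", "b", "c", "d"],
)

# When the pair is an existing row key, it is read as a row selection.
implicit = df.loc[2, "a"]
explicit = df.loc[(2, "a"), :]
rows_match = implicit.equals(explicit)
```

Writing the row key as a full tuple followed by an explicit column slice, as in `explicit`, is the spelling that leaves nothing for pandas to guess.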

Out[4]: is (debatable, but) understandable: in the absence of the desired row, and in the presence of a column with that name, pandas interprets the key as df.loc[(a,), b].
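The fallback can be checked against an explicit two-step selection. This is only a sketch on the frame from the code sample: `DataFrame.xs` is used here as one unambiguous way to take the partial row key first and the column second, and is not necessarily what pandas does internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=pd.MultiIndex.from_product([[1, 2], ["a", "b"]]),
    columns=["a", "b", "c", "d"],
)

# (2, "c") is not a row, but "c" is a column: the pair is read as
# "rows whose first level is 2, column c".
fallback = df.loc[2, "c"]

# The same selection spelled out in two explicit steps.
two_step = df.xs(2, level=0)["c"]
selections_match = fallback.equals(two_step)
```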

However, there is no reason why In [5]: and In [7]: should add a column: since priority goes to the index when the labels are present, the same should happen when they are absent.
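Until the behavior is settled, a row can be requested without relying on any guess. The sketch below assumes that setting with enlargement through a full-length tuple key plus an explicit column slice is supported by the installed pandas release; it is a workaround, not a fix for the behavior reported here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=pd.MultiIndex.from_product([[1, 2], ["a", "b"]]),
    columns=["a", "b", "c", "d"],
)

# A full-length tuple on the row axis and an explicit column slice:
# (2, "e") can only be a row key in this spelling.
df.loc[(2, "e"), :] = -1

row_added = (2, "e") in df.index
columns_unchanged = list(df.columns) == ["a", "b", "c", "d"]
```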

Somewhat related to #17024 .

Expected Output

In [8]: df
Out[8]: 
        a     b     c     d
1 a   0.0   1.0   2.0   3.0
  b   4.0   5.0   6.0   7.0
2 a   8.0   9.0  10.0  11.0
  b  12.0  13.0  14.0  15.0
  e  -1.0  -1.0  -1.0  -1.0
3 f  -2.0  -2.0  -2.0  -2.0
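For reference, the expected frame can be built without going through `.loc` at all. This sketch uses `pd.concat` to append the two rows; the cast to float is an assumption made only to mirror the dtype shown in the expected output.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=pd.MultiIndex.from_product([[1, 2], ["a", "b"]]),
    columns=["a", "b", "c", "d"],
)

# The two rows that In [5] and In [7] are expected to add.
new_rows = pd.DataFrame(
    [[-1.0] * 4, [-2.0] * 4],
    index=pd.MultiIndex.from_tuples([(2, "e"), (3, "f")]),
    columns=df.columns,
)

# Appending rows never touches the columns, so there is no ambiguity.
expected = pd.concat([df.astype(float), new_rows])
```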

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.23.0.dev0+42.g93033151a
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

@jreback
Contributor

jreback commented Jan 6, 2018

I think we should refuse to guess in these cases, raising an ambiguity error.
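A rough sketch of what "refuse to guess" could look like from user code. Both `AmbiguousKeyError` and `checked_set` are hypothetical names; pandas defines neither, and the rule used here (raise only when the pair names neither an existing row nor an existing column) is one possible reading of the suggestion.

```python
import numpy as np
import pandas as pd


class AmbiguousKeyError(KeyError):
    """Hypothetical error type; pandas defines no such exception."""


def checked_set(df, first, second, value):
    """Set through df.loc only when the pair already names a row or a column."""
    is_row = (first, second) in df.index
    is_column = second in df.columns
    if not (is_row or is_column):
        raise AmbiguousKeyError(
            f"({first!r}, {second!r}) is neither an existing row nor a column"
        )
    if is_row:
        df.loc[(first, second), :] = value  # row reading, as in In [3]
    else:
        df.loc[first, second] = value  # column reading, as in In [4]


df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=pd.MultiIndex.from_product([[1, 2], ["a", "b"]]),
    columns=["a", "b", "c", "d"],
)

# The assignment from In [5] is rejected instead of silently adding a column.
try:
    checked_set(df, 2, "e", -1)
    refused = False
except AmbiguousKeyError:
    refused = True
```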

@toobaz
Member Author

toobaz commented Jan 6, 2018

You mean in all cases (In [3]: too)? That's gonna break a lot of code, and I think it would be a bit of a waste to make the df.loc[2, 'a'] syntax entirely unusable when df.index is a MultiIndex.

If instead you mean only from In [4]: onwards, that might still break quite a bit of code, but I agree it's worth investigating. Although then I would find it more consistent (with In [3]:) to just interpret as df.loc[(2, 'a'), :], but I understand that, for backward compatibility, changing behavior is even worse than disabling it.

@toobaz
Member Author

toobaz commented Jan 6, 2018

(interpreting df.loc[2, 'a'] as df.loc[(2, 'a'), :] is the original sin, but I guess it's far too late)

@jreback
Contributor

jreback commented Jan 6, 2018

yes [3] is prob ok, maybe [4] as well. But I am mainly talking about setting.

@jreback jreback added API Design Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Difficulty Advanced labels Jan 6, 2018
@jreback jreback added this to the Next Major Release milestone Jan 6, 2018
@toobaz
Member Author

toobaz commented Jan 8, 2018

OK, I guess the only feasible solution now is to change In [5]: and In [7]: so that they add a row, and leave the rest (In [3]:, In [4]:) unchanged.
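The proposed rule can be sketched as a small wrapper. `set_row_first` is a hypothetical helper, not part of pandas, and it assumes that enlargement through a full tuple key works on the installed release: existing rows and existing columns keep their current reading, while a pair with no matching column is treated as a row key.

```python
import numpy as np
import pandas as pd


def set_row_first(df, first, second, value):
    """Hypothetical helper: prefer the row reading whenever no column matches."""
    if (first, second) in df.index or second not in df.columns:
        # Existing row, or a label with no matching column: treat as a row key.
        df.loc[(first, second), :] = value
    else:
        # Existing column and no such row: keep the In [4] reading.
        df.loc[first, second] = value


df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=pd.MultiIndex.from_product([[1, 2], ["a", "b"]]),
    columns=["a", "b", "c", "d"],
)

set_row_first(df, 2, "e", -1)  # the assignment from In [5]
set_row_first(df, 3, "f", -2)  # the assignment from In [7]
```

With these two calls the frame gains the rows (2, e) and (3, f) and keeps its four original columns, which is the shape given under Expected Output.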
