BUG: loc raising error for MultiIndex with bool indexer _and first level of object dtype_ #50165

toobaz · 2022-12-10T10:41:57Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

values = {
          False :
          {'A' : range(2), 'B' : range(2)},
          True :
          {'A' : range(2), 'B' : range(2)}
         }

values = pd.concat({k : pd.DataFrame(values[k]) for k in values})

values.loc[False]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-40-cca3b469b9a2> in <module>
----> 1 values.loc[False]

~/.local/lib/python3.9/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    965 
    966             maybe_callable = com.apply_if_callable(key, self.obj)
--> 967             return self._getitem_axis(maybe_callable, axis=axis)
    968 
    969     def _is_scalar_access(self, key: tuple):

~/.local/lib/python3.9/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1199 
   1200         # fall thru to straight lookup
-> 1201         self._validate_key(key, axis)
   1202         return self._get_label(key, axis=axis)
   1203 

~/.local/lib/python3.9/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1009             or self.obj._get_axis(axis).dtype.name == "boolean"
   1010         ):
-> 1011             raise KeyError(
   1012                 f"{key}: boolean label can not be used without a boolean index"
   1013             )

KeyError: 'False: boolean label can not be used without a boolean index'

Issue Description

Issue #47687 was fixed in #49766 by checking that if the index is a MultiIndex its first level must be of type bool.

However, this still seems inconsistent with other dtypes, and code paths - for instance values.loc[(False, ), :] works flawlessly with the example I am providing above.

I argue that the original sin is in

https://github.com/pandas-dev/pandas/pull/40726/files#diff-07282871a235456baecd3a55a43339b0da374ca270228b5530f5449aabdca2e1R968

That PR was meant to solve a specific issue, that is, that in some cases a boolean label is casted to int, and from there a series of unexpected behaviors occur.

But there is no reason on earth why Boolean labels shouldn't just follow the rules for labels of other dtypes, we just want to ensure that any undesired casting is avoided across any code path.

By the way, I checked that neither of the two problematic examples in #20432 work any more, even removing all code from _LocIndexer._validate_key().

So I think we should do it.

Thoughts @mroeschke ?

Expected Behavior

Same as values.loc[(False, ), :]

Installed Versions

INSTALLED VERSIONS

commit : cab77e3
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.18.0-0.deb11.4-amd64
Version : #1 SMP PREEMPT_DYNAMIC Debian 5.18.16-1~bpo11+1 (2022-08-12)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : it_IT.UTF-8
LOCALE : it_IT.UTF-8

pandas : 2.0.0.dev0+881.gcab77e3d6c.dirty
numpy : 1.23.3
pytz : 2021.1
dateutil : 2.8.2
setuptools : 52.0.0
pip : 20.3.4
Cython : 0.29.32
pytest : 6.0.2
hypothesis : 5.43.3
sphinx : 3.4.3
blosc : 1.9.2
feather : None
xlsxwriter : 1.1.2
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.6
jinja2 : 2.11.3
IPython : 7.20.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.4
brotli : 1.0.9
fastparquet : None
fsspec : 0.8.4
gcsfs : None
matplotlib : 3.3.4
numba : None
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pyreadstat : 1.1.0
pyxlsb : None
s3fs : None
scipy : 1.6.0
snappy : None
sqlalchemy : 1.3.22
tables : 3.6.1
tabulate : None
xarray : 0.16.1
xlrd : 2.0.1
zstandard : None
tzdata : None
qtpy : 1.9.0
pyqt5 : None

The text was updated successfully, but these errors were encountered:

mroeschke · 2022-12-12T18:56:36Z

This seems to work on main (e3143f6)?

In [1]: values = {
   ...:           False :
   ...:           {'A' : range(2), 'B' : range(2)},
   ...:           True :
   ...:           {'A' : range(2), 'B' : range(2)}
   ...:          }
   ...:
   ...: values = pd.concat({k : pd.DataFrame(values[k]) for k in values})
   ...:
   ...: values.loc[False]
Out[1]:
   A  B
0  0  0
1  1  1

In [2]: values.loc[(False, ), :]
Out[2]:
   A  B
0  0  0
1  1  1

In [3]: pd.__version__
Out[3]: '2.0.0.dev0+895.ge3143f6119'

I guess there should be a test where the other MultiIndex level isn't bool necessarily?

toobaz · 2022-12-12T23:18:44Z

This seems to work on main (e3143f6)?

Wow, that was quick!

... except that what changed is the DataFrame index level data type - not the bug itself, which (as of 5ee4dac ) still emerges with:

In [2]: values = {
   ...:           False :
   ...:           {'A' : range(2), 'B' : range(2)},
   ...:           42 :
   ...:           {'A' : range(2), 'B' : range(2)}
   ...:          }
   ...: 
   ...: values = pd.concat({k : pd.DataFrame(values[k]) for k in values})
   ...: 
   ...: values.loc[False]

So I still think getting rid of the content of _LocIndexer._validate_key() is the way to go.

mroeschke · 2022-12-13T01:33:16Z

Ah okay yeah I see the bug still exists when the inferred dtype of the object column is more strictly mixed

toobaz added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 10, 2022

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 12, 2022

mroeschke added Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex and removed good first issue Needs Tests Unit test(s) needed to prevent regressions labels Dec 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: loc raising error for MultiIndex with bool indexer _and first level of object dtype_ #50165

BUG: loc raising error for MultiIndex with bool indexer _and first level of object dtype_ #50165

toobaz commented Dec 10, 2022 •

edited

Loading

INSTALLED VERSIONS

mroeschke commented Dec 12, 2022

toobaz commented Dec 12, 2022 •

edited

Loading

mroeschke commented Dec 13, 2022

BUG: loc raising error for MultiIndex with bool indexer _and first level of object dtype_ #50165

BUG: loc raising error for MultiIndex with bool indexer _and first level of object dtype_ #50165

Comments

toobaz commented Dec 10, 2022 • edited Loading

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

mroeschke commented Dec 12, 2022

toobaz commented Dec 12, 2022 • edited Loading

mroeschke commented Dec 13, 2022

toobaz commented Dec 10, 2022 •

edited

Loading

toobaz commented Dec 12, 2022 •

edited

Loading