
BUG: Unexpected KeyError message when using .loc with MultiIndex in a possible edge-case #51892


Closed
3 tasks done
arkal opened this issue Mar 11, 2023 · 3 comments · Fixed by #52194

arkal commented Mar 11, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

This code raises a KeyError with an incorrect message (i.e. the key reported is not the key that was looked up):

import pandas as pd

df = pd.DataFrame({(1,2): ['a', 'b', 'c'],
                   (1,3): ['d', 'e', 'f'],
                   (2,2): ['g', 'h', 'i'],
                   (2,4): ['j', 'k', 'l']})
print(df)

# (1,4) is invalid as a pair but 
#      1 is a valid primary key with [2,3] as secondary keys 
# and  4 is a valid secondary key for the primary key 2
df.loc[0, (1, 4)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexing.py", line 1101, in __getitem__
    return self._getitem_tuple(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexing.py", line 1284, in _getitem_tuple
    return self._getitem_lowerdim(tup)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexing.py", line 980, in _getitem_lowerdim
    return self._getitem_nested_tuple(tup)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexing.py", line 1081, in _getitem_nested_tuple
    obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexing.py", line 1347, in _getitem_axis
    return self._get_label(key, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexing.py", line 1297, in _get_label
    return self.obj.xs(label, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/generic.py", line 4069, in xs
    return self[key]
           ~~~~^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/frame.py", line 3739, in __getitem__
    return self._getitem_multilevel(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/frame.py", line 3794, in _getitem_multilevel
    loc = self.columns.get_loc(key)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexes/multi.py", line 2825, in get_loc
    return self._engine.get_loc(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 832, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
  File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 2152, in pandas._libs.hashtable.UInt64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 2176, in pandas._libs.hashtable.UInt64HashTable.get_item
KeyError: 20

FWIW, this is a similarly invalid key, but notice the different integer in the KeyError:

df.loc[0, (2, 3)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexing.py", line 1101, in __getitem__
    return self._getitem_tuple(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexing.py", line 1284, in _getitem_tuple
    return self._getitem_lowerdim(tup)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexing.py", line 980, in _getitem_lowerdim
    return self._getitem_nested_tuple(tup)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexing.py", line 1081, in _getitem_nested_tuple
    obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexing.py", line 1347, in _getitem_axis
    return self._get_label(key, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexing.py", line 1297, in _get_label
    return self.obj.xs(label, axis=axis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/generic.py", line 4069, in xs
    return self[key]
           ~~~~^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/frame.py", line 3739, in __getitem__
    return self._getitem_multilevel(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/frame.py", line 3794, in _getitem_multilevel
    loc = self.columns.get_loc(key)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pandas/core/indexes/multi.py", line 2825, in get_loc
    return self._engine.get_loc(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pandas/_libs/index.pyx", line 832, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
  File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 2152, in pandas._libs.hashtable.UInt64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 2176, in pandas._libs.hashtable.UInt64HashTable.get_item
KeyError: 27


### Issue Description

In a DataFrame with a MultiIndex, `pd.DataFrame.loc` with a partially or completely incorrect key raises a KeyError and reports the missing key in the error message. However, if the key pair `(kouter_1, kinner_1)` is not valid but `(kouter_2, kinner_1)` is (i.e. the secondary key exists in the union of all secondary keys but is not a valid partner for that primary key), the expected KeyError is raised, but its message contains an unhelpful, seemingly random integer instead of the key being looked up. Interestingly, the integer isn't the same across machines or versions, although on the same machine and version the value is consistent. I've tested this with 1.3.4, 1.5.0, 1.5.3, and 2.0.0rc0.
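To make the edge case concrete, here is a minimal sketch that classifies a full-depth key against the columns of the example DataFrame. The helper name `describe_key` is hypothetical and not part of pandas; it only uses `in` checks on the MultiIndex and on its `levels`.

```python
import pandas as pd

df = pd.DataFrame({(1, 2): ['a', 'b', 'c'],
                   (1, 3): ['d', 'e', 'f'],
                   (2, 2): ['g', 'h', 'i'],
                   (2, 4): ['j', 'k', 'l']})


def describe_key(columns, key):
    """Hypothetical helper: say why a full-depth key is or is not in a MultiIndex."""
    if key in columns:
        return "valid"
    # Labels that do not appear anywhere in their own level.
    missing = [v for lev, v in zip(columns.levels, key) if v not in lev]
    if missing:
        return f"labels missing from their level: {missing}"
    # Every label exists somewhere, just not together: the case in this issue.
    return "every label exists in its level, but not in this combination"


print(describe_key(df.columns, (1, 2)))
print(describe_key(df.columns, (1, 4)))
print(describe_key(df.columns, (1, 5)))
```

Keys such as `(1, 4)` and `(2, 3)` fall into the last branch, which is the situation where the integer KeyError was reported.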

### Expected Behavior

The call to `df.loc[0, (1, 4)]` above should raise the following KeyError:

KeyError: (1, 4)


### Installed Versions

/usr/local/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit           : 1a2e300170efc08cb509a0b4ff6248f8d55ae777
python           : 3.11.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.15.0-1030-aws
Version          : #34~20.04.1-Ubuntu SMP Tue Jan 24 15:16:46 UTC 2023
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.0rc0
numpy            : 1.24.2
pytz             : 2022.7.1
dateutil         : 2.8.2
setuptools       : 65.5.1
pip              : 22.3.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : None
qtpy             : None
pyqt5            : None
@arkal arkal added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 11, 2023
@jayendra-patil33
Contributor

take

@jayendra-patil33
Contributor

jayendra-patil33 commented Mar 11, 2023

I am close to finding the root cause of this issue; I will update here in a while.

@jayendra-patil33
Contributor

import pandas as pd
df = pd.DataFrame({(1,2): ['a', 'b', 'c'],
                   (1,3): ['d', 'e', 'f'],
                   (2,2): ['g', 'h', 'i'],
                   (2,4): ['j', 'k', 'l']})

mi = df.axes[1]
print(mi.get_loc((2, 3)))

Internally, get_loc() gets called on the column MultiIndex, and calling it directly produces the same traceback:

Traceback (most recent call last):
  File "/home/user/code/pandas/test.py", line 8, in <module>
    print(mi.get_loc((2, 3)))
  File "/home/user/code/pandas/pandas/core/indexes/multi.py", line 2812, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 842, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
    return self._base.get_loc(self, lab_int)
  File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
    cpdef get_loc(self, object val):
  File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
    return self.mapping.get_item(val)
  File "pandas/_libs/hashtable_class_helper.pxi", line 2152, in pandas._libs.hashtable.UInt64HashTable.get_item
    cpdef get_item(self, uint64_t val):
  File "pandas/_libs/hashtable_class_helper.pxi", line 2176, in pandas._libs.hashtable.UInt64HashTable.get_item
    raise KeyError(val)
KeyError: 27
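The integer does not look arbitrary. The `lab_int` name in the traceback suggests that the engine looks up a single integer built from the key. The sketch below is an assumption about that packing, not code taken from pandas: each label's position within its level is offset by 2 and stored in a fixed-width bit field. The helper name `pack_key` and the `nulls_shift` parameter are hypothetical. Under that assumption it reproduces both values reported in this issue.

```python
import math


def pack_key(levels, key, nulls_shift=2):
    """Hypothetical sketch: pack per-level positions of a key into one integer."""
    # Bits reserved per level: enough for every position plus the offset.
    sizes = [math.ceil(math.log2(len(lev) + nulls_shift)) for lev in levels]
    packed = 0
    for lev, size, value in zip(levels, sizes, key):
        # Shift earlier levels left and append this level's offset position.
        packed = (packed << size) | (lev.index(value) + nulls_shift)
    return packed


# Sorted unique labels per level for the example columns
# (1, 2), (1, 3), (2, 2), (2, 4).
levels = [[1, 2], [2, 3, 4]]

print(pack_key(levels, (1, 4)))  # 20
print(pack_key(levels, (2, 3)))  # 27
```

If this reading is right, a different offset or level layout would give a different integer for the same key, which may be why the value was seen to vary between versions.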

Now, if we call get_loc_level() on the same object with the same arguments, we get this output:

Traceback (most recent call last):
  File "/home/user/code/pandas/pandas/core/indexes/multi.py", line 2978, in _get_loc_level
    return (self._engine.get_loc(key), None)
  File "pandas/_libs/index.pyx", line 842, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
    return self._base.get_loc(self, lab_int)
  File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
    cpdef get_loc(self, object val):
  File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
    return self.mapping.get_item(val)
  File "pandas/_libs/hashtable_class_helper.pxi", line 2152, in pandas._libs.hashtable.UInt64HashTable.get_item
    cpdef get_item(self, uint64_t val):
  File "pandas/_libs/hashtable_class_helper.pxi", line 2176, in pandas._libs.hashtable.UInt64HashTable.get_item
    raise KeyError(val)
KeyError: 27

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/code/pandas/test.py", line 8, in <module>
    print(mi.get_loc_level((2, 3)))
  File "/home/user/code/pandas/pandas/core/indexes/multi.py", line 2909, in get_loc_level
    loc, mi = self._get_loc_level(key, level=level)
  File "/home/user/code/pandas/pandas/core/indexes/multi.py", line 2980, in _get_loc_level
    raise KeyError(key) from err
KeyError: (2, 3)

Both of these methods internally call self._engine.get_loc(key):

# MultiIndex.get_loc
try:
    return self._engine.get_loc(key)
except TypeError:
    # e.g. test_partial_slicing_with_multiindex partial string slicing
    loc, _ = self.get_loc_level(key, list(range(self.nlevels)))
    return loc

# MultiIndex._get_loc_level
try:
    return (self._engine.get_loc(key), None)
except KeyError as err:
    raise KeyError(key) from err
except TypeError:
    # e.g. partial string indexing
    # test_partial_string_timestamp_multiindex
    pass

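The difference is that the second method handles KeyError in an except block that re-raises it with the original key, while the first only handles TypeError, so the engine's integer KeyError escapes unchanged.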
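On affected versions, the same re-raise pattern can be applied from the caller's side. This is a minimal user-level sketch, not the patch that closed this issue; the wrapper name `get_loc_with_key` is hypothetical, and it relies only on the public `MultiIndex.get_loc`.

```python
import pandas as pd


def get_loc_with_key(index, key):
    """Hypothetical wrapper: look up key, reporting the key itself on failure."""
    try:
        return index.get_loc(key)
    except KeyError as err:
        # Replace whatever the engine reported with the key the caller passed.
        raise KeyError(key) from err


df = pd.DataFrame({(1, 2): ['a', 'b', 'c'],
                   (1, 3): ['d', 'e', 'f'],
                   (2, 2): ['g', 'h', 'i'],
                   (2, 4): ['j', 'k', 'l']})

try:
    get_loc_with_key(df.columns, (1, 4))
except KeyError as err:
    reported = err.args[0]
    print(reported)
```

Chaining with `from err` keeps the engine's original exception visible in the traceback while the final message names the key that was requested.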
