
BUG: Index and MultiIndex KeyError cases and discussion #39775


Open
attack68 opened this issue Feb 12, 2021 · 1 comment
Labels
API Design Bug Error Reporting Incorrect or improved errors from pandas Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@attack68
Contributor

Since the introduction of KeyError for missing keys in an index there have been quite a few use cases from different issues. I will try and link some of the issues if I see them.

My view is that raising KeyError for an Index is fine, but MultiIndexes should be treated differently: you cannot always raise a KeyError for a single key in a MultiIndex slice, since a MultiIndex cannot always be reindexed.

Index

indexes = [
    pd.Index(['a','b','c','e','d'], name='Unique Non-Monotonic'),
    pd.Index(['a','b','c','e','d'], name='Unique Monotonic').sort_values(),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Non-Monotonic'),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Monotonic').sort_values(),
]

[Image: Screen Shot 2021-02-12 at 13 23 16, results table for the Index cases]

Code generator
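import numpy as np
import pandas as pd

ix = pd.IndexSlice  # shorthand used inside the generated command strings

ret = None
def do(command):
    # Evaluate the command string against the global `s` and report
    # either 'KeyError' or the type of the returned object.
    try:
        exec(f'global ret; ret={command}', globals())
    except KeyError:
        return 'KeyError'
    else:
        if isinstance(ret, (np.int64)):
            return 'int64'
        elif isinstance(ret, (pd.Series)):
            return 'Series'
        elif isinstance(ret, (pd.DataFrame)):
            return 'DataFrame'
        return 'OtherType'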

cases = [
"'a'",           # single valid key
"'!'",           # single invalid key
"['a']",         # single valid key as pseudo multiple valid keys
"['!']",         # single invalid key as pseudo multiple valid keys
"['a','e']",     # multiple valid keys
"['a','!']",     # at least one invalid key
"'a':'e'",       # valid key slice
"'a':'!'",       # at least one invalid slice key
"'!':",          # at least one invalid slice key
"'b'",           # single valid non-unique key
"['b']",         # single valid non-unique key as pseudo multiple keys
"'b':'d'",       # slice with non-unique key
]

base = [
    [f's.loc[{case}]',        # use regular s.loc[]
     f's.loc[ix[{case}]]']    # and with index slice as comparison s.loc[ix[{}]]
    for case in cases
]
commands = [
    command for sublist in base for command in sublist
]

indexes = [
    pd.Index(['a','b','c','e','d'], name='Unique Non-Monotonic'),
    pd.Index(['a','b','c','e','d'], name='Unique Monotonic').sort_values(),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Non-Monotonic'),
    pd.Index(['a','b','b','e','d'], name='Non-Unique Monotonic').sort_values(),
]

results = pd.DataFrame('', index=commands, columns=['Uq Non-Mono', 'Uq Mono', 'Non-Uq Non-Mono', 'Non-Uq Mono'])
for j, index in enumerate(indexes):
    s = pd.Series([1,2,3,4,5], index=index)
    for i, command in enumerate(commands):
        results.iloc[i, j] = do(command)

This seems to be pretty consistent. The only inconsistency is perhaps highlighted in red, and a minor niggle for dynamic coding might be the different return types in the case of non-unique indexes.
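The return-type difference mentioned above can be seen in a minimal sketch. The variable names here are illustrative and not part of the original code generator; the two series reuse the unique and non-unique label sets from the indexes above.

```python
import numpy as np
import pandas as pd

# Same labels as the unique and non-unique indexes in the issue
unique = pd.Series([1, 2, 3, 4, 5], index=pd.Index(['a', 'b', 'c', 'e', 'd']))
non_unique = pd.Series([1, 2, 3, 4, 5], index=pd.Index(['a', 'b', 'b', 'e', 'd']))

# A scalar key returns a scalar when the label occurs once ...
scalar_result = unique.loc['b']
# ... but a Series when the label occurs more than once
series_result = non_unique.loc['b']

# Wrapping the key in a list gives a Series in both cases,
# which may be easier to rely on in dynamic code
listed_unique = unique.loc[['b']]
listed_non_unique = non_unique.loc[['b']]
```

If a stable return type matters, passing the key as a one-element list looks like the simplest way to get it.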

Obviously the solution to dealing with any case where you need to index by pre-defined levels that may have been filtered is to reindex with your pre-defined keys. And this is quite easy to do in RAM.
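As a rough sketch of that reindexing workaround (the filter and the `expected_keys` list are made up for illustration), a key dropped by filtering raises with `.loc` but comes back as NaN after `reindex`:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=pd.Index(['a', 'b', 'c', 'e', 'd']))

# Hypothetical filtering step that removes 'a' and 'b'
filtered = s[s > 2]

# Hypothetical list of keys the calling code expects to exist
expected_keys = ['a', 'c', 'e']

try:
    filtered.loc[expected_keys]
    raised = False
except KeyError:
    # 'a' is no longer in the index, so list-like .loc raises
    raised = True

# Reindexing on the pre-defined keys fills the missing label with NaN
safe = filtered.reindex(expected_keys)
```

Note that `reindex` itself requires a unique index, so this only covers the unique cases in the table.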

MultiIndex

indexes = [
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]),
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]).sortlevel()[0],
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]),
    pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]).sortlevel()[0],
]

MultiIndexing is different. You cannot always reindex for one of two reasons:

  • The number of possible combinations of the index level values can exceed available RAM, and building them is computationally slow.
  • If you add a value or set of values to a single MultiIndex level, it is ambiguous which tuples should be created, and expanding to all combinations leads to the problem above.

For example, consider the MultiIndex levels: (a,b), (x,y,z). There are a maximum of 6 index tuples, but in practice one works with indexes containing far fewer than the maximum number of combinations (since the combinations scale exponentially with the number of levels). Your MultiIndex might thus be [(a,x), (a,z), (b,x), (b,y)].
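A small sketch of that example, with illustrative variable names: the sparse index holds 4 tuples, while reindexing onto the full product of the levels materialises all 6, including rows that never existed.

```python
import pandas as pd

# The sparse MultiIndex from the example above
sparse = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'z'), ('b', 'x'), ('b', 'y')])
s = pd.Series([1, 2, 3, 4], index=sparse)

# Every combination of the level values (a,b) x (x,y,z)
full = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y', 'z']])

# Reindexing onto the full product creates NaN rows for the missing tuples
dense = s.reindex(full)
n_missing = int(dense.isna().sum())
```

With two small levels this is harmless, but the size of `full` is the product of the level sizes, so it grows quickly as levels are added.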

I think you need to be able to index MultiIndexes with keys that are missing. As a rule, I would suggest that indexers which are iterables (list-likes) do not yield KeyErrors. Here is a summary of some of the observations from the results below for current behaviour:

[a, y] : KeyError
[a, [y]] : KeyError but should return empty (a in level0)
[[a], y] : KeyError but should return empty (y in level1)
[[a], [y]] : KeyError but should return empty 
[a, !] : KeyError 
[a, [!]] : returns empty
[[a], !] : KeyError (maybe OK since ! not in level1)
[[!], x] : returns empty (x in level1)
[[!], [!]] : returns empty
[!, !] : KeyError
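To illustrate the behaviour suggested for list-like keys, here is a minimal sketch using boolean masks. `select_lenient` is a hypothetical helper, not a pandas API and not a proposed implementation; it only shows what "return empty instead of KeyError" would look like for a two-level index.

```python
import pandas as pd

def select_lenient(s, level0_keys, level1_keys):
    """Hypothetical helper: select by list-like keys on two levels,
    returning an empty result instead of raising when nothing matches."""
    mask = (
        s.index.get_level_values(0).isin(level0_keys)
        & s.index.get_level_values(1).isin(level1_keys)
    )
    return s[mask]

index = pd.MultiIndex.from_tuples(
    [('b', 'x'), ('b', 'y'), ('b', 'z'), ('a', 'x'), ('a', 'z')]
)
s = pd.Series([1, 2, 3, 4, 5], index=index)

# [[a], [y]]: both keys exist in their levels, but the tuple (a, y) does not
empty = select_lenient(s, ['a'], ['y'])

# A missing key mixed with valid ones is simply ignored
found = select_lenient(s, ['a', '!'], ['x', 'z'])
```

This sidesteps `.loc` entirely, so it says nothing about how `.loc` should resolve these cases internally.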

[Image: multiindex_slice]

Code generator
cases_level0 = [
  "'a'",         # single valid key on level0
  "'!'",         # single invalid key on level0
  "['a']",       # single valid key on level0 as pseudo multiple valid keys
  "['!']",       # single invalid key on level0 as pseudo multiple valid keys
  "['a', 'b']",  # multiple valid key on level0
  "['a', '!']",  # at least one invalid key on level0
  "'a':'b'",     # valid level0 index slice
  "'a':'!'",     # invalid level0 index slice
  "'!':",        # fully invalid level0 index slice
]

comments_level0 = [
'0: valid single, ',
'0: invalid single, ',
'0: valid single as multiple, ',
'0: invalid single as multiple, ',
'0: multiple valid, ',
'0: one invalid in multiple, ',
'0: valid slice, ',
'0: semi-invalid slice, ',
'0: invalid slice, ',
]

base = [
  [f's.loc[{case}]', f's.loc[ix[{case}, :]]']  for case in cases_level0
]
commands = [
  command for sublist in base for command in sublist
]

indexes = [
  pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]),
  pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'z'), ('a','x'), ('a', 'z')]).sortlevel()[0],
  pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]),
  pd.MultiIndex.from_tuples([('b','x'), ('b', 'y'), ('b', 'x'), ('a','x'), ('a', 'z')]).sortlevel()[0],
]

results = pd.DataFrame('', index=commands, columns=['Uq Non-Mono', 'Uq Mono', 'Non-Uq Non-Mono', 'Non-Uq Mono'])
for j, index in enumerate(indexes):
  s = pd.Series([1,2,3,4,5], index=index)
  for i, command in enumerate(commands):
      results.iloc[i,j] = do(command)

base = [
  [com, com]  for com in comments_level0
]
comments = [
  com for sublist in base for com in sublist
]
      
results['Comment'] = comments    
results.style

cases_level1 = [
  "'x'",         # single valid key on level1
  "'y'",         # single sometimes-valid key on level1
  "'!'",         # single invalid key on level1
  "['x']",       # single valid key on level1 as pseudo multiple valid keys
  "['y']",       # single sometimes-valid key on level1 as pseudo multiple valid keys
  "['!']",       # single invalid key on level1 as pseudo multiple valid keys
  "['x', 'y']",  # multiple sometimes-valid keys on level1
  "['x', '!']",  # at least one invalid key on level1
  "'x':'y'",     # sometimes-valid level1 index slice
  "'x':'!'",     # invalid level1 index slice
  "'!':",        # fully invalid level1 index slice
]

comments_level1 = [
'1: valid single',
'1: semi-valid single',
'1: invalid single',
'1: valid single as multiple',
'1: semi-valid single as multiple',
'1: invalid single as multiple',
'1: multiple semi-valid',
'1: one invalid in multiple',   
'1: semi-valid slice',
'1: semi-invalid slice',
'1: invalid slice',  
]

from itertools import product
multi_cases = list(product(cases_level0, cases_level1))
multi_comments = list(product(comments_level0, comments_level1))

base = [
  [f's.loc[{case[0]}, {case[1]}]', f's.loc[ix[{case[0]}, {case[1]}]]']  for case in multi_cases
]
commands = [
  command for sublist in base for command in sublist
]

results = pd.DataFrame('', index=commands, columns=['Uq Non-Mono', 'Uq Mono', 'Non-Uq Non-Mono', 'Non-Uq Mono'])
for j, index in enumerate(indexes):
  s = pd.Series([1,2,3,4,5], index=index)
  for i, command in enumerate(commands):
      results.iloc[i,j] = do(command)

base = [
  [com, com]  for com in multi_comments
]
comments = [
  com for sublist in base for com in sublist
]        

results['comment'] = comments
results.style\
     .applymap(lambda v: 'background-color:red;', subset=ix[["s.loc['a', ['!']]", "s.loc['a', ['y']]", "s.loc[['!'], ['!']]", "s.loc[['a'], ['y']]"], :])\
     .applymap(lambda v: 'background-color:LemonChiffon;', subset=ix[["s.loc['a', 'x':'y']"], :])

@attack68 attack68 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 12, 2021
@jbrockmendel jbrockmendel added API Design Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 13, 2021
@mroeschke mroeschke added Bug Error Reporting Incorrect or improved errors from pandas labels Aug 15, 2021
@kthyng

kthyng commented Oct 19, 2023

I have also run into this problem.
