BUG: Can not select in MultiIndex with NaN #56366

moi90 · 2023-12-06T18:17:07Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(
    [[False, None, False], [True, None, False], [True, True, False]],
    columns=["a", "b", "c"],
)
counts = df.value_counts(dropna=False)

# This works
counts.loc[(True, True, False)]

# This does not work: KeyError: (None, True, False)
counts.loc[(None, True, False)]

# This does not work: KeyError: (<NA>, True, False)
counts.loc[(pd.NA, True, False)]

# This works
counts[tuple(df.loc[2])]

# This does not work: ValueError: Multi-dimensional indexing (e.g. `obj[:, None]`) is no longer supported. Convert to a numpy array before indexing instead.
counts[tuple(df.loc[0])]

Issue Description

When selecting values from a MultiIndex where one (or multiple) level contain None / NaN, I get a KeyError.

Expected Behavior

I need to be able to use None in a MultiIndex.

A workaround would be to replace None with something else. Is there a better option?)

Installed Versions

INSTALLED VERSIONS

commit : 2a953cf
python : 3.12.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-89-generic
Version : #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8

pandas : 2.1.3
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.18.1
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

The text was updated successfully, but these errors were encountered:

rhshadrach · 2023-12-26T23:20:17Z

Thanks for the report, this is related to #48476. value_counts is utilizing groupby, which converts the None to np.nan in the result. Unfortunately, using np.nan to get the row doesn't work either. This is because np.nan != np.nan and often you get a view on np.nan:

df = pd.DataFrame(
    [[False, None, False], [True, None, False], [True, True, False]],
    columns=["a", "b", "c"],
)
counts = df.value_counts(dropna=False)
nan_from_index = counts.index[1][1]
print(nan_from_index)
# nan
print(nan_from_index is np.nan)
# False

Because of this behavior, the lookup fails.

@jorisvandenbossche @phofl - were lookups involving NA part of the discussion in the ice cream agreement?

moi90 · 2024-01-02T16:22:33Z

Thanks for the explanation. I worked around it by changing NaN to another value.

avm19 · 2024-05-16T14:28:41Z

First of all, @moi90 example has a mistake, which does not invalidate it, but still. Instead of

counts.loc[(pd.NA, True, False)]  # KeyError: (<NA>, True, False)

it should be

counts.loc[(True, pd.NA, False)]  # KeyError: (True, <NA>, False)

Note that (<NA>, True, False) is genuinely not in the MultiIndex, so we expect an error. But (True, <NA>, False) is in the MultiIndex and we expect no error!

Second, there is a closely related behaviour:

counts.loc[(slice(None), pd.NA, slice(None))]  # KeyError: <NA>

The error stack points to raise KeyError(key) in here:

pandas/pandas/core/indexes/multi.py

Lines 3331 to 3335 in 58461fe

    
           locs = np.asarray(level_codes == idx, dtype=bool) 
        
           if not locs.any(): 
        
               # The label is present in self.levels[level] but unused: 
        
               raise KeyError(key)

I believe that the pd.NA in the user's .loc query is encoded as idx == -1, while level_codes encodes NaNs as 0, which is evident from:

counts.index.get_level_values(1)  # Index([nan, nan, True], dtype='object', name='b')
counts.index.codes[1]  # array([0, 0, 1], dtype=int8)

which is why I doubt that the real reason for these bugs is the one @rhshadrach mentions, namely:

This is because np.nan != np.nan

(but I am not 100% sure and will leave this to experts)

P.S. What is "Ice Cream Agreement"? Some bug reports just make you salivate.

bbgw · 2024-05-27T06:13:06Z

The issue here is in None in MultiIndex used with GroupBy.

For example, if to remove GroupBy following @rhshadrach 's example above :

tmp = df.set_index(['a', 'b'])
tmp.index[0][1] is numpy.nan
# True
tmp.loc[(False, numpy.nan)] # works as expected
c    False
Name: (False, nan), dtype: bool

It's a suprise to see in @rhshadrach 's example that:

print(nan_from_index is np.nan)
# False

even though:

counts.index[0][1] is counts.index[1][1]
# True

Unfortunately, using np.nan to get the row doesn't work either. This is because np.nan != np.nan

Is it possible to get this fixed in GroupBy, for np.nan is np.nan? I stopped at pandas.core.sorting.compress_group_index reading that:

Group_index is offsets into cartesian product of all possible labels. This
space can be huge, so this function compresses it, by computing offsets
(comp_ids) into the list of unique labels (obs_group_ids).

avm19 · 2024-05-27T14:18:13Z

Okay, this indeed seems to be related to GroupBy and not just to MultiIndex. Here is another illustrative example:

df = pd.DataFrame(
    [[False, None, False], [True, None, False], [True, True, False]],
    columns=["a", "b", "c"],
)
counts1 = df.value_counts(dropna=False)
mi = pd.MultiIndex.from_tuples([*df.values], names=['a', 'b', 'c'])
counts2 = pd.Series(index=mi, data=[1,1,1], name='count')
# counts1 and counts2 appear to have same content:
# a      b     c    
# False  NaN   False    1
# True   NaN   False    1
#        True  False    1
# Name: count, dtype: int64

# However, the content is different
counts1.index.codes  # FrozenList([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
counts2.index.codes  # FrozenList([[0, 1, 1], [-1, -1, 0], [0, 0, 0]])

counts2[counts2.index[[0]]]  # Works
counts2[counts2.index[0]]  # Works
counts1[counts1.index[[0]]]  # Works
counts1[counts1.index[0]]  # KeyError: (False, nan, False)

avm19 · 2024-05-27T21:26:05Z

I think I solved it. I can open a PR soon, in a week or two, meanwhile I'll describe my findings.

The problem is in BaseMultiIndexCodesEngine.get_loc() method":

pandas/pandas/_libs/index.pyx

Lines 807 to 816 in b162331

    
           def get_loc(self, object key): 
        
               if is_definitely_invalid_key(key): 
        
                   raise TypeError(f"'{key}' is an invalid key") 
        
               if not isinstance(key, tuple): 
        
                   raise KeyError(key) 
        
               try: 
        
                   indices = [1 if checknull(v) else lev.get_loc(v) + multiindex_nulls_shift 
        
                              for lev, v in zip(self.levels, key)] 
        
               except KeyError: 
        
                   raise KeyError(key)

The mistake is the use of checknull(v) to determine the appropriate code for an element of a tuple-key. Thus, BaseMultiIndexCodesEngine.get_loc makes this decision based on the value of the key, as opposed to the code of the value. This makes the get_loc's value-based logic incompatible with BaseMultiIndexCodesEngine._extract_level_codes(), which uses the code-based rule:

pandas/pandas/_libs/index.pyx

Lines 779 to 788 in b162331

    
           level_codes = list(target._recode_for_new_levels(self.levels)) 
        
           for i, codes in enumerate(level_codes): 
        
               if self.levels[i].hasnans: 
        
                   na_index = self.levels[i].isna().nonzero()[0][0] 
        
                   codes[target.codes[i] == -1] = na_index 
        
               codes += 1 
        
               codes[codes > 0] += 1 
        
               if self.level_has_nans[i]: 
        
                   codes[target.codes[i] == -1] += 1 
        
           return self._codes_to_ints(np.array(level_codes, dtype=self._codes_dtype).T)

I suppose, this can be fixed by checking level_has_nans in get_loc() like this:

[1 if (checknull(v) and has_nan) else lev.get_loc(v) + multiindex_nulls_shift
                       for lev, v, has_nan in zip(self.levels, key, self.level_has_nans)]

It remains to check if this fix breaks anything, but I guess it does not.

In any case, a PR should add the following test for MultiIndex:

mi = MultiIndex(...)  # random codes, values, including NA
for i in range(len(mi)):
    assert i == mi.get_loc(mi[i])

@rhshadrach Please let me know if all this makes sense.

rhshadrach · 2024-06-03T01:03:48Z

Thanks for the investigation @avm19. Your proposal makes sense, but I think would not work in the pathological case where a MultiIndex has NA values both in the levels and -1 for codes, e.g.

pd.MultiIndex(
   levels=[[np.nan], [0]],
   codes=[[0, -1], [0, 0]],
)

I think this is a consequence of groupby requiring nonnegative codes for NA values when dropna=True (the code is used to determine the index of the result, and so must be nonnegative). If we change the codes to be negative for NA values, then it works. E.g.

df = pd.DataFrame(
    [[False, None], [True, None]],
    columns=["a", "b"],
)
counts = df.value_counts(dropna=False)
print(counts)
# a      b
# False  NaN    1
# True   NaN    1
# Name: count, dtype: int64

print(counts.index.codes)
# [[0, 1], [0, 0]]

print(counts.loc[(False, np.nan)])
# Raises KeyError

mi = pd.MultiIndex(
    [[False, True], []],
    [[0, 1], [-1, -1]],
    names=["a", "b"],
)
counts.index = mi
print(counts.loc[(False, np.nan)])
# 1

So to me the question is: do we support MultiIndex with NA in the levels as well as -1 for codes, or should we force NA values correspond to -1 codes. I can't say I have a good understanding of the consequences.

moi90 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 6, 2023

rhshadrach added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff PDEP missing values Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint labels Dec 26, 2023

chaoyihu mentioned this issue Jun 28, 2024

BUG: None value in MultiIndex can't be found #59003

Closed

3 tasks

rhshadrach added the Indexing Related to indexing on series/frames, not to indexes themselves label Nov 21, 2024

rhshadrach mentioned this issue Nov 21, 2024

BUG: cannot locate multiIndex column when NaT is in column #60388

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Can not select in MultiIndex with NaN #56366

BUG: Can not select in MultiIndex with NaN #56366

moi90 commented Dec 6, 2023

INSTALLED VERSIONS

rhshadrach commented Dec 26, 2023

moi90 commented Jan 2, 2024

avm19 commented May 16, 2024 •

edited

Loading

bbgw commented May 27, 2024 •

edited

Loading

avm19 commented May 27, 2024

avm19 commented May 27, 2024

rhshadrach commented Jun 3, 2024

BUG: Can not select in MultiIndex with NaN #56366

BUG: Can not select in MultiIndex with NaN #56366

Comments

moi90 commented Dec 6, 2023

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

INSTALLED VERSIONS

rhshadrach commented Dec 26, 2023

moi90 commented Jan 2, 2024

avm19 commented May 16, 2024 • edited Loading

bbgw commented May 27, 2024 • edited Loading

avm19 commented May 27, 2024

avm19 commented May 27, 2024

rhshadrach commented Jun 3, 2024

avm19 commented May 16, 2024 •

edited

Loading

bbgw commented May 27, 2024 •

edited

Loading