-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Can not select in MultiIndex with NaN #56366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks for the report, this is related to #48476.
Because of this behavior, the lookup fails. @jorisvandenbossche @phofl - were lookups involving NA part of the discussion in the ice cream agreement? |
Thanks for the explanation. I worked around it by changing NaN to another value. |
First of all, @moi90 example has a mistake, which does not invalidate it, but still. Instead of
it should be
Note that Second, there is a closely related behaviour: counts.loc[(slice(None), pd.NA, slice(None))] # KeyError: <NA> The error stack points to pandas/pandas/core/indexes/multi.py Lines 3331 to 3335 in 58461fe
I believe that the counts.index.get_level_values(1) # Index([nan, nan, True], dtype='object', name='b')
counts.index.codes[1] # array([0, 0, 1], dtype=int8) which is why I doubt that the real reason for these bugs is the one @rhshadrach mentions, namely:
(but I am not 100% sure and will leave this to experts) P.S. What is "Ice Cream Agreement"? Some bug reports just make you salivate. |
The issue here is in For example, if to remove tmp = df.set_index(['a', 'b'])
tmp.index[0][1] is numpy.nan
# True
tmp.loc[(False, numpy.nan)] # works as expected
c False
Name: (False, nan), dtype: bool It's a suprise to see in @rhshadrach 's example that: print(nan_from_index is np.nan)
# False even though: counts.index[0][1] is counts.index[1][1]
# True
Is it possible to get this fixed in
|
Okay, this indeed seems to be related to df = pd.DataFrame(
[[False, None, False], [True, None, False], [True, True, False]],
columns=["a", "b", "c"],
)
counts1 = df.value_counts(dropna=False)
mi = pd.MultiIndex.from_tuples([*df.values], names=['a', 'b', 'c'])
counts2 = pd.Series(index=mi, data=[1,1,1], name='count')
# counts1 and counts2 appear to have same content:
# a b c
# False NaN False 1
# True NaN False 1
# True False 1
# Name: count, dtype: int64
# However, the content is different
counts1.index.codes # FrozenList([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
counts2.index.codes # FrozenList([[0, 1, 1], [-1, -1, 0], [0, 0, 0]])
counts2[counts2.index[[0]]] # Works
counts2[counts2.index[0]] # Works
counts1[counts1.index[[0]]] # Works
counts1[counts1.index[0]] # KeyError: (False, nan, False) |
I think I solved it. I can open a PR soon, in a week or two, meanwhile I'll describe my findings. The problem is in Lines 807 to 816 in b162331
The mistake is the use of Lines 779 to 788 in b162331
I suppose, this can be fixed by checking [1 if (checknull(v) and has_nan) else lev.get_loc(v) + multiindex_nulls_shift
for lev, v, has_nan in zip(self.levels, key, self.level_has_nans)] It remains to check if this fix breaks anything, but I guess it does not. In any case, a PR should add the following test for mi = MultiIndex(...) # random codes, values, including NA
for i in range(len(mi)):
assert i == mi.get_loc(mi[i]) @rhshadrach Please let me know if all this makes sense. |
Thanks for the investigation @avm19. Your proposal makes sense, but I think would not work in the pathological case where a MultiIndex has NA values both in the levels and -1 for codes, e.g. pd.MultiIndex(
levels=[[np.nan], [0]],
codes=[[0, -1], [0, 0]],
) I think this is a consequence of groupby requiring nonnegative codes for NA values when df = pd.DataFrame(
[[False, None], [True, None]],
columns=["a", "b"],
)
counts = df.value_counts(dropna=False)
print(counts)
# a b
# False NaN 1
# True NaN 1
# Name: count, dtype: int64
print(counts.index.codes)
# [[0, 1], [0, 0]]
print(counts.loc[(False, np.nan)])
# Raises KeyError
mi = pd.MultiIndex(
[[False, True], []],
[[0, 1], [-1, -1]],
names=["a", "b"],
)
counts.index = mi
print(counts.loc[(False, np.nan)])
# 1 So to me the question is: do we support MultiIndex with NA in the levels as well as -1 for codes, or should we force NA values correspond to -1 codes. I can't say I have a good understanding of the consequences. |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When selecting values from a MultiIndex where one (or multiple) level contain None / NaN, I get a KeyError.
Expected Behavior
I need to be able to use None in a MultiIndex.
A workaround would be to replace None with something else. Is there a better option?)
Installed Versions
INSTALLED VERSIONS
commit : 2a953cf
python : 3.12.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-89-generic
Version : #99-Ubuntu SMP Mon Oct 30 20:42:41 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8
pandas : 2.1.3
numpy : 1.26.2
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.18.1
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
The text was updated successfully, but these errors were encountered: