BUG: Cannot index into DataFrame with Nullable Integer as index #34497

Closed
elinor-lev opened this issue May 31, 2020 · 4 comments · Fixed by #45167
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves NA - MaskedArrays Related to pd.NA and nullable extension arrays

@elinor-lev

elinor-lev commented May 31, 2020

Code Sample

import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.array([[1,2,3,4],[5,6,7,8],[1,2,np.nan,np.nan]]).T,columns=('a','b','c'),dtype="Int64")
df2 = df.set_index('c')

df2.loc[1]

Problem description

When there is more than one missing value in the dataframe index, indexing into the dataframe results in

TypeError: Cannot convert bool to numpy.ndarray

If only one NaN index exists, this works:

df = pd.DataFrame(data=np.array([[1,2,3,4],[5,6,7,8],[1,2,3,np.nan]]).T,columns=('a','b','c'),dtype="Int64")
df2 = df.set_index('c')
df2.loc[1]
a    1
b    5
Name: 1, dtype: Int64

Expected Output

I expect df2.loc[1] to return

a    1
b    5
Name: 1, dtype: Int64

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-53-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.1.3.post20200330
Cython : 0.29.15
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.2.1
html5lib : 0.999999999
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.6.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.2.1
matplotlib : 3.1.2
numexpr : 2.6.4
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.4.2
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0

@elinor-lev elinor-lev added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 31, 2020
@elinor-lev elinor-lev changed the title BUG: BUG: Cannot index into DataFrame with Nullable Integer as index May 31, 2020
@dsaxton dsaxton added Indexing Related to indexing on series/frames, not to indexes themselves NA - MaskedArrays Related to pd.NA and nullable extension arrays and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 31, 2020
@dsaxton
Member

dsaxton commented May 31, 2020

Note that it's actually the presence of duplicates rather than more than one NA value that causes this to fail (it's the uniqueness in the second case that makes it work). This for example also raises:

df = pd.DataFrame(
    [1, 2, 3],
    index=pd.Index([1, 1, pd.NA]),
)
df.loc[1]

It seems the two cases follow separate code paths in pandas/_libs/index.pyx.
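
To check which of the two cases a given index falls into, one option is to inspect `Index.is_unique`. The sketch below only reads that property for the two example indexes from this thread; it does not call `.loc`, since the lookup result depends on the installed pandas version. The variable names are illustrative.

```python
import pandas as pd

# Duplicate label plus a missing value: the case reported as failing.
dup_index = pd.Index([1, 1, pd.NA])

# Unique labels plus a missing value: the case reported as working.
unique_index = pd.Index([1, 2, pd.NA])

# is_unique tells us which lookup case applies to each index.
print("duplicates present:", not dup_index.is_unique)
print("duplicates present:", not unique_index.is_unique)
```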

@elinor-lev
Author

That's actually not accurate; in the current version of pandas, there is no issue with duplicate entries in the index list. Without the presence of pd.NA, this actually works:

df = pd.DataFrame(
    [1, 2, 3],
    index=pd.Index([1, 1, 2]),
)
df.loc[1]

producing:

   0
1  1
1  2

I believe it has to do with the non-equality nature of different NaNs, and how masking with NA is handled, which is discussed in #31503.
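
For reference, the equality semantics mentioned here can be seen with two scalar comparisons. This is a minimal sketch, not taken from the pandas source: NaN compares unequal to itself, while a comparison with pd.NA propagates pd.NA rather than returning a boolean.

```python
import numpy as np
import pandas as pd

# NaN is never equal to anything, including itself.
nan_eq = np.nan == np.nan

# Comparing with pd.NA does not yield True or False; it yields pd.NA.
na_eq = pd.NA == 1

# pd.NA has no truth value, so using it as a boolean raises TypeError.
try:
    bool(na_eq)
    na_is_ambiguous = False
except TypeError:
    na_is_ambiguous = True

print(nan_eq, na_eq, na_is_ambiguous)
```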

@jorisvandenbossche
Member

@elinor-lev what Daniel meant is that the presence of a duplicate (whether of the NA or of another number) is what triggers the error: without duplicates but with pd.NA, it works (although it is of course still the presence of pd.NA that causes the bug):

In [1]: df = pd.DataFrame( 
   ...:     [1, 2, 3], 
   ...:     index=pd.Index([1, 2, pd.NA]), 
   ...: ) 
   ...: df.loc[1] 
Out[1]: 
0    1
Name: 1, dtype: int64

Indeed, whether or not the index contains duplicates determines which code path is taken.

There are still several problems with having pd.NA values in object dtype, see also #32931

@elinor-lev and thanks for the report!
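
As a point of comparison with the object-dtype case, here is a small sketch of how an elementwise comparison behaves on a nullable Int64 array: the missing value propagates as pd.NA in a boolean extension array and can be filled explicitly before the result is used as a mask. Treating NA as "no match" is an assumption made for this illustration, and the variable names are hypothetical.

```python
import pandas as pd

# Nullable integer array with a duplicate label and one missing value.
labels = pd.array([1, 1, None], dtype="Int64")

# Elementwise comparison; the missing entry stays missing in the result.
compared = labels == 1

# Treat "unknown" as "no match" and convert to a plain boolean ndarray.
mask = compared.fillna(False).to_numpy(dtype=bool)

print(compared)
print(mask)
```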

@jorisvandenbossche
Member

I believe it has to do with the non-equality nature of different NaNs, and how masking with NA is handled, which is discussed in #31503.

Yes, I think that is correct. The error happens in this function:

cdef _maybe_get_bool_indexer(self, object val):
    cdef:
        ndarray[uint8_t, ndim=1, cast=True] indexer

    indexer = self._get_index_values() == val
    return self._unpack_bool_indexer(indexer, val)

Here indexer is expected to be a boolean ndarray. However, due to a numpy bug (they are fixing it, but for now it only gives a deprecation warning), we get a single False:

# this is what `_get_index_values()` above returns
In [9]: df.index._engine.vgetter()  
Out[9]: array([1, 1, <NA>], dtype=object)

In [10]: df.index._engine.vgetter() == 1  
/home/joris/miniconda3/envs/dev/bin/ipython:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  #!/home/joris/miniconda3/envs/dev/bin/python
Out[10]: False
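
One way to picture what the indexer needs here is a mask built element by element, so that the result is always a boolean ndarray whatever numpy does with the vectorized comparison. The following is a minimal sketch in plain Python, not the fix that was merged into pandas; `na_safe_equal_mask` is a hypothetical helper, and treating pd.NA as matching only pd.NA is an assumption of this sketch.

```python
import numpy as np
import pandas as pd


def na_safe_equal_mask(values, key):
    """Hypothetical helper: boolean mask of positions in `values` equal to `key`.

    The result is always a boolean ndarray, never a scalar and never an
    array that contains pd.NA.
    """
    mask = np.zeros(len(values), dtype=bool)
    for i, value in enumerate(values):
        if value is pd.NA or key is pd.NA:
            # Assumption for this sketch: NA matches only NA.
            mask[i] = value is pd.NA and key is pd.NA
        else:
            mask[i] = bool(value == key)
    return mask


# Same object-dtype values as in the session above.
values = np.array([1, 1, pd.NA], dtype=object)
mask = na_safe_equal_mask(values, 1)
na_mask = na_safe_equal_mask(values, pd.NA)
print(mask, na_mask)
```

The explicit loop is slow compared with a vectorized comparison; it is used here only to make the handling of pd.NA visible.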
