BUG: Index.isin inconsistency with missing value #30677

proost · 2020-01-04T08:14:14Z

DataFrame case,

In [15]: df = pd.DataFrame({'num_legs': [None, 4], 'num_wings': [np.nan, 0]}, index=['falcon', 'dog'])

In [16]: df.isin([np.nan, None])                                                
Out[16]: 
        num_legs  num_wings
falcon      True       True
dog        False      False

In [17]: df.isin([None, None])                                                  
Out[17]: 
        num_legs  num_wings
falcon      True       True
dog        False      False

In [18]: df.isin([np.nan, np.nan])                                              
Out[18]: 
        num_legs  num_wings
falcon      True       True
dog        False      False

But Index case,

In [20]: idx = Index([np.nan,'a','b', None])                                    

In [21]: idx.isin([None])                                                       
Out[21]: array([False, False, False,  True])

In [22]: idx.isin([np.nan])                                                     
Out[22]: array([ True, False, False, False])

Problem description

`.isin` handles NA values differently depending on the object type. `np.nan` and `None` are both NA values, so `Index.isin` should not distinguish between them, just as in the `DataFrame` case.

Expected Output

In [20]: idx = Index([np.nan,'a','b', None])                                    

In [21]: idx.isin([None])                                                       
Out[21]: array([True, False, False,  True])

In [22]: idx.isin([np.nan])                                                     
Out[22]: array([ True, False, False, True])
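
A minimal sketch of how the expected output above can be obtained by combining `Index.isin` with `Index.isna`. The name `isin_na_aware` is a hypothetical helper introduced only for illustration, not a pandas API, and it assumes `values` contains scalars only.

```python
import numpy as np
import pandas as pd


def isin_na_aware(index, values):
    """Hypothetical helper: like Index.isin, but if `values` contains any
    NA scalar (None, np.nan, ...), every NA entry of `index` matches."""
    values = list(values)
    # Plain membership test, as Index.isin does it.
    mask = np.asarray(index.isin(values), dtype=bool)
    # If the caller asked for any NA, also flag all NA positions.
    if any(pd.isna(v) for v in values):
        mask = mask | np.asarray(index.isna(), dtype=bool)
    return mask


idx = pd.Index([np.nan, "a", "b", None])
mask_none = isin_na_aware(idx, [None])
mask_nan = isin_na_aware(idx, [np.nan])
mask_a = isin_na_aware(idx, ["a"])
print(mask_none)
print(mask_nan)
print(mask_a)
```

Both `mask_none` and `mask_nan` flag the first and last positions, while a lookup for a non-NA value such as `"a"` is unaffected.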

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.0.0-37-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : ko_KR.UTF-8
LOCALE : ko_KR.UTF-8

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1


proost · 2020-01-11T09:40:32Z

It is quite difficult to determine whether this is wrong or not, because for an `Index`,

idx = Index([np.nan,'a','b', None])                                    
idx.get_loc(None)

the result is 3, not

array([ True, False, False,  True])

and also with the `__contains__` magic method,

assert np.datetime64("NaT") in idx

raises an exception.
If someone wants to check for an actual `None` value rather than for NA in general, then the current behavior is more accurate. But I think of `None`, `np.nan`, and `NaT` all as NA values. In that case, all the methods in `Index` may need to change because, as you can see above, `Index` distinguishes between `None`, `np.nan`, etc.

So I would like to hear opinions on which approach is better.

Dr-Irv · 2020-09-05T15:26:47Z

I don't think this is a bug. In the DataFrame case, when you assign the None values, the dtype of the columns is inferred to be float64, and the None values get converted to np.nan. But if you create a column with a dtype of object, you can differentiate between None and np.nan:

In [4]: df = pd.DataFrame({'num_legs': [None, 4], 'num_wings': [np.nan, 0]}, index=['falcon', 'dog'])
   ...: df
Out[4]:
        num_legs  num_wings
falcon       NaN        NaN
dog          4.0        0.0

In [5]: df['num_heads'] = pd.Series([None, np.nan], dtype=object, index=df.index)
   ...: df
Out[5]:
        num_legs  num_wings num_heads
falcon       NaN        NaN      None
dog          4.0        0.0       NaN

In [6]: df.dtypes
Out[6]:
num_legs     float64
num_wings    float64
num_heads     object
dtype: object

In [7]: df.isin([None])
Out[7]:
        num_legs  num_wings  num_heads
falcon     False      False       True
dog        False      False      False

In [8]: df.isin([np.nan])
Out[8]:
        num_legs  num_wings  num_heads
falcon      True       True      False
dog        False      False       True

In your example, when you created the Index, it defaulted to a dtype of object.
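
The dtype inference described above can be checked in isolation. This is a small sketch, not part of the original report: the variable names are illustrative, and the dtype printed for the `Index` is left unstated because it may depend on the pandas version (it was object in the session above).

```python
import numpy as np
import pandas as pd

# With numeric neighbours, None is converted to NaN and the dtype becomes float64.
float_col = pd.Series([None, 4])
print(float_col.dtype)

# With an explicit object dtype, None and np.nan stay distinct objects.
obj_col = pd.Series([None, np.nan], dtype=object)
print(obj_col.iloc[0] is None)

# The Index from the report mixes strings with missing values.
idx = pd.Index([np.nan, "a", "b", None])
print(idx.dtype)
```

Because the float64 column no longer holds a `None` at all, `isin` on it can only ever match NaN, whereas an object container still holds both `None` and `np.nan` as separate objects.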

proost changed the title ~~BUG: pd.Index.isin inconsistency with missing value~~ BUG: Index.isin inconsistency with missing value Jan 4, 2020

proost mentioned this issue Jan 7, 2020

ENH: pd.MultiIndex.get_loc(np.nan) #28919

Merged

5 tasks

Dr-Irv closed this as completed Sep 5, 2020
