BUG: Index.isin inconsistency with missing value #30677

Closed
proost opened this issue Jan 4, 2020 · 2 comments
Comments

@proost
Contributor

proost commented Jan 4, 2020

In the DataFrame case:

In [15]: df = pd.DataFrame({'num_legs': [None, 4], 'num_wings': [np.nan, 0]}, index=['falcon', 'dog'])

In [16]: df.isin([np.nan, None])                                                
Out[16]: 
        num_legs  num_wings
falcon      True       True
dog        False      False

In [17]: df.isin([None, None])                                                  
Out[17]: 
        num_legs  num_wings
falcon      True       True
dog        False      False

In [18]: df.isin([np.nan, np.nan])                                              
Out[18]: 
        num_legs  num_wings
falcon      True       True
dog        False      False

But in the Index case:

In [20]: idx = Index([np.nan,'a','b', None])                                    

In [21]: idx.isin([None])                                                       
Out[21]: array([False, False, False,  True])

In [22]: idx.isin([np.nan])                                                     
Out[22]: array([ True, False, False, False])

Problem description

'.isin' handles NA values differently depending on the object type. 'np.nan' and 'None' are both NA values, so 'Index.isin' should not distinguish between 'np.nan' and 'None', just as in the 'DataFrame' case.

Expected Output

In [20]: idx = Index([np.nan,'a','b', None])                                    

In [21]: idx.isin([None])                                                       
Out[21]: array([True, False, False,  True])

In [22]: idx.isin([np.nan])                                                     
Out[22]: array([ True, False, False, True])

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.0.0-37-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : ko_KR.UTF-8
LOCALE : ko_KR.UTF-8

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1

@proost proost changed the title from "BUG: pd.Index.isin inconsistency with missing value" to "BUG: Index.isin inconsistency with missing value" on Jan 4, 2020
@proost
Contributor Author

proost commented Jan 11, 2020

It is quite difficult to determine whether this is wrong or not, because with "Index",

idx = Index([np.nan,'a','b', None])                                    
idx.get_loc(None)

the result is 3, not

array([ True, False, False,  True])

and with the "contains" magic method ("__contains__"),

assert np.datetime64("NaT") in idx

raises an exception.
If someone wants to check for an actual "None" value rather than for NA in general, then the current behavior is more accurate. But I think of "None", "np.nan", and "NaT" as NA values. In that case, though, all the methods in "Index" might need to change, because, as you can see above, "Index" distinguishes between "None", "np.nan", etc.

So I would like to hear opinions on which approach is better.
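For readers who want the NA-agnostic behavior described in this comment without changing "Index" itself, one option is a small wrapper around "isin". This is only a sketch: `isin_na_aware` is a hypothetical helper name, not part of pandas, and it assumes the lookup values are scalars. The index is built with an explicit `dtype=object` so that both `np.nan` and `None` are stored as given.

```python
import numpy as np
import pandas as pd


def isin_na_aware(index, values):
    """Hypothetical helper: like Index.isin, but treats every NA value as the same.

    If any lookup value is NA (None, np.nan, NaT, ...), every NA entry of
    the index is reported as a match. Assumes `values` holds scalars only.
    """
    values = list(values)
    mask = index.isin(values)
    if any(pd.isna(v) for v in values):
        # Index.isna() flags None and np.nan alike in an object index.
        mask = mask | index.isna()
    return mask


idx = pd.Index([np.nan, 'a', 'b', None], dtype=object)
print(isin_na_aware(idx, [None]))
print(isin_na_aware(idx, [np.nan]))
print(isin_na_aware(idx, ['a']))
```

Because the NA handling is layered on top of `isin` rather than replacing it, lookups for ordinary values such as `'a'` are unaffected.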

@Dr-Irv
Contributor

Dr-Irv commented Sep 5, 2020

I don't think this is a bug. In the DataFrame case, when you assign the None values, the dtype of the columns is inferred to be float64, and the None values get converted to np.nan. But if you create a column with a dtype of object, you can differentiate between None and np.nan:

In [4]: df = pd.DataFrame({'num_legs': [None, 4], 'num_wings': [np.nan, 0]}, index=['falcon', 'dog'])
   ...: df
Out[4]:
        num_legs  num_wings
falcon       NaN        NaN
dog          4.0        0.0

In [5]: df['num_heads'] = pd.Series([None, np.nan], dtype=object, index=df.index)
   ...: df
Out[5]:
        num_legs  num_wings num_heads
falcon       NaN        NaN      None
dog          4.0        0.0       NaN

In [6]: df.dtypes
Out[6]:
num_legs     float64
num_wings    float64
num_heads     object
dtype: object

In [7]: df.isin([None])
Out[7]:
        num_legs  num_wings  num_heads
falcon     False      False       True
dog        False      False      False

In [8]: df.isin([np.nan])
Out[8]:
        num_legs  num_wings  num_heads
falcon      True       True      False
dog        False      False       True

In your example, when you created the Index, it defaulted to a dtype of object.
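The dtype difference described above can be checked directly. The following is a minimal sketch, not part of the original comment. It passes `dtype=object` to the Index explicitly, on the assumption that the inferred default for mixed string/NA data may differ between pandas versions (in the 0.25.1 session above, object was the inferred default).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num_legs': [None, 4], 'num_wings': [np.nan, 0]},
                  index=['falcon', 'dog'])

# Both columns are numeric, so None is converted to NaN and the
# columns are stored as float64.
print(df.dtypes)

# An object-dtype Index stores arbitrary Python objects as given.
# dtype=object is explicit here (assumption: the inferred default
# can vary across pandas versions).
idx = pd.Index([np.nan, 'a', 'b', None], dtype=object)
print(idx.dtype)

# isna() flags every missing entry, whichever NA object is stored.
print(idx.isna())
```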

@Dr-Irv Dr-Irv closed this as completed Sep 5, 2020