Categorical equality with NaN behaves unexpectedly #28384

Closed
ageorgou opened this issue Sep 11, 2019 · 7 comments · Fixed by #36520
Labels
Categorical Categorical Data Type good first issue Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Tests Unit test(s) needed to prevent regressions

@ageorgou

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

cat_type = pd.CategoricalDtype(categories=["a", "b", "c"])
s1_cat = pd.Series(["a", "b", "c"], dtype=cat_type)
s2_cat = pd.Series([np.nan, "a", "b"], dtype=cat_type)
s1_cat != s2_cat  # expected all True

Output:

0    False
1     True
2     True
dtype: bool

Problem description

I would expect any value to compare as != to NaN.

Comparing the categorical series with == works as expected:

s1_cat == s2_cat  # expect all False

Output:

0    False
1    False
2    False
dtype: bool

Element-wise comparison seems to work fine:

for left, right in zip(s1_cat, s2_cat):
    print(left != right)  # expect all True

Output:

True
True
True

This also works as expected with other data types, e.g.:

s1_int = pd.Series([1, 2, 3])
s2_int = pd.Series([np.nan, 1, 2])
s1_int != s2_int  # expect all True

gives

0    True
1    True
2    True
dtype: bool

Apologies if this is a duplicate; I have found various issues about categorical types and missing values, but not about comparing them in this way.

Expected Output

s1_cat != s2_cat  # expected all True
0    True
1    True
2    True
dtype: bool
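
A possible workaround on affected versions, sketched here as an assumption rather than a recommendation from the maintainers, is to cast both sides to object dtype before comparing, so that `!=` is evaluated on the values themselves rather than on the categorical representation:

```python
import numpy as np
import pandas as pd

cat_type = pd.CategoricalDtype(categories=["a", "b", "c"])
s1_cat = pd.Series(["a", "b", "c"], dtype=cat_type)
s2_cat = pd.Series([np.nan, "a", "b"], dtype=cat_type)

# Cast to object so that != goes through the ordinary object comparison path
# instead of the categorical one.
workaround = s1_cat.astype(object) != s2_cat.astype(object)
print(workaround.tolist())
```

The cast gives up the memory savings of the categorical dtype for the duration of the comparison, so it is only a stopgap.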

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 16.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@TomAugspurger
Contributor

cc @jorisvandenbossche (gets into the NA vs. NaN discussion as part of #28095).

@WillAyd
Member

WillAyd commented Sep 11, 2019

This is standard behavior per IEEE 754:

https://stackoverflow.com/questions/1565164/what-is-the-rationale-for-all-comparisons-returning-false-for-ieee754-nan-values

So I think there is nothing to be done directly on this issue. If you have feedback on how missing values should be handled in general, I would suggest rolling it into the linked issue above.

@WillAyd WillAyd added this to the No action milestone Sep 11, 2019
@WillAyd WillAyd closed this as completed Sep 11, 2019
@jorisvandenbossche
Member

@WillAyd as noted in the issue, for other dtypes it does evaluate to True, as well as for scalars:

In [9]: np.nan != np.nan  
Out[9]: True
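
For reference, the IEEE 754 rule can be checked with plain Python floats: every ordered comparison and `==` involving NaN is False, while `!=` is True. A minimal sketch (the `comparisons` dict is only an illustration device):

```python
import math

nan = math.nan

# Under IEEE 754, NaN is unordered: ==, < and > are all False, != is True.
comparisons = {
    "nan == nan": nan == nan,
    "nan != nan": nan != nan,
    "nan < 1.0": nan < 1.0,
    "nan > 1.0": nan > 1.0,
    "nan != 1.0": nan != 1.0,
}
for expr, result in comparisons.items():
    print(f"{expr}: {result}")
```

So "all comparisons return False" holds for every operator except `!=`, which is the operator this issue is about.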

@jorisvandenbossche
Member

To make the difference in behaviour between categorical and non-categorical very explicit:

In [14]: pd.Series(['a', 'b', np.nan]) != pd.Series(['b', 'a', 'a'])
Out[14]: 
0    True
1    True
2    True
dtype: bool

In [15]: pd.Series(['a', 'b', np.nan], dtype='category') != pd.Series(['b', 'a', 'a'], dtype='category') 
Out[15]: 
0     True
1     True
2    False
dtype: bool

I suppose this has to do with np.nan being represented in a categorical by the code -1, and this might not be special-cased for !=.
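
The -1 code can be inspected through `.cat.codes`. The sketch below also shows one plausible way a code-based comparison could produce the output above: missing positions are overwritten with a fixed value regardless of the operator. `compare_codes_ne` and its `fill_missing` parameter are hypothetical names for illustration, not the actual pandas implementation.

```python
import numpy as np
import pandas as pd

left = pd.Series(["a", "b", np.nan], dtype="category")
right = pd.Series(["b", "a", "a"], dtype="category")

# Both series share the categories ['a', 'b']; the missing value is stored as -1.
left_codes = left.cat.codes.to_numpy()
right_codes = right.cat.codes.to_numpy()
print(left_codes.tolist(), right_codes.tolist())


def compare_codes_ne(a, b, fill_missing):
    """Hypothetical helper: compare integer codes with !=, then overwrite
    every position where either side is missing (code -1) with fill_missing."""
    result = a != b
    missing = (a == -1) | (b == -1)
    result[missing] = fill_missing
    return result


# Filling missing positions with False reproduces the surprising result;
# filling with True gives the IEEE 754-style answer for !=.
buggy = compare_codes_ne(left_codes, right_codes, fill_missing=False)
fixed = compare_codes_ne(left_codes, right_codes, fill_missing=True)
print(buggy.tolist(), fixed.tolist())
```

A fill value of False is the right choice for `==`, `<` and `>`, which is presumably why only `!=` misbehaves.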

@jorisvandenbossche jorisvandenbossche removed this from the No action milestone Sep 12, 2019
@jorisvandenbossche jorisvandenbossche added Bug Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Usage Question labels Sep 12, 2019
@quale1

quale1 commented May 2, 2020

Just to further emphasize the problems this causes:

In [3]: x = pd.Series(['a', 'b', np.nan], dtype='category')

In [4]: y = pd.Series(['b', 'a', 'a'], dtype='category')

In [5]: x[2] != y[2]
Out[5]: True

In [6]: (x != y)[2]
Out[6]: False

Result 5 makes sense, but 6 doesn't. It's impossible to successfully reason about a Pandas program when x[2] != y[2] and (x != y)[2] give different results.
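
The consistency property described here can be checked mechanically. The following is a minimal sketch; `inconsistent_positions` is a hypothetical helper, not part of pandas. It returns the positions where the scalar comparison and the vectorized comparison disagree, which per the results above would be position 2 on an affected version and nothing on a version where the two agree.

```python
import numpy as np
import pandas as pd


def inconsistent_positions(x, y):
    """Hypothetical helper: positions where the scalar comparison
    x[i] != y[i] disagrees with the vectorized (x != y)[i]."""
    vectorized = (x != y).to_numpy()
    return [
        i
        for i in range(len(x))
        if bool(x.iloc[i] != y.iloc[i]) != bool(vectorized[i])
    ]


x = pd.Series(["a", "b", np.nan], dtype="category")
y = pd.Series(["b", "a", "a"], dtype="category")
print(inconsistent_positions(x, y))
```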

@simonjayhawkins
Member

looks to be fixed on master (@jbrockmendel maybe #36237?)

>>> pd.__version__
'1.2.0.dev0+378.g8df0218a4b'
>>> pd.Series(['a', 'b', np.nan], dtype='category') != pd.Series(['b', 'a', 'a'], dtype='category')
0    True
1    True
2    True
dtype: bool
>>>
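
Since the behaviour is fixed but untested, a regression test could look roughly like the sketch below. This is an illustration only, not the test that was eventually merged; the function name is hypothetical, and it uses the public `pd.testing.assert_series_equal` rather than pandas' internal test helpers.

```python
import numpy as np
import pandas as pd


def test_categorical_ne_with_nan():
    # != against a missing value in a categorical should be True,
    # matching object-dtype Series and scalar NaN comparisons.
    left = pd.Series(["a", "b", np.nan], dtype="category")
    right = pd.Series(["b", "a", "a"], dtype="category")

    result = left != right
    expected = pd.Series([True, True, True])

    pd.testing.assert_series_equal(result, expected)
```

On a version that still has the bug, this test fails at position 2.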

@simonjayhawkins simonjayhawkins added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Sep 14, 2020
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Sep 14, 2020
@junjunjunk
Contributor

take

@jreback jreback modified the milestones: Contributions Welcome, 1.2 Sep 22, 2020
@jreback jreback added the Categorical Categorical Data Type label Sep 22, 2020
@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Indexing Related to indexing on series/frames, not to indexes themselves labels Sep 22, 2020