
BUG: ordered categorical comparison with missing values evaluates to True #26504


Closed
mzwiessele opened this issue May 23, 2019 · 3 comments · Fixed by #26514
Labels
Categorical Categorical Data Type Numeric Operations Arithmetic, Comparison, and Logical operations

Comments

@mzwiessele

Code Sample, a copy-pastable example if possible

import pandas as pd
pd.Categorical(["1", "2", "3", None], categories=["1", "2", "3"], ordered=True) <= "2"
# => array([ True,  True, False,  True])

Problem description

Here the missing entry is evaluated as if None <= "2" were True. Shouldn't missing values always evaluate to False in an ordered comparison?
I think this is related to #4537
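For reference, the expectation matches how NumPy treats NaN in floating-point arrays. The sketch below uses plain NumPy only (it says nothing about pandas internals) and also shows the one exception, !=, where a NaN entry compares as True:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, np.nan])

# Ordered comparisons involving NaN are False under IEEE 754 rules.
le_result = values <= 2.0
# Inequality is the exception: NaN != x is True.
ne_result = values != 2.0

print(le_result.tolist())  # [True, True, False, False]
print(ne_result.tolist())  # [True, False, True, True]
```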

Expected Output

import pandas as pd
pd.Categorical(["1", "2", "3", None], categories=["1", "2", "3"], ordered=True) <= "2"
# => array([ True,  True, False, False])

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: 3.6.2
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.3
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: 1.7.5
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.5.4
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.5
lxml.etree: 4.2.2
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.8
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Contributor

Yep, should probably be False. It looks like we correctly mask for categorical : categorical comparisons, but not categorical : scalar.

In [16]: c
Out[16]:
[1, 2, 3, NaN]
Categories (3, object): [1 < 2 < 3]

In [17]: c <= c
Out[17]: array([ True,  True,  True, False])

The fix will be somewhere in the if is_scalar(other): branch, similar to what we do above.
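A rough sketch of the kind of masking described here, working directly on an integer codes array with NumPy. The helper name compare_codes_with_scalar and its parameters are hypothetical and are not pandas internals; the sketch only assumes that a missing entry is stored as code -1.

```python
import operator

import numpy as np


def compare_codes_with_scalar(codes, scalar_code, op):
    """Compare category codes against one category's code, masking missing entries.

    Hypothetical helper for illustration; -1 is assumed to mark a missing value.
    """
    codes = np.asarray(codes)
    result = op(codes, scalar_code)
    # Missing entries must not take part in the ordering, so force them to False.
    result[codes == -1] = False
    return result


codes = np.array([0, 1, 2, -1], dtype=np.int8)  # "1", "2", "3", missing
masked = compare_codes_with_scalar(codes, 1, operator.le)  # code 1 stands for "2"
print(masked.tolist())  # [True, True, False, False]
```

Filling with False suits the ordered operators (<, <=, >, >=). For != the opposite fill value may be more appropriate, following the usual NaN convention.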

@TomAugspurger TomAugspurger added Categorical Categorical Data Type Numeric Operations Arithmetic, Comparison, and Logical operations labels May 23, 2019
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone May 23, 2019
@shreyateeza

@TomAugspurger I am new here. I would like to contribute to this issue. Can you tell me what needs to be done?

@shantanu-gontia
Contributor

shantanu-gontia commented May 24, 2019

The Categorical object has the _codes property, which is used for comparison.

In [1]: import pandas as pd

In [2]: a = pd.Categorical([1, 2, 3, 4], categories=[1, 2, 3], ordered=True)

In [3]: a._codes
Out[3]: array([ 0,  1,  2, -1], dtype=int8)

In [4]: a._codes < 2
Out[4]: array([ True,  True, False,  True])

_codes stores NaN as -1, so a missing entry compares as smaller than every category code, which leads to the incorrect result.


Perhaps the fix lies in correct construction of the Categorical object and not in the comparison itself.
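Until a fix lands, the -1 codes can be masked by hand through the public API. A small sketch, assuming only the public Categorical.codes attribute and the comparison operators; it should give the expected result whether or not the installed pandas version already contains the fix:

```python
import pandas as pd

cat = pd.Categorical(["1", "2", "3", None], categories=["1", "2", "3"], ordered=True)

# Positions whose code is -1 hold a missing value.
missing = cat.codes == -1

# Compare as usual, then force the missing positions to False.
result = (cat <= "2") & ~missing
print(result.tolist())  # [True, True, False, False]
```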

another-green added a commit to another-green/pandas that referenced this issue May 24, 2019
another-green added a commit to another-green/pandas that referenced this issue May 25, 2019
@jorisvandenbossche jorisvandenbossche changed the title BUG: None comparison evaluates to True BUG: ordered categorical comparison with missing values evaluates to True May 28, 2019
another-green added a commit to another-green/pandas that referenced this issue May 29, 2019
@jreback jreback modified the milestones: Contributions Welcome, 0.25.0 May 30, 2019

5 participants