
BUG: Categorical series .isin({pd.NaT, ...}) gives all-False #36550

Closed
2 of 3 tasks
adamhooper opened this issue Sep 22, 2020 · 3 comments · Fixed by #51587
Labels
Bug · Categorical (Categorical Data Type) · isin (isin method) · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@adamhooper
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd
pd.Series([pd.NaT, ''], dtype='category').isin({pd.NaT, ''})  # [True, True]
pd.Series([''], dtype='category').isin({''})  # [True] because '' is in series
pd.Series([''], dtype='category').isin({None, ''})  # [True] because '' is in series
pd.Series([''], dtype='category').isin({pd.NaT, ''})  # [False] even though '' is in series

Problem description

When calling .isin() on a categorical Series of empty strings, including pd.NaT among the search values prevents the other search values from being found.

(Perhaps that's because Pandas casts all input values to Timedelta?)

This behavior is unexpected. It makes me feel like I must be crazy. If I'm not meant to search for pd.NaT, I'd feel very happy with an error message when I try this.
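
One possible workaround, sketched here under assumptions, is to do the membership test against the categories with plain Python set membership, so that no dtype inference is applied to the mixed search values. The helper name `isin_via_categories` is hypothetical and not part of pandas; it only relies on `Series.cat.categories`, `Series.cat.codes` and `pd.isna`.

```python
import numpy as np
import pandas as pd


def isin_via_categories(series, values):
    """Hypothetical helper: membership test for a categorical Series."""
    wanted = set(values)
    categories = series.cat.categories
    codes = series.cat.codes.to_numpy()

    # Decide, category by category, whether it was requested. Plain Python
    # set membership avoids any dtype inference on the mixed search values.
    category_hit = np.array([c in wanted for c in categories], dtype=bool)

    result = np.zeros(len(series), dtype=bool)
    present = codes >= 0  # code -1 marks a missing value
    result[present] = category_hit[codes[present]]

    # Missing entries match only if a null-like value was requested.
    if any(pd.isna(v) for v in wanted):
        result[~present] = True
    return pd.Series(result, index=series.index)


s = pd.Series([""], dtype="category")
print(isin_via_categories(s, {pd.NaT, ""}).tolist())  # [True]
```

This treats every null-like search value (None, np.nan, pd.NaT) as matching every missing entry, which is an assumption of the sketch rather than documented pandas behavior.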

Expected Output

pd.Series([''], dtype='category').isin({pd.NaT, ''})  # [True] because '' is in series

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2a7d332
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.4-200.fc32.x86_64
Version : #1 SMP Wed Aug 26 22:28:08 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.2
numpy : 1.19.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 45.2.0
Cython : None
pytest : 5.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@adamhooper adamhooper added the Bug and Needs Triage labels Sep 22, 2020
adamhooper added a commit to CJWorkbench/reshape that referenced this issue Sep 22, 2020
Previously, "" wouldn't be found in a categorical because of Pandas
bug pandas-dev/pandas#36550

[finishes #174929289]

Tests and code have been updated to conform to Pandas 1.1 reshape
behavior, which reshapes Categorical to Categorical. This should be
backwards-compatible with Pandas 0.25 on production.
@dsaxton dsaxton added the Categorical and Missing-data labels and removed the Needs Triage and Missing-data labels Sep 22, 2020
@dsaxton
Member

dsaxton commented Sep 22, 2020

@adamhooper Thanks, it seems there is a bug here. The value doesn't have to be pd.NaT; other datetime-like values trigger it too, so your suspicion that the empty string is being converted somewhere may be right (other values don't seem to have the same effect):

[ins] In [1]: import numpy as np
         ...: import pandas as pd
         ...:
         ...:
         ...: value = ""
         ...:
         ...: s = pd.Series([value], dtype="category")
         ...: print(s.isin([pd.Timedelta(0), value]))
         ...: print(pd.__version__)
         ...:
0    False
dtype: bool
1.2.0.dev0+469.gf04624104
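
To check which companion values trigger the problem on a given pandas version, a small probe along these lines may help. It is only a sketch: the list of candidate values is an arbitrary choice for illustration, and the printed results depend on the installed version, so no expected output is shown.

```python
import numpy as np
import pandas as pd

s = pd.Series([""], dtype="category")

# Candidate values to pass to isin alongside the empty string.
extras = {
    "pd.NaT": pd.NaT,
    "pd.Timedelta(0)": pd.Timedelta(0),
    "pd.Timestamp('2020-01-01')": pd.Timestamp("2020-01-01"),
    "None": None,
    "np.nan": np.nan,
    "0": 0,
    "'x'": "x",
}

# Record, for each companion value, whether '' is still reported as found.
results = {}
for label, extra in extras.items():
    results[label] = bool(s.isin([extra, ""]).iloc[0])

for label, found in results.items():
    print(f"{label:28} -> '' found: {found}")
```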

@dsaxton dsaxton added this to the Contributions Welcome milestone Sep 22, 2020
@samilAyoub
Contributor

take

@samilAyoub samilAyoub removed their assignment Sep 22, 2020
@jbrockmendel jbrockmendel added the isin isin method label Oct 30, 2020
@mroeschke mroeschke added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Aug 13, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@topper-123
Contributor

topper-123 commented Feb 23, 2023

This works as expected in main (v2.0rc1):

>>> import pandas as pd
>>>
>>> pd.Series([pd.NaT, ''], dtype='category').isin({pd.NaT, ''})
0    True
1    True
dtype: bool
>>> pd.Series([''], dtype='category').isin({''})
0    True
dtype: bool
>>> pd.Series([''], dtype='category').isin({None, ''})
0    True
dtype: bool
>>> pd.Series([''], dtype='category').isin({pd.NaT, ''})  # this gave [False] before
0    True
dtype: bool

I'll add a test for this so it doesn't pop up again.
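
A regression check for the four cases above could look roughly like the following. This is a sketch only, not the test that was actually added to pandas; the function name `check_categorical_isin_with_nat` is hypothetical, and the expected values are taken from the output shown above, so it assumes a pandas version that includes the fix.

```python
import pandas as pd
import pandas.testing as tm

# (series data, values passed to isin, expected result)
CASES = [
    ([pd.NaT, ""], {pd.NaT, ""}, [True, True]),
    ([""], {""}, [True]),
    ([""], {None, ""}, [True]),
    ([""], {pd.NaT, ""}, [True]),  # this one gave [False] before
]


def check_categorical_isin_with_nat():
    """Hypothetical check: raises AssertionError if any case regresses."""
    for data, values, expected in CASES:
        ser = pd.Series(data, dtype="category")
        result = ser.isin(values)
        tm.assert_series_equal(result, pd.Series(expected, dtype=bool))
    return len(CASES)


check_categorical_isin_with_nat()
```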

6 participants