Unexpected behavior with NaN values on str operators with categorical data #23602

Closed
weston-nrl opened this issue Nov 9, 2018 · 1 comment
Labels
Duplicate Report Duplicate issue or pull request

weston-nrl commented Nov 9, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
s = pd.Series(['a', 'b', np.nan])
s = s.astype('category')  # comment out this line to get the expected output
print(s.str.startswith('a', na=False))
print(s.str.endswith('a', na=False))
print(s.str.contains('a', na=False))

Problem description

The code above works as expected if you comment out the 4th line (astype('category')). With str.startswith, str.endswith, and str.contains, the na=False option makes the result False for the NaN value. However, once the astype('category') line makes the series categorical, na=False seems to be ignored and NaN is output for the NaN value instead. This makes the output difficult to use, e.g. as a boolean mask for slicing data.
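
A possible workaround, offered only as a sketch: cast the categorical series back to object dtype before calling the str accessor, so that the non-categorical code path (the one that behaves as expected above) is used. That the cast sidesteps the problem is an assumption based on the behavior described here, not something confirmed by the maintainers, and the name mask is just illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan]).astype('category')

# Assumption: going through object dtype avoids the categorical
# code path, so na=False is honored for the missing value.
mask = s.astype(object).str.startswith('a', na=False)

print(mask.tolist())     # [True, False, False]
print(s[mask].tolist())  # ['a']
```

The resulting mask is a plain boolean series, so it can be used directly for slicing. The cost is a temporary object-dtype copy of the data, which may matter for very large series.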

Expected Output

0 True
1 False
2 False
dtype: bool
0 True
1 False
2 False
dtype: bool
0 True
1 False
2 False
dtype: bool

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-138-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 9.0.3
setuptools: 20.7.0
Cython: 0.29
numpy: 1.11.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: 1.8.1
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.3
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

Duplicate of #22158, I think (I didn't look too closely; let me know if I'm wrong). #22170 is open to fix this, but it is getting close to going stale. Keep an eye on it, and if you want, you can take over the PR.

@TomAugspurger TomAugspurger added the Duplicate Report Duplicate issue or pull request label Nov 9, 2018
@TomAugspurger TomAugspurger added this to the No action milestone Nov 9, 2018