Unexpected behavior with NaN values on str operators with categorical data #23602

weston-nrl · 2018-11-09T15:49:50Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
s = pd.Series(['a', 'b', np.nan])
s = s.astype('category')
print(s.str.startswith('a', na=False))
print(s.str.endswith('a', na=False))
print(s.str.contains('a', na=False))

Problem description

The code above works as expected if you comment out the 4th line (astype('category')). With str.startswith, str.endswith, and str.contains, the na=False option outputs False for the NaN value. However, once that astype('category') line makes the series categorical, na=False seems to be ignored and NaN is output for the NaN value instead. This makes the output difficult to use, e.g. as a boolean mask for slicing data.
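A possible workaround until this is fixed, sketched under the assumption that casting the categorical series back to object dtype makes the .str methods honor na=False again (the report above only confirms that the non-categorical series behaves correctly). The names mask and selected are just for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan]).astype('category')

# Assumed workaround: cast back to object dtype so the .str method
# operates on the plain values and fills missing entries with False.
mask = s.astype(object).str.startswith('a', na=False)

# The mask contains no NaN, so it can be used for boolean indexing.
selected = s[mask]
print(mask)
print(selected)
```

The cast gives up the memory savings of the categorical dtype for the duration of the string operation, so this is only meant as a stopgap.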

Expected Output

0     True
1    False
2    False
dtype: bool
0     True
1    False
2    False
dtype: bool
0     True
1    False
2    False
dtype: bool

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-138-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 9.0.3
setuptools: 20.7.0
Cython: 0.29
numpy: 1.11.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: 1.8.1
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.3
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


TomAugspurger · 2018-11-09T16:14:29Z

Duplicate of #22158, I think (I didn't look too closely; let me know if I'm wrong). #22170 is open to fix this, but it is getting close to going stale. Keep an eye on it, and if you want, you can take over the PR.

TomAugspurger closed this as completed Nov 9, 2018

TomAugspurger added the Duplicate Report label Nov 9, 2018

TomAugspurger added this to the No action milestone Nov 9, 2018
