Inconsistent behaviour of read_csv() between python and c engine with null values. #23056

k-aziz · 2018-10-09T12:30:10Z

Code Sample

Using pandas v0.23.4

test.csv file

A,B,C
0,,2
,1,2
,,
a,2,

>>> df = pd.read_csv('test.csv', engine='python', dtype={'B': str})
>>> df
     A    B    C
0    0  nan  2.0
1  NaN    1  2.0
2  NaN  nan  NaN
3    a    2  NaN

>>> df.isnull()
       A      B      C
0  False  False  False
1   True  False  False
2   True  False   True
3  False  False   True

The same file read with the c engine:

>>> df = pd.read_csv('test.csv', engine='c', dtype={'B': str})
>>> df
     A    B    C
0    0  NaN  2.0
1  NaN    1  2.0
2  NaN  NaN  NaN
3    a    2  NaN

>>> df.isnull()
       A      B      C
0  False   True  False
1   True  False  False
2   True   True   True
3  False  False   True

Problem description

When dtype is used to read a column as str with the python engine, empty values in that column come back as the literal string 'nan' rather than NaN, so df.isnull() does not report them as null.

This is inconsistent with the c engine, which I believe has the correct behaviour of identifying these values as null. It also causes issues when working with null values, e.g. dropna() does not drop these rows.
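Until the python engine behaves the same way, one possible workaround is to convert the stray 'nan' strings back to real nulls after loading. The sketch below is only an illustration: restore_nulls is a hypothetical helper, not a pandas function, and the frame is built by hand to mimic what the python engine returned for column B in this report.

```python
import pandas as pd


def restore_nulls(df, columns):
    """Return a copy of df where literal 'nan' strings in `columns` become NaN.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = df.copy()
    for col in columns:
        # mask() replaces entries where the condition is True with NaN.
        # Caveat: genuine 'nan' text in the data would be converted too.
        out[col] = out[col].mask(out[col] == 'nan')
    return out


# Hand-built frame imitating the python-engine result shown above.
df = pd.DataFrame({'A': ['0', None, None, 'a'],
                   'B': ['nan', '1', 'nan', '2']})

fixed = restore_nulls(df, ['B'])
print(fixed.isnull())
print(fixed.dropna(subset=['B']))
```

After the conversion, dropna() and isnull() treat the affected rows as missing. The obvious drawback is that a cell whose real content is the text 'nan' cannot be told apart from a converted empty value.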

Expected Output

>>> df = pd.read_csv('test.csv', engine='python', dtype={'B': str})
>>> df
     A    B    C
0    0  NaN  2.0
1  NaN    1  2.0
2  NaN  NaN  NaN
3    a    2  NaN

>>> df.isnull()
       A      B      C
0  False   True  False
1   True  False  False
2   True   True   True
3  False  False   True
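The expectation above amounts to both engines producing the same null mask. A minimal sketch of that check follows, reading the same data from an in-memory string instead of test.csv; the names CSV_TEXT and null_mask are made up for this example. Whether the two masks agree depends on the installed pandas version, so no result is stated here.

```python
import io

import pandas as pd

# Same contents as test.csv in this report.
CSV_TEXT = "A,B,C\n0,,2\n,1,2\n,,\na,2,\n"


def null_mask(engine):
    """Read the sample data with the given engine and return df.isnull()."""
    df = pd.read_csv(io.StringIO(CSV_TEXT), engine=engine, dtype={'B': str})
    return df.isnull()


c_mask = null_mask('c')
python_mask = null_mask('python')

# True when both engines flag exactly the same cells as null.
engines_agree = c_mask.equals(python_mask)
print(engines_agree)
```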

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 18.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_GB.UTF-8
pandas: 0.23.4
pytest: 3.8.0
pip: 9.0.1
setuptools: 38.2.4
Cython: None
numpy: 1.15.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.15
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


WillAyd · 2018-10-09T22:20:02Z

Closing as a duplicate of #21131

WillAyd added the Missing-data and IO CSV labels Oct 9, 2018

WillAyd closed this as completed Oct 9, 2018

WillAyd added the Duplicate Report label Oct 9, 2018
