read_csv ignores dtype for bool columns with missing values #20591

Closed
finnhacks42 opened this issue Apr 3, 2018 · 2 comments · Fixed by #23968
Labels: Error Reporting (incorrect or improved errors from pandas), IO CSV (read_csv, to_csv)

finnhacks42 commented Apr 3, 2018

Code Sample, a copy-pastable example if possible

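import pandas as pd
from io import StringIO  # pandas.compat.StringIO is no longer available in recent pandas releases

data = "false,1\n,1\ntrue,"
# Column 'a' is requested as bool but its second row is empty.
df = pd.read_csv(StringIO(data), header=None, names=['a', 'b'], dtype={'a': 'bool'}, engine='c')
print(df['a'].dtype)
print(df)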
outputs:

object
       a    b
0  False  1.0
1    NaN  1.0
2   True  NaN

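import pandas as pd
from io import StringIO  # pandas.compat.StringIO is no longer available in recent pandas releases

data = "false,1\n,1\ntrue,"
# Same input and dtype request as above, this time with the python engine.
df = pd.read_csv(StringIO(data), header=None, names=['a', 'b'], dtype={'a': 'bool'}, engine='python')
print(df['a'].dtype)
print(df)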
outputs:

bool
       a    b
0  False  1.0
1   True  1.0
2   True  NaN

Problem description

Missing values cannot be coerced into a column of dtype bool. The expected behaviour, to be consistent with the analogous case for integers, is to throw a ValueError. The user is then aware of the issue and can specify a converter or parse the boolean as a float (subject to #16698).

Instead, the C engine silently ignores the requested dtype and returns a column of dtype object. This behaviour is especially confusing on large datasets with low_memory=True, because the only hint you get is the warning: Column 1 has mixed types. Specify dtype option on import or set low_memory=False.

The python engine silently converts all missing values to True and returns a column of dtype bool.
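As an illustration of the converter workaround mentioned above, here is a minimal sketch. The `parse_bool` helper is hypothetical (it is not part of pandas), and the choice to accept only "true" and "false", case-insensitively, is an assumption made for this example. It relies on `read_csv` passing each raw field to the converter as a string, with an empty string for an empty field.

```python
from io import StringIO

import pandas as pd


def parse_bool(token):
    """Convert one raw CSV token to True, False, or None (missing)."""
    text = token.strip().lower()
    if text == "":
        return None  # keep the gap visible instead of coercing it to a bool
    if text == "true":
        return True
    if text == "false":
        return False
    # Anything else is rejected loudly rather than silently coerced.
    raise ValueError(f"not a boolean token: {token!r}")


data = "false,1\n,1\ntrue,"
df = pd.read_csv(
    StringIO(data),
    header=None,
    names=["a", "b"],
    converters={"a": parse_bool},  # used for column 'a' in place of a dtype
)
print(df["a"].isna().tolist())
```

The resulting column holds Python bools plus a missing marker, so it is not a plain bool column, but the missing row stays detectable with `isna()` and can be handled explicitly.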

Expected Output

ValueError('Boolean column has NA values in column 1')
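To make the requested failure mode concrete, the following is a rough sketch of a strict check written against the standard library `csv` module rather than pandas internals. `read_strict_bool_column` is a hypothetical helper, the accepted token sets are assumptions, and the column number in the message is the zero-based index passed in; none of this is taken from the actual parser code.

```python
import csv
from io import StringIO

# Assumed token sets for this sketch; a real parser may accept more spellings.
TRUE_TOKENS = {"true"}
FALSE_TOKENS = {"false"}


def read_strict_bool_column(text, column):
    """Parse one CSV column as bool, raising ValueError on missing values."""
    values = []
    for row in csv.reader(StringIO(text)):
        token = row[column].strip().lower()
        if token == "":
            # Mirror the integer case: refuse to guess a value for a gap.
            raise ValueError(f"Boolean column has NA values in column {column}")
        if token in TRUE_TOKENS:
            values.append(True)
        elif token in FALSE_TOKENS:
            values.append(False)
        else:
            raise ValueError(f"Cannot parse {token!r} as bool in column {column}")
    return values
```

Calling `read_strict_bool_column("false,1\n,1\ntrue,", 0)` on the data from the report raises the ValueError, while input with no gaps in that column returns a plain list of bools.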

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8

pandas: 0.23.0.dev0+725.gf67c6fa80
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 6.3.0
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.6
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.4
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

chris-b1 (Contributor) commented Apr 4, 2018

Yes, I agree this should match the integer behavior. Thanks for the report, PR welcome! For reference, the integer case:

df = pd.read_csv(StringIO(data), header=None, names=['a','b'], dtype={'a': 'bool', 'b': 'int64'}, engine='c')

ValueError                                Traceback (most recent call last)
<snip>
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
ValueError: Integer column has NA values in column 1

chris-b1 added this to the Next Major Release milestone Apr 4, 2018
chris-b1 added the Error Reporting, IO CSV, and Difficulty Intermediate labels Apr 4, 2018
atulagrwl commented

Here is the current behavior:

Bool Value    Engine = C    Engine = Python
NA            NaN           True
0             False         False
1             True          True
-1            error         True
100           error         True
abc           error         True

What is the expected behavior for the above cases? @chris-b1
