read_csv: Casting boolean columns as floats turns missing values into 1.0 #16698

Closed
stephen-hoover opened this issue Jun 14, 2017 · 2 comments
Labels
Bug, IO CSV (read_csv, to_csv)

Comments

@stephen-hoover
Contributor

Code Sample, a copy-pastable example if possible

In pandas v0.20.2, the following code

import pandas as pd
from io import StringIO
data = "c1,c2\nfalse,1\n,1"
pd.read_csv(StringIO(data), dtype={'c1': 'float32'})['c1']

gives output

0    0.0
1    1.0
Name: c1, dtype: float32

Problem description

In this example, the column of boolean data contains a missing value. If I read the column as booleans (either explicitly via dtype or by allowing pandas to infer the type), then the missing value is given as NaN, as it should be. If I force the column type to be a float (or an integer) via the dtype argument to read_csv, then the missing value is given as 1.0, the same as True.

Expected Output

The output of

pd.read_csv(StringIO(data), dtype={'c1': 'float32'})['c1']

should be the same as the output of

pd.read_csv(StringIO(data))['c1'].astype('float32')

which is

0    0.0
1    NaN
Name: c1, dtype: float32

I.e., the missing value in the input CSV should be cast to NaN rather than 1.0.
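Until the direct dtype path behaves this way, one workaround is the two-step read described above: let pandas infer the column, then cast. A minimal sketch follows, assuming the same sample data; the helper name read_bool_column_as_float is hypothetical and not part of pandas.

```python
from io import StringIO

import pandas as pd

data = "c1,c2\nfalse,1\n,1"


def read_bool_column_as_float(csv_text, column, float_dtype="float32"):
    # Hypothetical helper, not a pandas API.
    # Parse without a dtype so the empty field is inferred as missing,
    # then cast the resulting column to the requested float type.
    frame = pd.read_csv(StringIO(csv_text))
    return frame[column].astype(float_dtype)


c1 = read_bool_column_as_float(data, "c1")
print(c1)
```

The cast happens after parsing, so the missing value is already NaN by the time the float conversion runs.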

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.0.7
pip: 9.0.1
setuptools: 33.1.1.post20170320
Cython: 0.25.2
numpy: 1.13.0
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Jun 15, 2017

yeah this looks like a bug. welcome to have a PR to fix.

@phofl
Member

phofl commented Dec 17, 2021

This was fixed in #44901, tests cover this
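To see whether an installed pandas includes the fix, a quick check along these lines may help. It is only a sketch based on the example in the original report, not the test added in #44901; the function name dtype_cast_preserves_nan is hypothetical.

```python
from io import StringIO

import pandas as pd


def dtype_cast_preserves_nan():
    # Hypothetical check, not the regression test from #44901.
    data = "c1,c2\nfalse,1\n,1"
    c1 = pd.read_csv(StringIO(data), dtype={"c1": "float32"})["c1"]
    # Row 1 holds the empty field; it should come back as NaN, not 1.0.
    return bool(pd.isna(c1.iloc[1]))


print(pd.__version__, dtype_cast_preserves_nan())
```

A result of False would suggest the installed version still shows the behavior reported here.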

@phofl closed this as completed on Dec 17, 2021
@phofl modified the milestones: Contributions Welcome, 2.0, No action on Dec 18, 2021