BUG: read_csv converting nans to 1 when casting bools to float #44901

Merged
merged 2 commits into from
Dec 17, 2021
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.4.0.rst
@@ -753,6 +753,7 @@ I/O
 - Bug in :func:`read_csv` not replacing ``NaN`` values with ``np.nan`` before attempting date conversion (:issue:`26203`)
 - Bug in :func:`read_csv` raising ``AttributeError`` when attempting to read a .csv file and infer index column dtype from an nullable integer type (:issue:`44079`)
 - :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` with ``compression`` set to ``'zip'`` no longer create a zip file containing a file ending with ".zip". Instead, they try to infer the inner file name more smartly. (:issue:`39465`)
+- Bug in :func:`read_csv` where reading a mixed column of booleans and missing values to a float type results in the missing values becoming 1.0 rather than NaN (:issue:`42808`, :issue:`34120`)
 - Bug in :func:`read_csv` when passing simultaneously a parser in ``date_parser`` and ``parse_dates=False``, the parsing was still called (:issue:`44366`)
 - Bug in :func:`read_csv` silently ignoring errors when failling to create a memory-mapped file (:issue:`44766`)
 - Bug in :func:`read_csv` when passing a ``tempfile.SpooledTemporaryFile`` opened in binary mode (:issue:`44748`)
21 changes: 20 additions & 1 deletion pandas/_libs/parsers.pyx
@@ -1093,8 +1093,27 @@ cdef class TextReader:
                 break

         # we had a fallback parse on the dtype, so now try to cast
-        # only allow safe casts, eg. with a nan you cannot safely cast to int
         if col_res is not None and col_dtype is not None:
+            # If col_res is bool, it might actually be a bool array mixed with NaNs
+            # (see _try_bool_flex()). Usually this would be taken care of using
+            # _maybe_upcast(), but if col_dtype is a floating type we should just
+            # take care of that cast here.
+            if col_res.dtype == np.bool_ and is_float_dtype(col_dtype):
+                mask = col_res.view(np.uint8) == na_values[np.uint8]
+                col_res = col_res.astype(col_dtype)
+                np.putmask(col_res, mask, np.nan)
+                return col_res, na_count
+
+            # NaNs are already cast to True here, so can not use astype
+            if col_res.dtype == np.bool_ and is_integer_dtype(col_dtype):
+                if na_count > 0:
+                    raise ValueError(
+                        f"cannot safely convert passed user dtype of "
+                        f"{col_dtype} for {np.bool_} dtyped data in "
+                        f"column {i} due to NA values"
+                    )
+
+            # only allow safe casts, eg. with a nan you cannot safely cast to int
             try:
                 col_res = col_res.astype(col_dtype, casting='safe')
             except TypeError:
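The float branch in the hunk above can be sketched outside the parser as follows. This is a minimal illustration, not the parser's actual code path: the helper name `bool_with_na_to_float` and the sentinel byte `255` are assumptions made for the example (the real code looks the sentinel up in `na_values[np.uint8]`). The point it shows is that the missing slots have to be located from the raw bytes before `astype`, because after the cast they are indistinguishable from `True`.

```python
import numpy as np

# Assumption for illustration only: missing entries are marked with this byte.
NA_SENTINEL = np.uint8(255)


def bool_with_na_to_float(col_res, dtype=np.float64):
    """Cast a bool array to float, turning sentinel-marked slots into NaN."""
    # Locate the missing slots from the raw bytes *before* casting.
    mask = col_res.view(np.uint8) == NA_SENTINEL
    out = col_res.astype(dtype)
    # Overwrite whatever the cast produced for the missing slots.
    np.putmask(out, mask, np.nan)
    return out


# Bytes standing in for the column "NaN, True, False".
raw = np.array([255, 1, 0], dtype=np.uint8)
col_res = raw.view(np.bool_)
result = bool_with_na_to_float(col_res)
print(result)
```

Building the mask first keeps the fix local to the cast and leaves the non-missing values untouched.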
6 changes: 5 additions & 1 deletion pandas/io/parsers/arrow_parser_wrapper.py
@@ -130,7 +130,11 @@ def _finalize_output(self, frame: DataFrame) -> DataFrame:
             frame.index.names = [None] * len(frame.index.names)

         if self.kwds.get("dtype") is not None:
-            frame = frame.astype(self.kwds.get("dtype"))
+            try:
+                frame = frame.astype(self.kwds.get("dtype"))
+            except TypeError as e:
+                # GH#44901 reraise to keep api consistent
+                raise ValueError(e)
         return frame

     def read(self) -> DataFrame:
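The re-raise pattern in this hunk can be shown in isolation with plain Python. The functions `cast_values` and `cast_with_consistent_error` below are hypothetical stand-ins, not pandas APIs; the sketch only illustrates translating a `TypeError` from a failed cast into a `ValueError`, so that callers can catch one exception type whichever code path did the cast.

```python
def cast_values(values, target):
    """Hypothetical stand-in for a cast that may raise TypeError."""
    return [target(v) for v in values]


def cast_with_consistent_error(values, target):
    """Run the cast, surfacing TypeError as ValueError."""
    try:
        return cast_values(values, target)
    except TypeError as e:
        # Re-raise as ValueError so callers only need to catch one type.
        # The original message is preserved as the new exception's text.
        raise ValueError(e)


print(cast_with_consistent_error([True, False], int))

try:
    cast_with_consistent_error([None, True, False], int)
except ValueError as err:
    print(f"ValueError: {err}")
```

Because the new exception is raised inside the `except` block, Python still records the original `TypeError` as its context, so the traceback keeps the root cause.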
39 changes: 39 additions & 0 deletions pandas/tests/io/parser/test_na_values.py
@@ -17,6 +17,7 @@
 import pandas._testing as tm

 skip_pyarrow = pytest.mark.usefixtures("pyarrow_skip")
+xfail_pyarrow = pytest.mark.usefixtures("pyarrow_xfail")


 @skip_pyarrow
@@ -615,3 +616,41 @@ def test_nan_multi_index(all_parsers):
     )

     tm.assert_frame_equal(result, expected)
+
+
+@xfail_pyarrow
+def test_bool_and_nan_to_bool(all_parsers):
+    # GH#42808
+    parser = all_parsers
+    data = """0
+NaN
+True
+False
+"""
+    with pytest.raises(ValueError, match="NA values"):
+        parser.read_csv(StringIO(data), dtype="bool")
+
+
+def test_bool_and_nan_to_int(all_parsers):
+    # GH#42808
+    parser = all_parsers
+    data = """0
+NaN
+True
+False
+"""
+    with pytest.raises(ValueError, match="convert|NoneType"):
+        parser.read_csv(StringIO(data), dtype="int")
+
+
+def test_bool_and_nan_to_float(all_parsers):
+    # GH#42808
+    parser = all_parsers
+    data = """0
+NaN
+True
+False
+"""
+    result = parser.read_csv(StringIO(data), dtype="float")
+    expected = DataFrame.from_dict({"0": [np.nan, 1.0, 0.0]})
+    tm.assert_frame_equal(result, expected)