BUG: read_csv with mixed bools and NaNs sometimes reads NaNs as 1.0 (#42808) #42944

joelgibson · 2021-08-09T06:38:22Z

When reading a column of mixed booleans and missing values into a specific floating dtype, for example read_csv(..., dtype='float'), missing values were being incorrectly converted to 1.0 rather than NaN. This patch fixes this issue.

closes BUG: read_csv converts NaN to 1.0 in certain circumstances #42808
closes BUG: read_csv returns inconsistent or misleading values with dtype=float #34120
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

I've never been in the CSV parsing code before, so I've tried to put this change in the spot that makes the most sense and causes the least impact, but corrections are very welcome.

The issue arises because the _try_bool_flex() function parses a mixed column of bools and missing values into a uint8 array, storing False/True/missing as 0/1/255. It then returns a bool view of this array, "hiding" the missing values so that they look like True. These missing values are meant to be "unhidden" later by a call to _maybe_upcast(), which views the array as uint8 again and maps it to a dtype=object array containing False/True/NaN.

However, when the user specifies an explicit float dtype, this upcast is skipped since the bool array gets converted to a float array before hitting _maybe_upcast(). This patch just replicates the upcast logic in this case.
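
To make the mechanism concrete, here is a minimal NumPy-only sketch of the sentinel being hidden and then recovered. It is not the parser code: the names NA_SENTINEL and upcast_bool_to_float are made up for illustration, and only the 0/1/255 encoding is taken from the description above.

```python
import numpy as np

# Assumed for illustration: 255 marks a missing entry, as described above.
NA_SENTINEL = 255

# False / True / missing stored as 0 / 1 / 255.
raw = np.array([0, 1, NA_SENTINEL, 1], dtype=np.uint8)

# Reinterpreting the same bytes as bool "hides" the sentinel:
# any nonzero byte is treated as True.
as_bool = raw.view(np.bool_)

# Casting the bool view straight to float (the explicit-dtype path)
# loses the missing value: it becomes an ordinary number, not NaN.
buggy = as_bool.astype(np.float64)


def upcast_bool_to_float(bool_arr):
    """Hypothetical helper: restore missing values before casting to float."""
    as_uint8 = bool_arr.view(np.uint8)        # look at the raw bytes again
    out = as_uint8.astype(np.float64)         # 0 -> 0.0, 1 -> 1.0, 255 -> 255.0
    out[as_uint8 == NA_SENTINEL] = np.nan     # replace the sentinel with NaN
    return out


fixed = upcast_bool_to_float(as_bool)
print(buggy)
print(fixed)
```

The key point is that the uint8 view has to be consulted before any cast away from bool, because after the cast the sentinel can no longer be told apart from True.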

As I said above, this is the spot in the code where the correction made the most sense to me. I feel that a more robust fix would be to rewrite the _convert_tokens() function (and the methods it calls, and so on) to return the positions of the missing values as well as the parsed values, but that would be a large change.


joelgibson · 2021-08-10T05:32:22Z

I've added a case pointed out in the issue for converting mixed-bools-and-nans to int, which was essentially the same issue as for float. It would be nice to have all of this type conversion logic in a single place in the parser, rather than spread out over various functions and conditionals. Is the type conversion logic (or rather, just what should actually happen) written down anywhere? For instance, one thing I found surprising is that when converting a CSV of 0\nNaN\nTrue\nFalse to a dataframe with no dtype, we get back ([nan, True, False], dtype=object), but when specifying dtype=object in the call to read_csv we get back ([nan, 'True', 'False'], dtype=object), which is a different result.
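
A small script along these lines should reproduce the comparison described above. It assumes the first line of the CSV ("0") is read as the column header, and it prints the results rather than asserting them, since the exact values may differ between pandas versions.

```python
import io

import pandas as pd

# Header "0", followed by three rows: NaN, True, False.
CSV_TEXT = "0\nNaN\nTrue\nFalse\n"

# No dtype given: the parser infers the column type.
inferred = pd.read_csv(io.StringIO(CSV_TEXT))

# Explicit dtype=object: the parser is told not to convert the values.
as_object = pd.read_csv(io.StringIO(CSV_TEXT), dtype=object)

print(inferred["0"].tolist(), inferred["0"].dtype)
print(as_object["0"].tolist(), as_object["0"].dtype)
```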

jreback · 2021-08-10T21:59:41Z

pandas/_libs/parsers.pyx

+                return col_res, na_count
+
+            # Similar special case for bool => int.
+            if col_res.dtype == np.bool_ and is_integer_dtype(col_dtype):


can you just do a similar check like on L1125? (e.g. astype then check)?

jreback · 2021-08-10T21:59:57Z

cc @gfyoung

gfyoung · 2021-08-10T22:12:38Z

@jreback comment(s) notwithstanding, these changes look good.

phofl · 2021-08-23T13:34:19Z

ping @joelgibson

github-actions · 2021-09-23T00:03:11Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

mroeschke · 2021-10-16T20:41:26Z

Thanks for the pull request, but it appears to have gone stale. Closing for now, but if interested in continuing please address the comments and merge conflict and we can reopen.

joelgibson added 5 commits August 9, 2021 16:25

BUG: read_csv with mixed bools and NaNs sometimes reads NaNs as 1.0 (p…

1db0f62

…andas-dev#42808)

Merge remote-tracking branch 'upstream/master' into csv-bool-bug

c950ad3

Merge remote-tracking branch 'upstream/master' into csv-bool-bug

addd297

Updated for the mixed-bool => int case.

da459d1

Merge remote-tracking branch 'origin/csv-bool-bug' into csv-bool-bug

3da1147

jreback requested changes Aug 10, 2021


jreback added the Dtype Conversions (Unexpected or buggy dtype conversions), Error Reporting (Incorrect or improved errors from pandas), and IO CSV (read_csv, to_csv) labels Aug 10, 2021

jreback added this to the 1.4 milestone Aug 10, 2021

mzeitlin11 mentioned this pull request Aug 23, 2021

BUG: read_csv with mixed bools and int sometimes reads 1 as a bool function. #43179

Closed

4 tasks

github-actions bot added the Stale label Sep 23, 2021

mroeschke closed this Oct 16, 2021

phofl mentioned this pull request Dec 15, 2021

BUG: read_csv converting nans to 1 when casting bools to float #44901

Merged

5 tasks
