Skip to content

BUG: read_csv with mixed bools and NaNs sometimes reads NaNs as 1.0 (#42808) #42944

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

joelgibson
Copy link
Contributor

@joelgibson joelgibson commented Aug 9, 2021

When reading a column of mixed booleans and missing values into a specific floating dtype, for example read_csv(..., dtype='float'), missing values were being incorrectly converted to 1.0 rather than NaN. This patch fixes this issue.

I've never been in the CSV parsing code before, so I've tried to put this change in the spot that makes the most sense and causes the least impact, but corrections are very welcome.

Why this issue was happening is that the _try_bool_flex() function parses a mixed column of bools and missing values into a uint8 array, storing False/True/missing as 0/1/255. It then returns a bool view of this array, "hiding" the missing values and making them just look like True. These missing values are meant to be later 'unhidden' by a call to _maybe_upcast(), which views the array as uint8 again, and maps to a dtype=object array containing False/True/NaN.

However, when the user specifies an explicit float dtype, this upcast is skipped since the bool array gets converted to a float array before hitting _maybe_upcast(). This patch just replicates the upcast logic in this case.

Like I said above, this is the spot in the code that made the most sense to me to put the correction - I feel like a more robust fix would be to rewrite the _convert_tokens() function (and the methods it calls, and so on) to actually return the positions of the missing values, as well as the parsed values, which would be a large change.

@joelgibson
Copy link
Contributor Author

I've added a case pointed out in the issue for converting mixed-bools-and-nans to int, which was essentially the same issue as for float. It would be nice to actually have all of this type conversion logic in a single place in the parser, rather than spread out over various functions and conditionals. Is the type conversion logic (or rather, just should actually happen) written down anywhere? For instance one thing I found surprising is that when converting a CSV of 0\nNaN\nTrue\nFalse to a dataframe with no dtype, we get back ([nan, True, False], dtype=object), but when actually specifying dtype=object in the call to read_csv we get back ([nan, 'True', 'False'], dtype=object), which is a different result.

return col_res, na_count

# Similar special case for bool => int.
if col_res.dtype == np.bool_ and is_integer_dtype(col_dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just do a similar check like on L1125? (e.g. astype then check)?

@jreback
Copy link
Contributor

jreback commented Aug 10, 2021

cc @gfyoung

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv labels Aug 10, 2021
@jreback jreback added this to the 1.4 milestone Aug 10, 2021
@gfyoung
Copy link
Member

gfyoung commented Aug 10, 2021

@jreback comment(s) notwithstanding, these changes look good.

@phofl
Copy link
Member

phofl commented Aug 23, 2021

ping @joelgibson

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Sep 23, 2021
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. Closing for now, but if interested in continuing please address the comments and merge conflict and we can reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv Stale
Projects
None yet
5 participants