Skip to content

GH20591 read_csv raise ValueError for bool columns with missing values (C engine) #23968

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Dec 2, 2018

Conversation

JustinZhengBC
Copy link
Contributor

@JustinZhengBC JustinZhengBC commented Nov 28, 2018

As explained in the referenced issue, trying to cast missing values to bool dtype should result in a ValueError similar to when casting to int

@pep8speaks
Copy link

pep8speaks commented Nov 28, 2018

Hello @JustinZhengBC! Thanks for updating the PR.

Comment last updated on November 29, 2018 at 07:14 Hours UTC

@codecov
Copy link

codecov bot commented Nov 28, 2018

Codecov Report

Merging #23968 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23968      +/-   ##
==========================================
- Coverage   92.31%   92.31%   -0.01%     
==========================================
  Files         161      161              
  Lines       51513    51554      +41     
==========================================
+ Hits        47554    47591      +37     
- Misses       3959     3963       +4
Flag Coverage Δ
#multiple 90.71% <100%> (-0.01%) ⬇️
#single 42.44% <0%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.38% <100%> (+0.01%) ⬆️
pandas/core/arrays/timedeltas.py 96% <0%> (-0.26%) ⬇️
pandas/core/indexes/base.py 96.32% <0%> (-0.17%) ⬇️
pandas/core/config.py 87.04% <0%> (-0.13%) ⬇️
pandas/io/sas/sas_xport.py 90.14% <0%> (-0.1%) ⬇️
pandas/io/formats/printing.py 93.01% <0%> (-0.08%) ⬇️
pandas/plotting/_core.py 83.58% <0%> (-0.05%) ⬇️
pandas/core/computation/align.py 97.84% <0%> (-0.05%) ⬇️
pandas/core/reshape/melt.py 97.5% <0%> (-0.05%) ⬇️
pandas/core/dtypes/concat.py 96.63% <0%> (-0.04%) ⬇️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 43b2dab...b8bca36. Read the comment docs.

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv labels Nov 28, 2018
@JustinZhengBC JustinZhengBC force-pushed the BUG-20591 branch 2 times, most recently from ea1ec6f to b176a72 Compare November 28, 2018 19:42
@JustinZhengBC JustinZhengBC changed the title GH20591 read_csv raise ValueError for bool columns with missing values GH20591 read_csv raise ValueError for bool columns with missing values (C engine) Nov 28, 2018
@JustinZhengBC
Copy link
Contributor Author

Currently the python engine delegates type conversion to np.ndarray.astype in this case. Should it be overridden in this case for consistency with the C engine, or left as is so numpy users don't get unexpected behaviour?

@jreback jreback added this to the 0.24.0 milestone Nov 28, 2018
@jreback
Copy link
Contributor

jreback commented Nov 28, 2018

Currently the python engine delegates type conversion to np.ndarray.astype in this case. Should it be overridden in this case for consistency with the C engine, or left as is so numpy users don't get unexpected behaviour?

@gfyoung ?

@gfyoung
Copy link
Member

gfyoung commented Nov 28, 2018

Currently the python engine delegates type conversion to np.ndarray.astype in this case. Should it be overridden in this case for consistency with the C engine, or left as is so numpy users don't get unexpected behaviour?

I would consider current behavior to be a bug. Let's override so that we have consistent behavior.

@JustinZhengBC JustinZhengBC force-pushed the BUG-20591 branch 2 times, most recently from 26804d5 to 6c67491 Compare November 29, 2018 00:53
@JustinZhengBC
Copy link
Contributor Author

@gfyoung I managed to do it in _convert_data. It requires iterating over the columns of data though (which happens again in _convert_to_ndarrays). I needed the try/catch block because is_bool_dtype throws ValueErrors on the line dtype._is_boolean when given inputs like "category" and "foo". Is that intended behaviour?

Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback jreback merged commit fe2969e into pandas-dev:master Dec 2, 2018
@jreback
Copy link
Contributor

jreback commented Dec 2, 2018

thanks @JustinZhengBC

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_csv ignores dtype for bool columns with missing values
4 participants