Skip to content

Fixed read_csv with CategoricalDtype with boolean categories (20498) #20826

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger added IO CSV read_csv, to_csv Categorical Categorical Data Type labels Apr 25, 2018
@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Apr 25, 2018
@TomAugspurger TomAugspurger force-pushed the categorical-bool-fixed branch from 6b5cc4f to 5495551 Compare April 25, 2018 22:01
@TomAugspurger TomAugspurger changed the title Fixed read_csv with CategoricalDtype with boolena categories (20498) Fixed read_csv with CategoricalDtype with boolean categories (20498) Apr 26, 2018
@codecov
Copy link

codecov bot commented Apr 26, 2018

Codecov Report

Merging #20826 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20826      +/-   ##
==========================================
+ Coverage   92.29%   92.29%   +<.01%     
==========================================
  Files         161      161              
  Lines       51497    51502       +5     
==========================================
+ Hits        47530    47535       +5     
  Misses       3967     3967
Flag Coverage Δ
#multiple 90.69% <100%> (ø) ⬆️
#single 42.42% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.29% <ø> (ø) ⬆️
pandas/core/arrays/categorical.py 95.38% <100%> (+0.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e5c90e5...f96a854. Read the comment docs.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Apr 27, 2018 via email

@jreback
Copy link
Contributor

jreback commented Apr 27, 2018

then push to 0.23.1 (alternatively as @jorisvandenbossche mentoined, can make a 0.23rc milestone tag)

@jreback jreback modified the milestones: 0.23.0, 0.23.1 Apr 27, 2018
@jreback jreback removed this from the 0.23.1 milestone Jun 4, 2018
@jorisvandenbossche
Copy link
Member

@TomAugspurger do you have time to look at this one? (I don't know how far off it is)

@teto
Copy link

teto commented Jul 4, 2018

With pandas 0.23, I have the following code where the csv is correctly generated (to_csv) but the read_csv (with appropriate dtypes) will fill the column with NaN

class ConnectionRoles(Enum):
    Client = 0
    Server = 1

dtype_role = pd.api.types.CategoricalDtype(categories=ConnectionRoles, ordered=True)

does this PR address that ?

@TomAugspurger
Copy link
Contributor Author

Possibly. If you can provide a unit test I can include it or you can make a followup PR including it.

@jreback jreback added this to the 0.24.0 milestone Jul 5, 2018
teto added a commit to teto/pymptcpanalyzer that referenced this pull request Aug 6, 2018
... around to_csv/read_csv and pd.api.types.CategoricalDtype

for instance using
pd.api.types.CategoricalDtype(categories=ConnectionRoles, ordered=True)
to_csv will correctly write the column but read_csv will fail expecting
a string pandas-dev/pandas#20826.

I ended up writing my own converter
@jreback
Copy link
Contributor

jreback commented Sep 25, 2018

IIRC this was fine, can you rebase

@pep8speaks
Copy link

Hello @TomAugspurger! Thanks for updating the PR.

@jreback
Copy link
Contributor

jreback commented Oct 14, 2018

this looks good; haven’t looked at what is failing

@jreback
Copy link
Contributor

jreback commented Nov 18, 2018

@TomAugspurger can you merge master

@jreback
Copy link
Contributor

jreback commented Nov 23, 2018

@gfyoung if you'd like to rebase this, maybe can get this in

@gfyoung
Copy link
Member

gfyoung commented Nov 23, 2018

Yikes! Looks like this one got lost in the myriad of other PR's. Sure thing.

Previously, was being parsed as object instead of boolean.

Closes pandas-devgh-20498.

Original Author: @TomAugspurger
Rebased by @gfyoung due to merge conflicts.
@gfyoung gfyoung force-pushed the categorical-bool-fixed branch from a1f64ff to f96a854 Compare November 23, 2018 10:39
@gfyoung
Copy link
Member

gfyoung commented Nov 23, 2018

@jreback @TomAugspurger : Rebased this PR. Wanted to note a couple of things (also helps to address many of the outstanding comments that were still current):

  • I personally was not a fan of moving an entire Categorical method into parsers.pyx. That IMO seems to do the kind of conflating that was pushed back against earlier. I was able to get things to work without making this migration. At the same time, the "parsing" of true_values is relegated to parsers.pyx before being passed to Categorical._from_inferred_categories.

One extra thing I had to do was check for CategoricalDtype before attempting to do dtype inference with other types (e.g. bool, int). This was needed to get all engines to pass.

  • I'm okay with having a single whatsnew entry line. I reworded it a little to make things clearer.

  • Not sure we need a version added on the docstring, given that Categorical._from_inferred_categories is a private method by naming (starts with an underscore).

@jreback
Copy link
Contributor

jreback commented Nov 23, 2018

looks ok to me. @TomAugspurger if you'd have a look.

@gfyoung
Copy link
Member

gfyoung commented Nov 27, 2018

@TomAugspurger : Any thoughts on this? I think we're waiting on your review before merging.

@TomAugspurger
Copy link
Contributor Author

Sorry for the delay, thanks for rebasing and fixing things up.

@TomAugspurger TomAugspurger merged commit d53a4cc into pandas-dev:master Nov 27, 2018
@TomAugspurger TomAugspurger deleted the categorical-bool-fixed branch November 27, 2018 16:17
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Previously, was being parsed as object instead of boolean.

Closes pandas-devgh-20498.

Original Author: @TomAugspurger
Rebased by @gfyoung due to merge conflicts.
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Previously, was being parsed as object instead of boolean.

Closes pandas-devgh-20498.

Original Author: @TomAugspurger
Rebased by @gfyoung due to merge conflicts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Force boolean column to category while reading a csv
6 participants