documentation of read_csv producing NaN floats in string column #20875

jowagner · 2018-04-30T09:13:54Z

There are quite a few issue reports about the surprising behaviour of read_csv with dtype=str when it reads empty cells or cells containing one of the special NaN strings, such as n/a, null and NA. This behaviour should be documented in the description of the "dtype" parameter, as that is what most people who encounter a type error will read. Ideally, there should also be pointers to workarounds; for example, setting na_values = [] and keep_default_na = False seems to fix the problem for non-empty strings.
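A minimal sketch of the behaviour and the workaround described above. The sample CSV and the variable names are made up for illustration, and the exact handling of empty cells may differ between pandas versions, so only the non-empty markers are discussed here.

```python
import io

import pandas as pd

# Hypothetical sample data: "NA" and "null" are meant as literal strings,
# and row 3 has an empty cell.
csv_text = "id,label\n1,apple\n2,NA\n3,\n4,null\n"

# Default behaviour: dtype=str does not switch off NaN detection, so the
# special strings and the empty cell are read as missing values.
default_df = pd.read_csv(io.StringIO(csv_text), dtype=str)
default_missing = int(default_df["label"].isna().sum())

# Workaround from the report: drop the default NaN markers and add none.
kept_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype=str,
    na_values=[],
    keep_default_na=False,
)
kept_labels = kept_df["label"].tolist()

print("missing with defaults:", default_missing)
print("labels with workaround:", kept_labels)
```

With the defaults, the column mixes strings and missing values; with the workaround, "NA" and "null" come back as the literal strings.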

Related (incomplete list): issue #4849, issue #10205, issue #10647, issue #16569, issue #15669

WillAyd · 2018-04-30T18:30:59Z

Most of the linked issues have been closed and the read_csv documentation touches on this information in the na_values parameter description.

Maybe I'm just misunderstanding what you are asking for - do you have a code example of the behavior you feel isn't documented?

TomAugspurger · 2018-05-14T19:16:42Z

Closed by #20895.

DanTup · 2019-07-01T15:58:15Z

FWIW I'm a complete Python/pandas noob and I just hit this. Although the docs do mention na_values, I didn't really understand from them what was going on. I parsed a CSV with dtype=np.str, but when I tried to use the column in some Keras code I got AttributeError: 'float' object has no attribute 'lower'.

It seems counter-intuitive when setting the column type to string that the column ends up with floats (they're NaNs, but that's still breaking for something expecting strings), but if it's intended, I don't think the docs make it very clear (at least for the inexperienced like me).

Edit: Passing na_filter=False seems to provide the behaviour I'd expect, so maybe the dtype docs should mention that in addition to na_values?
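For anyone else landing here, a small sketch of the error and of the na_filter=False setting mentioned above. The CSV content and variable names are invented for the example; it is an illustration, not the code from the original report.

```python
import io

import pandas as pd

# Hypothetical single-column CSV where "NA" is a legitimate text value.
csv_text = "text\nHello\nNA\nWorld\n"

# With the defaults, "NA" is read as NaN even though dtype=str was requested,
# so calling a string method on every value fails.
default_col = pd.read_csv(io.StringIO(csv_text), dtype=str)["text"]
try:
    [value.lower() for value in default_col]
    default_error = None
except AttributeError as exc:
    default_error = str(exc)

# na_filter=False disables missing-value detection entirely,
# so every cell stays a plain string.
unfiltered_col = pd.read_csv(
    io.StringIO(csv_text), dtype=str, na_filter=False
)["text"]
lowered = [value.lower() for value in unfiltered_col]

print("error with defaults:", default_error)
print("lowered with na_filter=False:", lowered)
```

Note that na_filter=False applies to the whole file, so columns that do rely on NaN detection would need to be handled separately.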

jowagner · 2019-08-06T17:38:03Z

@DanTup Here's how to make it easy for the developers to accept this change:

1. Fork the pandas repo
2. Locate the same files I changed in https://github.com/jowagner/pandas/pull/1/files
3. Make your changes, e.g. adding "or na_filter" between "na_values" and "settings". In the github interface, this asks you to create a new branch. That's fine. Call it for example readcsv-dtype.
4. Make a pull request master <-- readcsv-dtype
5. Accept the request
6. Make a pull request to pandas-dev:master (I don't remember whether I did this from my master or from the new branch. Steps 4 and 5 may not be needed.)
7. Watch your e-mail or github notifications in case the developers want you to make changes to your pull request

This was referenced May 1, 2018

feature request: allow NaN handling to be specified per column in read_csv() #20877

Closed

Mention NaN handling in dtype description jowagner/pandas#1

Merged

Mention NaN handling in dtype description #20895

Merged

jreback added Docs IO CSV read_csv, to_csv labels May 1, 2018

jreback added this to the 0.23.0 milestone May 1, 2018

jreback modified the milestones: 0.23.0, Next Major Release May 9, 2018

TomAugspurger closed this as completed May 14, 2018
