documentation of read_csv producing NaN floats in string column #20875

Closed
jowagner opened this issue Apr 30, 2018 · 4 comments
Labels
Docs IO CSV read_csv, to_csv

Comments

@jowagner
Contributor

jowagner commented Apr 30, 2018

There are quite a few issue reports about the surprising behaviour of read_csv when it is called with dtype=str and reads empty cells and/or cells containing one of the special NaN strings, such as n/a, null and NA. This behaviour should be documented in the description of the "dtype" parameter, as this is what most people who encounter a type error will read. Ideally, there should also be pointers to workarounds; for example, setting na_values = [] and keep_default_na = False seems to fix the problem with non-empty strings.

Related (incomplete list): issue #4849, issue #10205, issue #10647, issue #16569, issue #15669
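
For readers landing here, a minimal sketch of the behaviour and of the workaround mentioned above. The CSV text and column names are made up for illustration, and the exact representation of the missing values may differ between pandas versions, so the sketch only checks which cells are flagged as missing.

```python
import io

import pandas as pd

# Hypothetical sample data: one cell holds the special string "n/a",
# one cell is empty, and one cell holds ordinary text.
CSV_TEXT = "name,comment\nalice,n/a\nbob,\ncarol,hello\n"

# Default settings: dtype=str does not switch off NA detection, so the
# "n/a" cell and the empty cell are both reported as missing.
default = pd.read_csv(io.StringIO(CSV_TEXT), dtype=str)
print(default["comment"].isna().tolist())

# Workaround from this report: no custom NA strings and no default ones.
no_defaults = pd.read_csv(
    io.StringIO(CSV_TEXT),
    dtype=str,
    na_values=[],
    keep_default_na=False,
)
print(no_defaults["comment"].isna().tolist())
print(no_defaults["comment"].tolist())
```

With the defaults disabled, the "n/a" cell should come back as the literal string rather than as a missing value.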

@WillAyd
Member

WillAyd commented Apr 30, 2018

Most of the linked issues have been closed and the read_csv documentation touches on this information in the na_values parameter description.

Maybe I'm just misunderstanding what you are asking for - do you have a code example of the behavior you feel isn't documented?

@TomAugspurger
Contributor

Closed by #20895.

@DanTup

DanTup commented Jul 1, 2019

FWIW I'm a complete Python/pandas noob and I just hit this. Although the docs do mention na_values, I didn't really understand from that what was going on. I parsed a CSV and set dtype=np.str, but when I tried to use the column in some Keras code I got AttributeError: 'float' object has no attribute 'lower'.

It seems counter-intuitive when setting the column type to string that the column ends up with floats (they're NaNs, but that's still breaking for something expecting strings), but if it's intended, I don't think the docs make it very clear (at least for the inexperienced like me).

Edit: Passing na_filter=False seems to provide the behaviour I'd expect, so maybe the dtype docs should mention that in addition to na_values?
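
A rough sketch of what this looks like in practice, assuming a small made-up CSV and a hypothetical helper (lowercase_labels) standing in for the downstream code that expects strings:

```python
import io

import pandas as pd

# Hypothetical sample data: an ordinary label, an empty cell, and the
# special string "NA".
CSV_TEXT = "id,label\n1,Cat\n2,\n3,NA\n"


def lowercase_labels(frame):
    """Stand-in for downstream code that calls str methods on each cell."""
    return [value.lower() for value in frame["label"]]


# Default settings: the empty cell and "NA" are read as missing values,
# so calling .lower() on them is expected to fail.
with_filter = pd.read_csv(io.StringIO(CSV_TEXT), dtype=str)
try:
    lowercase_labels(with_filter)
except AttributeError as error:
    print("default settings:", error)

# na_filter=False turns off NA detection entirely, so every cell in the
# column stays a string, including the empty one.
without_filter = pd.read_csv(io.StringIO(CSV_TEXT), dtype=str, na_filter=False)
print(lowercase_labels(without_filter))
```

Note that na_filter=False applies to the whole file, so numeric columns lose their missing-value detection as well; the na_values/keep_default_na route gives finer control when that matters.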

@jowagner
Contributor Author

jowagner commented Aug 6, 2019

@DanTup Here's how to make it easy for the developers to accept this change:

  1. Fork the pandas repo
  2. Locate the same files I changed in https://github.com/jowagner/pandas/pull/1/files
  3. Make your changes, e.g. adding "or na_filter" between "na_values" and "settings". In the GitHub interface, this asks you to create a new branch. That's fine. Call it, for example, readcsv-dtype.
  4. Make a pull request master <-- readcsv-dtype
  5. Accept the request
  6. Make a pull request to pandas-dev:master (I don't remember whether I did this from my master or from the new branch. Steps 4 and 5 may not be needed.)
  7. Watch your e-mail or GitHub notifications in case the developers want you to make changes to your pull request


5 participants