read_csv() ignores na_filter=False for index columns #7518

Closed
dlenski opened this issue Jun 19, 2014 · 6 comments
Labels
Bug Duplicate Report Duplicate issue or pull request IO CSV read_csv, to_csv

Comments

@dlenski

dlenski commented Jun 19, 2014

Using 0.14.0. pandas.io.parsers.read_csv is supposed to skip NA detection when na_filter=False, leaving blank-looking values as empty strings, but it does not do this for index_col columns.

foo.csv:

fruit,size,sugar
apples,medium,2
pear,medium,3
grape,small,4
durian,,1

The default behavior gives a dataframe with a NaN in place of the empty value from this last row:

df = pd.io.parsers.read_csv("foo.csv")

This gives the same dataframe with a blank string instead of a NaN. So far so good:

df = pd.io.parsers.read_csv("foo.csv", na_filter=False)
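
For reference, the two calls above can be reproduced without a file on disk. This is only a sketch: it assumes a recent pandas where pd.read_csv is the public entry point, and CSV_TEXT is a name introduced here to hold the same rows as foo.csv.

```python
import io

import pandas as pd

# Same rows as foo.csv, kept in memory so the snippet is self-contained.
# CSV_TEXT is a name introduced for this sketch.
CSV_TEXT = """fruit,size,sugar
apples,medium,2
pear,medium,3
grape,small,4
durian,,1
"""

# Default parsing: the empty "size" field on the durian row is read as NaN.
default_df = pd.read_csv(io.StringIO(CSV_TEXT))

# na_filter=False: NA detection is skipped, so the field stays an empty string.
raw_df = pd.read_csv(io.StringIO(CSV_TEXT), na_filter=False)

print(default_df["size"].isna().sum())  # count of missing values in "size"
print(repr(raw_df.loc[3, "size"]))      # the durian row's "size" value
```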

My expectation was that this next version would give a dataframe with no NaN values in the index, but it does not:

df = pd.io.parsers.read_csv("foo.csv", index_col=['fruit','size'], na_filter=False)
print df
=>                sugar
   fruit  size         
   apples medium      2
   pear   medium      3
   grape  small       4
   durian NaN         1

Because it unexpectedly includes NaNs, I've been fighting with issue 4862 in unstack for hours :-(.

In order to get the desired behavior, a DF with no NaNs in the index, I have to read the data without a multi-index, then set_index afterwards:

df = pd.io.parsers.read_csv("foo.csv", na_filter=False)
df = df.set_index(['fruit','size'])
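
A runnable version of this workaround, as a minimal sketch. It reads the same rows from an in-memory string instead of foo.csv; CSV_TEXT, indexed and size_level are names introduced here. Note that set_index returns a new frame, so the result has to be assigned.

```python
import io

import pandas as pd

# Same rows as foo.csv; CSV_TEXT is a name introduced for this sketch.
CSV_TEXT = """fruit,size,sugar
apples,medium,2
pear,medium,3
grape,small,4
durian,,1
"""

# Step 1: parse with NA detection off and without any index columns.
df = pd.read_csv(io.StringIO(CSV_TEXT), na_filter=False)

# Step 2: build the MultiIndex afterwards and keep the returned frame.
indexed = df.set_index(["fruit", "size"])

# The "size" level keeps the empty string for the durian row.
size_level = indexed.index.get_level_values("size")
print(list(size_level))
print(size_level.isna().any())
```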

As a temporary fix, perhaps the documentation ought to clarify the behavior of na_filter with respect to index_col.

@jreback
Contributor

jreback commented Jun 20, 2014

I'll mark it as a bug, but the 2nd soln looks fine to me. Trying to have the parser do too much is in general a problem IMHO.

@jreback jreback added this to the 0.15.0 milestone Jun 20, 2014
@dlenski
Author

dlenski commented Jun 20, 2014

@jreback, the parser already knows how to distinguish NaNs, or not to distinguish them, right? Isn't that what na_filter is for?

The obvious user expectation is that index_col should have the same effect as calling set_index afterwards. The fact that it interacts with the behavior of na_filter is both surprising (at odds with the reasonable expected behavior) and unmentioned in the docs.
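
That expectation can be written down as a small check. The helper below is hypothetical (index_col_matches_set_index is not part of pandas) and only a sketch: it parses the same text twice, once with index_col and once with set_index afterwards, and reports whether the two indexes agree. On 0.14.0, as reported here, the na_filter=False case would disagree; on a pandas release that includes the fix referenced later in this thread it should presumably agree, but that is an assumption to verify locally.

```python
import io

import pandas as pd

# Same rows as foo.csv; CSV_TEXT is a name introduced for this sketch.
CSV_TEXT = """fruit,size,sugar
apples,medium,2
pear,medium,3
grape,small,4
durian,,1
"""


def index_col_matches_set_index(csv_text, index_cols, **read_kwargs):
    """Hypothetical helper: compare index_col against set_index after parsing."""
    # Let the parser build the index directly.
    via_parser = pd.read_csv(
        io.StringIO(csv_text), index_col=index_cols, **read_kwargs
    )
    # Parse plain columns first, then build the index from them.
    via_set_index = pd.read_csv(io.StringIO(csv_text), **read_kwargs).set_index(
        index_cols
    )
    # Index.equals treats NaNs in matching positions as equal.
    return bool(via_parser.index.equals(via_set_index.index))


result = index_col_matches_set_index(CSV_TEXT, ["fruit", "size"], na_filter=False)
print(result)
```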

@jreback
Contributor

jreback commented Jun 20, 2014

I marked it as a bug. You are welcome to do a pull-request. My point was that there are close to 50 options for the parser, so there are obviously some untested paths.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@tommycarstensen

This bug has been fixed and the issue can be closed.

@jreback
Contributor

jreback commented Dec 22, 2017

@gfyoung do we have a test for this?

@gfyoung
Member

gfyoung commented Dec 22, 2017

This is a dupe of #5239. Closed by #18127 (so yes, there is a test).

@gfyoung gfyoung closed this as completed Dec 22, 2017
@gfyoung gfyoung modified the milestones: Next Major Release, No action Dec 22, 2017
@gfyoung gfyoung added the Duplicate Report Duplicate issue or pull request label Dec 22, 2017