read_csv() ignores na_filter=False for index columns #7518

dlenski · 2014-06-19T23:49:06Z

Using pandas 0.14.0. pandas.io.parsers.read_csv is supposed to skip NaN detection and leave blank-looking values as they are when na_filter=False, but it does not do this for index_col columns.

foo.csv:

fruit,size,sugar
apples,medium,2
pear,medium,3
grape,small,4
durian,,1

The default behavior gives a dataframe with a NaN in place of the empty value from this last row:

df = pd.io.parsers.read_csv("foo.csv")

This gives the same dataframe with a blank string instead of a NaN. So far so good:

df = pd.io.parsers.read_csv("foo.csv", na_filter=False)

My expectation was that this next version would give a dataframe with no NaN values in the index, but it does not:

df = pd.io.parsers.read_csv("foo.csv", index_col=['fruit','size'], na_filter=False)
print(df)
=>                sugar
   fruit  size         
   apples medium      2
   pear   medium      3
   grape  small       4
   durian NaN         1

Because it unexpectedly includes NaNs, I've been fighting with issue 4862 in unstack for hours :-(.

In order to get the desired behavior, a DF with no NaNs in the index, I have to read the data without a multi-index, then set_index afterwards:

df = pd.io.parsers.read_csv("foo.csv", na_filter=False)
df = df.set_index(['fruit','size'])
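
For anyone who wants to reproduce the workaround without creating foo.csv on disk, here is a minimal self-contained sketch. It assumes a reasonably recent pandas (for `isna` and `MultiIndex.to_frame`); the `CSV_TEXT` and `index_frame` names are just for illustration.

```python
import io

import pandas as pd

# Same contents as foo.csv above, held in memory.
CSV_TEXT = """fruit,size,sugar
apples,medium,2
pear,medium,3
grape,small,4
durian,,1
"""

# Read without index_col so na_filter=False applies to every column,
# then build the MultiIndex from the already-parsed columns.
df = pd.read_csv(io.StringIO(CSV_TEXT), na_filter=False)
df = df.set_index(["fruit", "size"])

# Flatten the index levels into a frame so missing values can be checked.
# The blank size for "durian" stays an empty string rather than NaN.
index_frame = df.index.to_frame(index=False)
print(index_frame.isna().any().any())
print(df)
```

Note that set_index returns a new DataFrame by default, so the result has to be assigned back.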

As a temporary fix, perhaps the documentation ought to clarify the behavior of na_filter with respect to index_col.

jreback · 2014-06-20T00:02:10Z

I'll mark it as a bug, but the 2nd soln looks fine to me. Trying to have the parser do too much is in general a problem IMHO.

dlenski · 2014-06-20T00:08:37Z

@jreback, the parser already knows how to distinguish NaNs, or not to distinguish them, right? Isn't that what na_filter is for?

The obvious user expectation is that index_col should have the same effect as calling set_index afterwards. The fact that it interacts with the behavior of na_filter is both surprising (at odds with the reasonable expected behavior) and unmentioned in the docs.
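
The expectation can be checked directly by reading the data both ways and comparing. This is only a sketch: `count_index_nans` is a hypothetical helper, not a pandas function, and it assumes a recent pandas where `isna` is available. What the index_col path prints depends on the pandas version installed, so no expected output is shown for it.

```python
import io

import pandas as pd

CSV_TEXT = """fruit,size,sugar
apples,medium,2
pear,medium,3
grape,small,4
durian,,1
"""


def count_index_nans(frame):
    """Count missing values across all levels of the frame's index (hypothetical helper)."""
    return int(frame.index.to_frame(index=False).isna().sum().sum())


# Path 1: let the parser build the MultiIndex.
via_index_col = pd.read_csv(
    io.StringIO(CSV_TEXT), index_col=["fruit", "size"], na_filter=False
)

# Path 2: parse plain columns first, then build the MultiIndex.
via_set_index = pd.read_csv(io.StringIO(CSV_TEXT), na_filter=False).set_index(
    ["fruit", "size"]
)

print("index_col NaNs:", count_index_nans(via_index_col))
print("set_index NaNs:", count_index_nans(via_set_index))
print("frames equal:", via_index_col.equals(via_set_index))
```

If the two paths behave the same, both counts are 0 and the frames compare equal; a nonzero count for the index_col path reproduces the behavior reported here.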

jreback · 2014-06-20T00:26:16Z

I marked it as a bug. You are welcome to do a pull-request. My point was that there are close to 50 options for the parser, so there are obviously some untested paths.

tommycarstensen · 2017-12-22T15:30:39Z

This bug has been fixed and the issue can be closed.

jreback · 2017-12-22T18:56:00Z

@gfyoung do we have a test for this?

gfyoung · 2017-12-22T19:18:44Z

This is a dupe of #5239. Closed by #18127 (so yes, there is a test).

jreback added Bug labels Jun 20, 2014

jreback added this to the 0.15.0 milestone Jun 20, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

gfyoung closed this as completed Dec 22, 2017

gfyoung modified the milestones: Next Major Release, No action Dec 22, 2017

gfyoung added the Duplicate Report label Dec 22, 2017
