Bug in read_csv when passing header kwarg? #4702

jtratner · 2013-08-28T20:49:09Z

Seems to lose a row when passing a list as header, check it out:

>>> from StringIO import StringIO
>>> import pandas
>>> text = """
a  a  a  b  c  c
q r s t u v
1 2 3 4 5 6
1 2 3 4 5 6
"""
>>> # expected behavior without header
>>> pandas.read_csv(StringIO(text.strip()), sep='\s+')
   a a.1 a.2  b  c c.1
0  q   r   s  t  u   v
1  1   2   3  4  5   6
2  1   2   3  4  5   6
>>> #drops row  with specified header
>>> pandas.read_csv(StringIO(text.strip()), sep='\s+', header=[0, 1])
   (a, q)  (a, r)  (a, s)  (b, t)  (c, u)  (c, v)
0       1       2       3       4       5       6
>>> pandas.read_csv(StringIO(text.strip()), sep='\s+', header=[0, 1], tupleize_cols=False)
   a        b  c
   q  r  s  t  u  v
0  1  2  3  4  5  6

The text was updated successfully, but these errors were encountered:

cpcloud · 2013-08-28T20:51:08Z

interesting...wonder if this is related to the interaction between header and skiprows...

jtratner · 2013-08-28T20:51:18Z

Nevermind on the last one :P I added a blank line.

jtratner · 2013-08-28T20:52:47Z

So if you add extra rows, it specifically skips what ought to be the third row:

In [27]: text += """6 5 4 3 2 1
   ....: 6 5 4 3 2 1"""

In [28]:

In [28]: pandas.read_csv(StringIO(text.strip()), sep='\s+')
Out[28]:
   a a.1 a.2  b  c c.1
0  q   r   s  t  u   v
1  1   2   3  4  5   6
2  1   2   3  4  5   6
3  6   5   4  3  2   1
4  6   5   4  3  2   1

In [29]: pandas.read_csv(StringIO(text.strip()), sep='\s+', header=[0, 1], tupleize_cols=False)
Out[29]:
   a        b  c
   q  r  s  t  u  v
0  1  2  3  4  5  6
1  6  5  4  3  2  1
2  6  5  4  3  2  1

jtratner · 2013-08-28T20:54:33Z

It's stripping what should be the second row I guess::

In [33]: text = """
   ....: a  a  a  b  c  c
   ....: q r s t u v
   ....: 0 0 0 0 0 1
   ....: 1 2 3 4 5 6
   ....: 1 2 3 4 5 6
   ....: """

In [34]: pandas.read_csv(StringIO(text.strip()), sep='\s+')
Out[34]:
   a a.1 a.2  b  c c.1
0  q   r   s  t  u   v
1  0   0   0  0  0   1
2  1   2   3  4  5   6
3  1   2   3  4  5   6

In [35]: pandas.read_csv(StringIO(text.strip()), sep='\s+', header=[0, 1], tupleize_cols=False)
Out[35]:
   a        b  c
   q  r  s  t  u  v
0  1  2  3  4  5  6
1  1  2  3  4  5  6

jtratner · 2013-08-28T20:56:25Z

@cpcloud maybe it's an off-by-one error? (e.g., it passes a skiprows because of the header and it's passing one higher than it should)

jreback · 2013-08-28T20:59:52Z

no, this is expected, as you MUST have a row of row names (which it skips); I couldn't find a reliable way to detect that

jreback · 2013-08-28T21:01:54Z

http://pandas.pydata.org/pandas-docs/dev/io.html#reading-columns-with-a-multiindex

see the last note; I suppose in theory you could look for row names that don't have the full columns, but at the time seemed complicated.

coould be a bug too (though your format is not conforming to what its supposed to read).

jtratner · 2013-08-28T22:12:24Z

I just misread. I thought it needed repeats, but instead it's looking for whitespace

cancan101 · 2013-08-28T22:54:05Z

Is this related: #4382?

jtratner · 2013-08-28T23:20:25Z

@jreback no, I'm definitely just wrong. I read the csv intro as having repeated column names (instead of the empty column names to the right)

cancan101 · 2013-08-29T03:43:23Z

@jtratner I feel like I am seeing the same thing here (i.e. there is in fact a bug):

CSV:

a,,b,,c,
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2

and then:

In [66]: pd.read_csv("test2.csv",header=[0,1],tupleize_cols=False)
Out[66]: 
    a  Unnamed: 1_level_0   b  Unnamed: 3_level_0   c  Unnamed: 5_level_0
   a1                  a2  b1                  b2  c1                  c2
0   2                   2   2                   2   2                   2

even if I clean up the CSV:

a,a,b,b,c,c
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2

I still get:

In [67]: pd.read_csv("test3.csv",header=[0,1],tupleize_cols=False)
Out[67]: 
    a       b       c    
   a1  a2  b1  b2  c1  c2
0   2   2   2   2   2   2

jtratner · 2013-08-29T03:47:20Z

@cancan101 well, that's certainly not what I would expect. @jreback why does it need to have a row of row names?

jreback · 2013-08-29T04:13:16Z

you can try and fix if u want, but couldn't find a way to make a general parser that could parse to_csv format (that does have all row and col names) and a format that is what u have above, without the user specifying more information that the header. bottom line is that maybe the above format should raise as an invalid format (unless you can find a way to parse it) without sacrificing the rest of the cases

cancan101 · 2013-08-29T04:15:30Z

Why would the above format be invalid?

jreback · 2013-08-29T04:19:02Z

it's not invalid per se but how do you know that the 3rd row is supposed to be row names (and just over specified) or values?

csv format is just not descriptive enough to figure it out

and that is what to_csv outputs
it HAS to assume it is row names in order to be round trip able

cancan101 · 2013-08-29T04:23:20Z

I am confused here. The parser tries to read the index values (ie the row names) from the third row?

jreback · 2013-08-31T21:51:14Z

Here are the cases this was setup to read. It certainly can be changed to read a mi columns with out a row of row-names, but I think you would need to have a way to tell the reader to do this. IIRC there is no easy way to 'figure' it out. Any suggestions?

In [27]: df = DataFrame([[1,1,1,1,1,1],[2,2,2,2,2,2]],columns=pd.MultiIndex.from_arrays([['a','a','b','b','c','c'],['a1','a2','b1','b2','c1','c2']]))

In [28]: df
Out[28]: 
    a       b       c    
   a1  a2  b1  b2  c1  c2
0   1   1   1   1   1   1
1   2   2   2   2   2   2

In [34]: df.to_csv('test.csv',tupleize_cols=False,index=None)

In [35]: !cat test.csv
a,a,b,b,c,c
a1,a2,b1,b2,c1,c2
,,,,,
1,1,1,1,1,1
2,2,2,2,2,2

In [38]: pd.read_csv('test.csv',header=[0,1],index_col=None,tupleize_cols=False)
Out[38]: 
    a       b       c    
   a1  a2  b1  b2  c1  c2
0   1   1   1   1   1   1
1   2   2   2   2   2   2

In [43]: df.to_csv('test.csv',tupleize_cols=False)

In [44]: !cat test.csv
,a,a,b,b,c,c
,a1,a2,b1,b2,c1,c2
,,,,,,
0,1,1,1,1,1,1
1,2,2,2,2,2,2

In [45]: pd.read_csv('test.csv',header=[0,1],index_col=0,tupleize_cols=False)
Out[45]: 
    a       b       c    
   a1  a2  b1  b2  c1  c2
0   1   1   1   1   1   1
1   2   2   2   2   2   2

jtratner · 2013-08-31T22:21:04Z

@jreback I don't understand what you mean by "row of row names" - can you show an example of what the use case is for this? I would think that the far left column would contain the row labels (if any) and its headers would be the name for the index.

jreback · 2013-08-31T22:34:21Z

Meant row if index name (or names if the index is a multi-index too).

multi-index on columns

In [50]: df = DataFrame([[1,1,1,1,1,1],[2,2,2,2,2,2]],columns=pd.MultiIndex.from_arrays([['a','a','b','b','c','c'],['a1','a2','b1','b2','c1','c2']],names=['first','second']),index=Index([0,1],name='foo'))

In [51]: df.to_csv('test.csv',tupleize_cols=False)

In [52]: !cat test.csv
first,a,a,b,b,c,c
second,a1,a2,b1,b2,c1,c2
foo,,,,,,
0,1,1,1,1,1,1
1,2,2,2,2,2,2

In [56]: pd.read_csv('test.csv',header=[0,1],tupleize_cols=False,index_col=0)
Out[56]: 
first    a       b       c    
second  a1  a2  b1  b2  c1  c2
foo                           
0        1   1   1   1   1   1
1        2   2   2   2   2   2

multi index on both axes

In [57]: df = DataFrame([[1,1,1,1,1,1],[2,2,2,2,2,2]],columns=pd.MultiIndex.from_arrays([['a','a','b','b','c','c'],['a1','a2','b1','b2','c1','c2']],names=['first','second']),index=pd.MultiIndex.from_arrays([['A',0],['A',1]],names=['l1','l2']))

In [58]: df
Out[58]: 
first    a       b       c    
second  a1  a2  b1  b2  c1  c2
l1 l2                         
A  A     1   1   1   1   1   1
0  1     2   2   2   2   2   2

In [59]: df.to_csv('test.csv',tupleize_cols=False)

In [60]: !cat test.csv
first,,a,a,b,b,c,c
second,,a1,a2,b1,b2,c1,c2
l1,l2,,,,,,
A,A,1,1,1,1,1,1
0,1,2,2,2,2,2,2

In [61]: pd.read_csv('test.csv',header=[0,1],tupleize_cols=False,index_col=[0,1])
Out[61]: 
first    a       b       c    
second  a1  a2  b1  b2  c1  c2
l1 l2                         
A  A     1   1   1   1   1   1
0  1     2   2   2   2   2   2

jreback · 2013-08-31T22:36:40Z

So need to add an additional parameter to indicate that this 'extra' row need not be generated (in to_csv)
or skipped on the read back (in read_csv).

jtratner · 2013-08-31T22:54:00Z

@jreback wow, that's a crazy case. Maybe we should just add a keyword argument that lets you skip it in read_csv and leave it alone aside from that.

mortbauer · 2013-10-22T12:15:11Z

Hey, I didn't follow everything in detail here, but i experienced the smae behavior, so I looked into the source files and found something with which i was able to resolve the loosing of the line. For me it seems as a real bug.
so the source file i mean is io.parsers.py, the relevant code is:

 def _infer_columns(self):
        names = self.names

        if self.header is not None:
            header = self.header

            # we have a mi columns, so read and extra line
            if isinstance(header,(list,tuple,np.ndarray)):
                have_mi_columns = True
                header = list(header) + [header[-1]+1]
            else:
                have_mi_columns = False
                header = [ header ]

so it just increases the line number of the last entry if one passes a list to the header kwarg, with the initial example:

>>> from StringIO import StringIO
>>> import pandas
>>> text = """
a  a  a  b  c  c
q r s t u v
1 2 3 4 5 6
1 2 3 4 5 6
"""

one could do:

>>> pandas.read_csv(StringIO(text.strip()), sep='\s+', header=[1, 0])
   (q, a)  (r, a)  (s, a)  (t, b)  (u, c)  (v, c)
0       1       2       3       4       5       6
0       1       2       3       4       5       6

to get the almost desired output, see that the header is now upside down, but nothing is skipped.

jreback · 2013-10-22T12:18:28Z

@mortbauer thanks for the example. This is more complicated that it looks because to_csv needs to write in this format; read_csv needs a keyword to determine whether to skip an extra line (which it needs now) or not.

jtratner · 2013-10-22T12:39:08Z

If I'm reading this correctly, this is the same issue regarding ambiguity
of naming column levels or index levels. While I get why pandas wants to
have that nearly blank row for column names, many actual files are not
created like that. Usually you have a header for every column (or
potentially multiple levels of header).

Could we check for a row that has only the first N labels filled (where N
is the number of Index cols) occurring right after the last header row with
everything blank after that. If we find it, then assume those are the
index labels and the header row has labels for the column levels.

jtratner · 2013-10-22T12:40:52Z

s/labels/names above

jtratner closed this as completed Aug 28, 2013

jreback reopened this Aug 31, 2013

jreback mentioned this issue Oct 22, 2013

BUG: parser can handle a common_format multi-column index (no row index cols), (GH4702) #5298

Merged

jreback closed this as completed in #5298 Oct 24, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in read_csv when passing header kwarg? #4702

Bug in read_csv when passing header kwarg? #4702

jtratner commented Aug 28, 2013

cpcloud commented Aug 28, 2013

jtratner commented Aug 28, 2013

jtratner commented Aug 28, 2013

jtratner commented Aug 28, 2013

jtratner commented Aug 28, 2013

jreback commented Aug 28, 2013

jreback commented Aug 28, 2013

jtratner commented Aug 28, 2013

cancan101 commented Aug 28, 2013

jtratner commented Aug 28, 2013

cancan101 commented Aug 29, 2013

jtratner commented Aug 29, 2013

jreback commented Aug 29, 2013

cancan101 commented Aug 29, 2013

jreback commented Aug 29, 2013

cancan101 commented Aug 29, 2013

jreback commented Aug 31, 2013

jtratner commented Aug 31, 2013

jreback commented Aug 31, 2013

jreback commented Aug 31, 2013

jtratner commented Aug 31, 2013

mortbauer commented Oct 22, 2013

jreback commented Oct 22, 2013

jtratner commented Oct 22, 2013

jtratner commented Oct 22, 2013

Bug in read_csv when passing header kwarg? #4702

Bug in read_csv when passing header kwarg? #4702

Comments

jtratner commented Aug 28, 2013

cpcloud commented Aug 28, 2013

jtratner commented Aug 28, 2013

jtratner commented Aug 28, 2013

jtratner commented Aug 28, 2013

jtratner commented Aug 28, 2013

jreback commented Aug 28, 2013

jreback commented Aug 28, 2013

jtratner commented Aug 28, 2013

cancan101 commented Aug 28, 2013

jtratner commented Aug 28, 2013

cancan101 commented Aug 29, 2013

jtratner commented Aug 29, 2013

jreback commented Aug 29, 2013

cancan101 commented Aug 29, 2013

jreback commented Aug 29, 2013

cancan101 commented Aug 29, 2013

jreback commented Aug 31, 2013

jtratner commented Aug 31, 2013

jreback commented Aug 31, 2013

jreback commented Aug 31, 2013

jtratner commented Aug 31, 2013

mortbauer commented Oct 22, 2013

jreback commented Oct 22, 2013

jtratner commented Oct 22, 2013

jtratner commented Oct 22, 2013