Skip to content

Bug in read_csv when passing header kwarg? #4702

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jtratner opened this issue Aug 28, 2013 · 25 comments · Fixed by #5298
Closed

Bug in read_csv when passing header kwarg? #4702

jtratner opened this issue Aug 28, 2013 · 25 comments · Fixed by #5298
Labels
API Design IO CSV read_csv, to_csv
Milestone

Comments

@jtratner
Copy link
Contributor

Seems to lose a row when passing a list as header, check it out:

>>> from StringIO import StringIO
>>> import pandas
>>> text = """
a  a  a  b  c  c
q r s t u v
1 2 3 4 5 6
1 2 3 4 5 6
"""
>>> # expected behavior without header
>>> pandas.read_csv(StringIO(text.strip()), sep='\s+')
   a a.1 a.2  b  c c.1
0  q   r   s  t  u   v
1  1   2   3  4  5   6
2  1   2   3  4  5   6
>>> #drops row  with specified header
>>> pandas.read_csv(StringIO(text.strip()), sep='\s+', header=[0, 1])
   (a, q)  (a, r)  (a, s)  (b, t)  (c, u)  (c, v)
0       1       2       3       4       5       6
>>> pandas.read_csv(StringIO(text.strip()), sep='\s+', header=[0, 1], tupleize_cols=False)
   a        b  c
   q  r  s  t  u  v
0  1  2  3  4  5  6
@cpcloud
Copy link
Member

cpcloud commented Aug 28, 2013

interesting...wonder if this is related to the interaction between header and skiprows...

@jtratner
Copy link
Contributor Author

Nevermind on the last one :P I added a blank line.

@jtratner
Copy link
Contributor Author

So if you add extra rows, it specifically skips what ought to be the third row:

In [27]: text += """6 5 4 3 2 1
   ....: 6 5 4 3 2 1"""

In [28]:

In [28]: pandas.read_csv(StringIO(text.strip()), sep='\s+')
Out[28]:
   a a.1 a.2  b  c c.1
0  q   r   s  t  u   v
1  1   2   3  4  5   6
2  1   2   3  4  5   6
3  6   5   4  3  2   1
4  6   5   4  3  2   1

In [29]: pandas.read_csv(StringIO(text.strip()), sep='\s+', header=[0, 1], tupleize_cols=False)
Out[29]:
   a        b  c
   q  r  s  t  u  v
0  1  2  3  4  5  6
1  6  5  4  3  2  1
2  6  5  4  3  2  1

@jtratner
Copy link
Contributor Author

It's stripping what should be the second row I guess::

In [33]: text = """
   ....: a  a  a  b  c  c
   ....: q r s t u v
   ....: 0 0 0 0 0 1
   ....: 1 2 3 4 5 6
   ....: 1 2 3 4 5 6
   ....: """

In [34]: pandas.read_csv(StringIO(text.strip()), sep='\s+')
Out[34]:
   a a.1 a.2  b  c c.1
0  q   r   s  t  u   v
1  0   0   0  0  0   1
2  1   2   3  4  5   6
3  1   2   3  4  5   6

In [35]: pandas.read_csv(StringIO(text.strip()), sep='\s+', header=[0, 1], tupleize_cols=False)
Out[35]:
   a        b  c
   q  r  s  t  u  v
0  1  2  3  4  5  6
1  1  2  3  4  5  6

@jtratner
Copy link
Contributor Author

@cpcloud maybe it's an off-by-one error? (e.g., it passes a skiprows because of the header and it's passing one higher than it should)

@jreback
Copy link
Contributor

jreback commented Aug 28, 2013

no, this is expected, as you MUST have a row of row names (which it skips); I couldn't find a reliable way to detect that

@jreback
Copy link
Contributor

jreback commented Aug 28, 2013

http://pandas.pydata.org/pandas-docs/dev/io.html#reading-columns-with-a-multiindex

see the last note; I suppose in theory you could look for row names that don't have the full columns, but at the time seemed complicated.

coould be a bug too (though your format is not conforming to what its supposed to read).

@jtratner
Copy link
Contributor Author

I just misread. I thought it needed repeats, but instead it's looking for whitespace

@cancan101
Copy link
Contributor

Is this related: #4382?

@jtratner
Copy link
Contributor Author

@jreback no, I'm definitely just wrong. I read the csv intro as having repeated column names (instead of the empty column names to the right)

@cancan101
Copy link
Contributor

@jtratner I feel like I am seeing the same thing here (i.e. there is in fact a bug):

CSV:

a,,b,,c,
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2

and then:

In [66]: pd.read_csv("test2.csv",header=[0,1],tupleize_cols=False)
Out[66]: 
    a  Unnamed: 1_level_0   b  Unnamed: 3_level_0   c  Unnamed: 5_level_0
   a1                  a2  b1                  b2  c1                  c2
0   2                   2   2                   2   2                   2

even if I clean up the CSV:

a,a,b,b,c,c
a1,a2,b1,b2,c1,c2
1,1,1,1,1,1
2,2,2,2,2,2

I still get:

In [67]: pd.read_csv("test3.csv",header=[0,1],tupleize_cols=False)
Out[67]: 
    a       b       c    
   a1  a2  b1  b2  c1  c2
0   2   2   2   2   2   2

@jtratner
Copy link
Contributor Author

@cancan101 well, that's certainly not what I would expect. @jreback why does it need to have a row of row names?

@jreback
Copy link
Contributor

jreback commented Aug 29, 2013

you can try and fix if u want, but couldn't find a way to make a general parser that could parse to_csv format (that does have all row and col names) and a format that is what u have above, without the user specifying more information that the header. bottom line is that maybe the above format should raise as an invalid format (unless you can find a way to parse it) without sacrificing the rest of the cases

@cancan101
Copy link
Contributor

Why would the above format be invalid?

@jreback
Copy link
Contributor

jreback commented Aug 29, 2013

it's not invalid per se but how do you know that the 3rd row is supposed to be row names (and just over specified) or values?

csv format is just not descriptive enough to figure it out

and that is what to_csv outputs
it HAS to assume it is row names in order to be round trip able

@cancan101
Copy link
Contributor

I am confused here. The parser tries to read the index values (ie the row names) from the third row?

@jreback jreback reopened this Aug 31, 2013
@jreback
Copy link
Contributor

jreback commented Aug 31, 2013

Here are the cases this was setup to read. It certainly can be changed to read a mi columns with out a row of row-names, but I think you would need to have a way to tell the reader to do this. IIRC there is no easy way to 'figure' it out. Any suggestions?

In [27]: df = DataFrame([[1,1,1,1,1,1],[2,2,2,2,2,2]],columns=pd.MultiIndex.from_arrays([['a','a','b','b','c','c'],['a1','a2','b1','b2','c1','c2']]))

In [28]: df
Out[28]: 
    a       b       c    
   a1  a2  b1  b2  c1  c2
0   1   1   1   1   1   1
1   2   2   2   2   2   2

In [34]: df.to_csv('test.csv',tupleize_cols=False,index=None)

In [35]: !cat test.csv
a,a,b,b,c,c
a1,a2,b1,b2,c1,c2
,,,,,
1,1,1,1,1,1
2,2,2,2,2,2

In [38]: pd.read_csv('test.csv',header=[0,1],index_col=None,tupleize_cols=False)
Out[38]: 
    a       b       c    
   a1  a2  b1  b2  c1  c2
0   1   1   1   1   1   1
1   2   2   2   2   2   2
In [43]: df.to_csv('test.csv',tupleize_cols=False)

In [44]: !cat test.csv
,a,a,b,b,c,c
,a1,a2,b1,b2,c1,c2
,,,,,,
0,1,1,1,1,1,1
1,2,2,2,2,2,2

In [45]: pd.read_csv('test.csv',header=[0,1],index_col=0,tupleize_cols=False)
Out[45]: 
    a       b       c    
   a1  a2  b1  b2  c1  c2
0   1   1   1   1   1   1
1   2   2   2   2   2   2

@jtratner
Copy link
Contributor Author

@jreback I don't understand what you mean by "row of row names" - can you show an example of what the use case is for this? I would think that the far left column would contain the row labels (if any) and its headers would be the name for the index.

@jreback
Copy link
Contributor

jreback commented Aug 31, 2013

Meant row if index name (or names if the index is a multi-index too).

multi-index on columns

In [50]: df = DataFrame([[1,1,1,1,1,1],[2,2,2,2,2,2]],columns=pd.MultiIndex.from_arrays([['a','a','b','b','c','c'],['a1','a2','b1','b2','c1','c2']],names=['first','second']),index=Index([0,1],name='foo'))

In [51]: df.to_csv('test.csv',tupleize_cols=False)

In [52]: !cat test.csv
first,a,a,b,b,c,c
second,a1,a2,b1,b2,c1,c2
foo,,,,,,
0,1,1,1,1,1,1
1,2,2,2,2,2,2

In [56]: pd.read_csv('test.csv',header=[0,1],tupleize_cols=False,index_col=0)
Out[56]: 
first    a       b       c    
second  a1  a2  b1  b2  c1  c2
foo                           
0        1   1   1   1   1   1
1        2   2   2   2   2   2

multi index on both axes

In [57]: df = DataFrame([[1,1,1,1,1,1],[2,2,2,2,2,2]],columns=pd.MultiIndex.from_arrays([['a','a','b','b','c','c'],['a1','a2','b1','b2','c1','c2']],names=['first','second']),index=pd.MultiIndex.from_arrays([['A',0],['A',1]],names=['l1','l2']))

In [58]: df
Out[58]: 
first    a       b       c    
second  a1  a2  b1  b2  c1  c2
l1 l2                         
A  A     1   1   1   1   1   1
0  1     2   2   2   2   2   2

In [59]: df.to_csv('test.csv',tupleize_cols=False)

In [60]: !cat test.csv
first,,a,a,b,b,c,c
second,,a1,a2,b1,b2,c1,c2
l1,l2,,,,,,
A,A,1,1,1,1,1,1
0,1,2,2,2,2,2,2

In [61]: pd.read_csv('test.csv',header=[0,1],tupleize_cols=False,index_col=[0,1])
Out[61]: 
first    a       b       c    
second  a1  a2  b1  b2  c1  c2
l1 l2                         
A  A     1   1   1   1   1   1
0  1     2   2   2   2   2   2

@jreback
Copy link
Contributor

jreback commented Aug 31, 2013

So need to add an additional parameter to indicate that this 'extra' row need not be generated (in to_csv)
or skipped on the read back (in read_csv).

@jtratner
Copy link
Contributor Author

@jreback wow, that's a crazy case. Maybe we should just add a keyword argument that lets you skip it in read_csv and leave it alone aside from that.

@mortbauer
Copy link

Hey, I didn't follow everything in detail here, but i experienced the smae behavior, so I looked into the source files and found something with which i was able to resolve the loosing of the line. For me it seems as a real bug.
so the source file i mean is io.parsers.py, the relevant code is:

 def _infer_columns(self):
        names = self.names

        if self.header is not None:
            header = self.header

            # we have a mi columns, so read and extra line
            if isinstance(header,(list,tuple,np.ndarray)):
                have_mi_columns = True
                header = list(header) + [header[-1]+1]
            else:
                have_mi_columns = False
                header = [ header ]

so it just increases the line number of the last entry if one passes a list to the header kwarg, with the initial example:

>>> from StringIO import StringIO
>>> import pandas
>>> text = """
a  a  a  b  c  c
q r s t u v
1 2 3 4 5 6
1 2 3 4 5 6
"""

one could do:

>>> pandas.read_csv(StringIO(text.strip()), sep='\s+', header=[1, 0])
   (q, a)  (r, a)  (s, a)  (t, b)  (u, c)  (v, c)
0       1       2       3       4       5       6
0       1       2       3       4       5       6

to get the almost desired output, see that the header is now upside down, but nothing is skipped.

@jreback
Copy link
Contributor

jreback commented Oct 22, 2013

@mortbauer thanks for the example. This is more complicated that it looks because to_csv needs to write in this format; read_csv needs a keyword to determine whether to skip an extra line (which it needs now) or not.

@jtratner
Copy link
Contributor Author

If I'm reading this correctly, this is the same issue regarding ambiguity
of naming column levels or index levels. While I get why pandas wants to
have that nearly blank row for column names, many actual files are not
created like that. Usually you have a header for every column (or
potentially multiple levels of header).

Could we check for a row that has only the first N labels filled (where N
is the number of Index cols) occurring right after the last header row with
everything blank after that. If we find it, then assume those are the
index labels and the header row has labels for the column levels.

@jtratner
Copy link
Contributor Author

s/labels/names above

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design IO CSV read_csv, to_csv
Projects
None yet
5 participants