Bug in read_csv when passing header kwarg? #4702
Comments
interesting...wonder if this is related to the interaction between |
Nevermind on the last one :P I added a blank line. |
So if you add extra rows, it specifically skips what ought to be the third row:
In [27]: text += """6 5 4 3 2 1
   ....: 6 5 4 3 2 1"""
In [28]:
In [28]: pandas.read_csv(StringIO(text.strip()), sep='\s+')
Out[28]:
   a a.1 a.2  b  c c.1
0  q   r   s  t  u   v
1  1   2   3  4  5   6
2  1   2   3  4  5   6
3  6   5   4  3  2   1
4  6   5   4  3  2   1
In [29]: pandas.read_csv(StringIO(text.strip()), sep='\s+', header=[0, 1], tupleize_cols=False)
Out[29]:
   a        b  c
   q  r  s  t  u  v
0  1  2  3  4  5  6
1  6  5  4  3  2  1
2  6  5  4  3  2  1 |
It's stripping what should be the second row I guess:
In [33]: text = """
   ....: a a a b c c
   ....: q r s t u v
   ....: 0 0 0 0 0 1
   ....: 1 2 3 4 5 6
   ....: 1 2 3 4 5 6
   ....: """
In [34]: pandas.read_csv(StringIO(text.strip()), sep='\s+')
Out[34]:
   a a.1 a.2  b  c c.1
0  q   r   s  t  u   v
1  0   0   0  0  0   1
2  1   2   3  4  5   6
3  1   2   3  4  5   6
In [35]: pandas.read_csv(StringIO(text.strip()), sep='\s+', header=[0, 1], tupleize_cols=False)
Out[35]:
   a        b  c
   q  r  s  t  u  v
0  1  2  3  4  5  6
1  1  2  3  4  5  6 |
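For anyone retrying the session above on a newer install, here is a minimal sketch of the same call. It assumes a pandas release where `tupleize_cols` is no longer accepted, so the keyword is simply dropped, and it uses a raw string for the separator. Whether the first data row is still skipped depends on the pandas version, so the sketch only compares the number of data lines in the text with the number of rows parsed rather than asserting a particular result.

```python
import io

import pandas as pd

# Same text as in the session above: two header lines, three data lines.
text = """
a a a b c c
q r s t u v
0 0 0 0 0 1
1 2 3 4 5 6
1 2 3 4 5 6
"""

# Lines that are not part of the two-row header.
n_data_lines = len(text.strip().splitlines()) - 2

# header=[0, 1] asks for a two-level column MultiIndex.
df = pd.read_csv(io.StringIO(text.strip()), sep=r"\s+", header=[0, 1])

print(df)
print("data lines in text:", n_data_lines, "rows parsed:", len(df))
```

If the two counts differ, a row was consumed as something other than data, which is the behavior reported in this thread.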
@cpcloud maybe it's an off-by-one error? (e.g., it passes a skiprows because of the header and it's passing one higher than it should) |
no, this is expected, as you MUST have a row of row names (which it skips); I couldn't find a reliable way to detect that |
http://pandas.pydata.org/pandas-docs/dev/io.html#reading-columns-with-a-multiindex see the last note; I suppose in theory you could look for row names that don't have the full columns, but at the time that seemed complicated. Could be a bug too (though your format does not conform to what it's supposed to read). |
I just misread. I thought it needed repeats, but instead it's looking for whitespace |
Is this related: #4382? |
@jreback no, I'm definitely just wrong. I read the csv intro as having repeated column names (instead of the empty column names to the right) |
@jtratner I feel like I am seeing the same thing here (i.e. there is in fact a bug): CSV:
and then:
even if I clean up the CSV:
I still get:
|
@cancan101 well, that's certainly not what I would expect. @jreback why does it need to have a row of row names? |
you can try and fix it if you want, but I couldn't find a way to make a general parser that could parse both the to_csv format (which does have all row and col names) and a format like the one you have above, without the user specifying more information than the header. bottom line is that maybe the above format should raise as an invalid format (unless you can find a way to parse it without sacrificing the rest of the cases) |
Why would the above format be invalid? |
it's not invalid per se, but how do you know whether the 3rd row is supposed to be row names (and just over-specified) or values? The csv format is just not descriptive enough to figure it out, and that is what to_csv outputs |
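To make the "row of row names" concrete, here is a small sketch of what `to_csv` writes for a frame with MultiIndex columns and a named index. The frame, the index name `idx`, and the values are made up for illustration; only `DataFrame.to_csv` and `read_csv` with `header` and `index_col` are real pandas calls.

```python
import io

import pandas as pd

# Hypothetical data: two column levels and a named index ("idx").
columns = pd.MultiIndex.from_tuples([("a", "q"), ("b", "r")])
index = pd.Index([10, 20], name="idx")
df = pd.DataFrame([[1, 2], [3, 4]], index=index, columns=columns)

csv_text = df.to_csv()
lines = csv_text.splitlines()
print(csv_text)

# Two header lines come first; the third line carries only the index name,
# with the remaining cells left empty.
index_name_row = lines[2]
print(repr(index_name_row))

# Reading it back with the same number of header rows plus index_col
# is the layout the parser was built around.
roundtrip = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)
print(roundtrip.shape)
```

The third line here is the one a reader has to tell apart from ordinary data when no index column is declared, which is the ambiguity discussed above.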
I am confused here. The parser tries to read the index values (ie the row names) from the third row? |
Here are the cases this was set up to read. It certainly can be changed to read MultiIndex columns without a row of row names, but I think you would need to have a way to tell the reader to do this. IIRC there is no easy way to 'figure' it out. Any suggestions?
|
@jreback I don't understand what you mean by "row of row names" - can you show an example of what the use case is for this? I would think that the far left column would contain the row labels (if any) and its headers would be the name for the index. |
Meant a row of index names (one name, or several if the index is a multi-index too). Multi-index on columns:
Multi-index on both axes:
|
So need to add an additional parameter to indicate that this 'extra' row need not be generated (in |
@jreback wow, that's a crazy case. Maybe we should just add a keyword argument that lets you skip it in read_csv and leave it alone aside from that. |
Hey, I didn't follow everything in detail here, but I experienced the same behavior, so I looked into the source files and found something that let me resolve the loss of the line. To me it seems like a real bug.
so it just increases the line number of the last entry if one passes a list to the header kwarg, with the initial example:
one could do:
to get almost the desired output: the header is now upside down, but nothing is skipped. |
@mortbauer thanks for the example. This is more complicated than it looks because |
If I'm reading this correctly, this is the same issue regarding ambiguity. Could we check for a row that has only the first N labels filled (where N |
s/labels/names above |
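The check proposed in the last two comments can be sketched with the standard library alone. This is only an illustration of the idea, not how pandas implements anything: the helper names `looks_like_index_name_row` and `split_header` are hypothetical, and treating N as the number of index columns (`n_index_cols`) is an assumption, since the comment above is cut off before it says what N is.

```python
import csv
import io


def looks_like_index_name_row(row, n_index_cols):
    """Return True if only the first n_index_cols cells are non-empty."""
    if n_index_cols <= 0 or len(row) <= n_index_cols:
        return False
    head, tail = row[:n_index_cols], row[n_index_cols:]
    head_filled = all(cell.strip() for cell in head)
    tail_empty = not any(cell.strip() for cell in tail)
    return head_filled and tail_empty


def split_header(text, n_header_rows, n_index_cols):
    """Split CSV text into header rows, optional index names, and data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[:n_header_rows]
    rest = rows[n_header_rows:]
    index_names = None
    # Only consume the next row if it matches the "names only" pattern.
    if rest and looks_like_index_name_row(rest[0], n_index_cols):
        index_names = rest[0][:n_index_cols]
        rest = rest[1:]
    return header, index_names, rest


# Layout with a row of index names after the two header rows.
with_names = ",a,b\n,q,r\nidx,,\n10,1,2\n20,3,4\n"
# Layout with no index column and no row of index names.
without_names = "a,b\nq,r\n1,2\n3,4\n"

header1, names1, data1 = split_header(with_names, 2, 1)
header2, names2, data2 = split_header(without_names, 2, 0)
print(names1, len(data1))
print(names2, len(data2))
```

The heuristic still misfires on a genuine data row whose only filled cells are the index cells, which is the ambiguity raised earlier in the thread, so a sketch like this would most likely need an explicit opt-out as well.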
Seems to lose a row when passing a list as header, check it out: