BUG: read_csv raises exception if first row is incomplete #6710
Comments
hmm.. I think the error is useful here, rather than providing short row behavior. You said the header was 3 fields (e.g. by specifying What would you have the results be? |
I don't see why the behavior where the first row is short should differ from where the second row is short. If I specify three columns in the header, then I want three columns. Any row which has fewer values than that should be filled with NaN, assuming the dtype supports it, thus:
In [100]: mydata='1,2\n1,2,3\n4,5,6\n'
In [101]: pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'], usecols=['a', 'c'])
Out[101]:
a c
0 1 NaN
1 1 3
2 4 6
[3 rows x 2 columns]
The current behavior is inconsistent and frankly not very useful. My understanding from the docs is that the intention is to handle real-world data files intelligently and gracefully so we can go about getting our work done. Raising an exception in this case is neither intelligent nor graceful. In any case, I might as well mention that the obvious workaround is to omit
In [100]: mydata='1,2\n1,2,3\n4,5,6\n'
In [101]: df = pd.read_csv(StringIO(mydata), names=['a', 'b', 'c'])[['a', 'c']]
The result is the same, but this approach sacrifices "much faster parsing time and lower memory usage" which we were promised with |
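For anyone who wants to try the workaround as a script, here is a minimal sketch. It assumes a recent pandas and takes `StringIO` from the standard library's `io` module, which the original session does not show importing. The direct `usecols` call is wrapped in `try`/`except` because, per this thread, whether it succeeds depends on the pandas version and engine.

```python
from io import StringIO

import pandas as pd

mydata = "1,2\n1,2,3\n4,5,6\n"
names = ["a", "b", "c"]

# Workaround from the comment above: parse every column, then select two.
workaround = pd.read_csv(StringIO(mydata), names=names)[["a", "c"]]
print(workaround)

# The call the reporter actually wanted. Its outcome is not assumed here:
# a ValueError is caught and reported instead of crashing the script.
try:
    direct = pd.read_csv(StringIO(mydata), names=names, usecols=["a", "c"])
except ValueError as exc:
    direct = None
    print(f"usecols raised: {exc}")
else:
    print("usecols result matches workaround:", direct.equals(workaround))
```

The workaround parses column `b` only to throw it away, which is the cost the comment complains about.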
The difference is that the first row is a header, and this is a misspecification of the names, so you are asking pandas to decide which parameter is valid here. I think that is a good reason to raise an exception. That said, I would consider a pull request to modify this to just take the names (and maybe just issue a warning). I think as a user you would want to know that your file is malformed? Otherwise you would have set header=None and just skipped lines, no? |
Thanks for your explanation. In this case, the first row is not a header and the file is not malformed any more than it would be if subsequent lines were short. Perhaps I misunderstood, but I was under the impression that header is set to None implicitly when I specify names in the call to read_csv. In any case, the exception is raised even with an explicit header=None. |
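The point about `names` implying `header=None` can be checked on a well-formed file. This is a small sketch, assuming a recent pandas; the sample data and variable names are made up for illustration. It only shows that the two calls agree when every row is complete and says nothing about the short-first-row case.

```python
from io import StringIO

import pandas as pd

data = "1,2,3\n4,5,6\n"  # illustrative, complete rows only
names = ["a", "b", "c"]

# With names given and header left at its default, the first line is data.
implicit = pd.read_csv(StringIO(data), names=names)

# Spelling out header=None should give the same frame.
explicit = pd.read_csv(StringIO(data), names=names, header=None)

print(len(implicit))
print(implicit.equals(explicit))
```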
Hmm... that could be a bug then. If there is a complex interaction between parameters, this is most probably an untested case. |
I'm new to pandas. (I am really impressed and very much enjoying it, I should add.) I would love to fix this, but I had a look at the code and I wasn't quite able to follow it. Unfortunately, I don't have the time right now to track this down and fix it. The performance benefits of |
@altaurog great...thanks for reporting. marked as a bug. |
The case where
Do we want to get rid of this check? I don't see a good reason for it (and all tests pass without it), but maybe I'm missing something. |
An update here: seems like the C engine is happy, but the Python engine is a little sad:
>>> from pandas import read_csv
>>> from io import StringIO  # pandas.compat.StringIO is not available in current pandas
>>>
>>> data = '1,2\n1,2,3'
>>> usecols = ['a', 'c']
>>> names = ['a', 'b', 'c']
>>>
>>> read_csv(StringIO(data), usecols=usecols, names=names, engine='c')
a c
0 1 NaN
1 1 3.0
>>> read_csv(StringIO(data), usecols=usecols, names=names, engine='python')
...
ValueError: Number of passed names did not match number of header fields in the file
I suspect the error in the Python engine is being raised because we only check the first row and compare it to the header. |
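To see how a given pandas install behaves, the comparison above can be run as a small probe. This is a sketch only: `probe` is a hypothetical helper, not part of pandas, and it deliberately catches any exception so that both engines are reported whichever one fails.

```python
from io import StringIO

import pandas as pd

DATA = "1,2\n1,2,3"
NAMES = ["a", "b", "c"]
USECOLS = ["a", "c"]


def probe(engine):
    """Return the parsed frame, or a short description of the exception."""
    try:
        return pd.read_csv(
            StringIO(DATA), names=NAMES, usecols=USECOLS, engine=engine
        )
    except Exception as exc:  # broad on purpose: this is a diagnostic probe
        return f"{type(exc).__name__}: {exc}"


results = {engine: probe(engine) for engine in ("c", "python")}
for engine, outcome in results.items():
    print(f"engine={engine!r}")
    print(outcome)
```

If both engines return a frame, the installed version no longer shows the discrepancy reported here. A string outcome shows which engine still raises and with what message.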
Strangely enough I'm having the same problem in 1.3.5 |
xref #4749
xref #8985
This is a bit like issue #4749, but the conditions are different.
I have been able to reproduce it by specifying usecols: