BUG: Patch missing data handling with usecols #15066
Current coverage is 84.75% (diff: 100%)
@@             master   #15066   diff @@
==========================================
  Files           145      145
  Lines         51220    51220
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
- Hits          43415    43413     -2
- Misses         7805     7807     +2
  Partials          0        0
@jorisvandenbossche can you have a look?
Looks good, and no concern regarding fix/test for #6710.
For issue #8985, I am a bit less sure this is necessarily the right thing. In any case, it is nowhere mentioned that usecols can actually be used for this case. So if we fix this 'bug', I would rather regard it as an enhancement to read in malformed files, and also document this.
        df = self.read_csv(StringIO(data), names=names, usecols=usecols)
        tm.assert_frame_equal(df, expected)

        # see gh-8985
Can you put this in a separate test with a different name? AFAIK this one is not about an incomplete first row (multiple rows are shorter, so it is inferred that there are 3 columns). It is rather about ignoring excess values in certain rows when using usecols.
Fair enough. Done.
Regarding the 'malformed' lines (rows with too many values), we e.g. also have
@jorisvandenbossche : Let me address your comments separately:
Can you point me to documentation and/or tests that specify this behaviour of
The usecols parameter is specifying three columns. Every row has three columns, so the parameter is perfectly valid. We should be able to handle it just fine.
I am not saying that we necessarily shouldn't do this (the use case is certainly useful). My request above was just, if we fix this, to document this as a feature of
BTW, the example case for #8985 actually already seems to work on latest master?
Handling missing data is independent of usecols IMO. That's why this isn't a "feature" for me.
It's about too many fields rather than missing fields. But to the point :-), we have some documentation about how read_csv handles malformed files (http://pandas.pydata.org/pandas-docs/stable/io.html#handling-bad-lines + docstring on those keywords), which is: raise an error when there are lines with more fields than the number of columns (or skip the line altogether if you specify a keyword). So IMO it would be nice to explain there that you can also make use of
BTW, it seems this already worked in at least 0.18.1, so it's indeed not a bug 'fix' :-)
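For readers following along, here is a minimal sketch of the "too many fields" case being discussed. The data is made up for illustration and is not copied from the PR's test suite, and the exact behaviour may depend on the pandas version and parser engine in use. The idea is that the default parser rejects a line with an extra field, while restricting the read to the first three columns with usecols lets the same input through.

```python
from io import StringIO

import pandas as pd

# Illustrative data (an assumption for this sketch): the last row carries
# one field more than the other rows.
data = "19,29,39\n19,29,39\n10,20,30,40"

# Without usecols, the extra field is expected to make the parser reject
# the file. The outcome is recorded in a flag rather than assumed.
try:
    pd.read_csv(StringIO(data), header=None)
    raised = False
except pd.errors.ParserError:
    raised = True

# With usecols naming the first three columns, the extra trailing field
# is expected to be dropped instead of triggering an error.
df = pd.read_csv(StringIO(data), header=None, usecols=[0, 1, 2])

print("error without usecols:", raised)
print(df)
```

Whether this counts as a fix or a documented feature is exactly the point under debate above; the sketch only shows the behaviour, not how the parser implements it.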
@jorisvandenbossche : Added example in the docs regarding
@gfyoung Thanks a lot for the addition!
Patch handling of cases when the first row of a CSV is incomplete and usecols is specified.

Closes #6710.
Closes #8985.
xref #14782.
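
As a rough illustration of the incomplete-first-row case named in the description, the sketch below reads a two-line CSV whose first row is shorter than the list of names. The data and column names are assumptions chosen for this example. On a pandas build that includes this patch, the selected column that is absent from the first row should come back as a missing value rather than causing a failure.

```python
from io import StringIO

import pandas as pd

# Illustrative data (an assumption for this sketch): the first row has
# only two fields, the second row has all three.
data = "1,2\n1,2,3"

# Three names are supplied, but only "a" and "c" are requested. The first
# row has no value for "c", so it should be filled in as missing.
df = pd.read_csv(StringIO(data), names=["a", "b", "c"], usecols=["a", "c"])

print(df)
```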