BUG: Patch missing data handling with usecols #15066

gfyoung · 2017-01-05T05:05:57Z

Patch handling of cases when the first row of a CSV is incomplete and usecols is specified.

Closes #6710.
Closes #8985.

codecov-io · 2017-01-05T05:35:30Z

Current coverage is 84.75% (diff: 100%)

Merging #15066 into master will decrease coverage by <.01%

@@             master     #15066   diff @@
==========================================
  Files           145        145          
  Lines         51220      51220          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43415      43413     -2   
- Misses         7805       7807     +2   
  Partials          0          0

Powered by Codecov. Last update c71f214...6bb558d

jreback · 2017-01-09T18:17:02Z

@jorisvandenbossche can you have a look.

jorisvandenbossche

Looks good, and no concern regarding fix/test for #6710.

For issue #8985, I am a bit less sure this is necessarily the right thing. In any case, it is nowhere mentioned that usecols can actually be used for this case. So if we fix this 'bug', I would rather regard it as an enhancement to read in malformed files, and also document this.

jorisvandenbossche · 2017-01-09T20:50:15Z

pandas/io/tests/parser/usecols.py

+        df = self.read_csv(StringIO(data), names=names, usecols=usecols)
+        tm.assert_frame_equal(df, expected)
+
+        # see gh-8985


Can you put this in a separate test that is called differently? As AFAIK this one is not about an incomplete first row (multiple rows are shorter, so it is inferred that there are 3 columns) But it is rather about ignoring excessive values on certain columns when using usecols?

Fair enough. Done.

jorisvandenbossche · 2017-01-09T21:00:42Z

Regarding the 'malformed' lines (rows with too many values), we eg also have error_bad_lines and warn_bad_lines. Although this will 'drop' the full bad line instead of dropping only the excessive values on the line.
Slightly related, there is also #9729 which asks for similar handling for bad lines with too few values. So maybe we should have a more general discussion about how to handle bad lines (where the options are raise/drop full line/'process' combined with a warning or not in case of not raising)

gfyoung · 2017-01-10T05:36:41Z

@jorisvandenbossche : Let me address your comments separately:

csv_reader with limited number of columns should should completely disregard the unused fields #8985 is a bug, not an enhancement. The usecols parameter was 100% valid in the test case presented from the issue. I fail to see what new thing needs to be documented here.
"Bad" lines with too few fields #9729 is tangentially related to this issue because it has to deal with an inconsistent number of fields across rows. usecols should not be involved in that discussion. I don't believe this discussion needs include this PR.

jorisvandenbossche · 2017-01-10T19:46:55Z

#8985 is a bug, not an enhancement. The usecols parameter was 100% valid in the test case presented from the issue. I fail to see what new thing needs to be documented here.

Can you point me to documentation and/or tests that specifies this behaviour of usecols ?

gfyoung · 2017-01-10T19:49:43Z

The usecols parameter is specifying three columns. Every row has three columns, so the parameter is perfectly valid. We should be able to handle it just fine.

jorisvandenbossche · 2017-01-10T20:00:54Z

We should be able to handle it just fine.

I am not saying that we necessarily shouldn't do this (the use case is certainly useful). My request above was just, if we fix this, to document this as a feature of usecols

jorisvandenbossche · 2017-01-10T20:01:59Z

BTW, the example case for #8985 actually already seems to work on latest master?

In [21]: usecols = [0, 1, 2]

In [22]: data = '19,29,39\n' * 2 + '10,20,30,40'

In [25]: pd.read_csv(StringIO(data), header=None, usecols=usecols)
Out[25]: 
    0   1   2
0  19  29  39
1  19  29  39
2  10  20  30

In [26]: pd.read_csv(StringIO(data), header=None, usecols=usecols, engine='python')
Out[26]: 
    0   1   2
0  19  29  39
1  19  29  39
2  10  20  30

In [27]: pd.__version__
Out[27]: '0.19.0+317.g41170d4'

gfyoung · 2017-01-10T20:05:13Z

Handling missing data is independent of usecols IMO. That's why this isn't a "feature" for me.

jorisvandenbossche · 2017-01-10T20:34:36Z

It's rather about too many fields rather than missing fields. But to the point :-), we have some documentation about how read_csv handles malformed files (http://pandas.pydata.org/pandas-docs/stable/io.html#handling-bad-lines + docstring on those keywords), which is: raising an error when there are lines with more fields than the number of columns (or skip the line altogether if you specify a keyword). So IMO it would be nice to explain there that you can also make use of usecols to ignore the excessive data (instead of raising or dropping full line).

BTW, it seems already worked in at least 0.18.1, so it's indeed not a bug 'fix' :-)
But a confirmation of the behaviour in tests (which is good!)

gfyoung · 2017-01-11T04:45:38Z

@jorisvandenbossche :

Ah, indeed csv_reader with limited number of columns should should completely disregard the unused fields #8985 was already a non-issue, but 100% agree that confirmation is good.
Fair enough regarding usecols. I'll add an example about that.

Closes pandas-devgh-6710. Closes pandas-devgh-8985.

gfyoung · 2017-01-11T07:37:35Z

@jorisvandenbossche : Added example in the docs regarding usecols for handling malformed rows. Ready to merge if there are no other concerns.

jorisvandenbossche · 2017-01-11T07:57:34Z

@gfyoung Thanks a lot for the addition!

Closes pandas-devgh-6710. Closes pandas-devgh-8985.

sinhrks added Bug IO CSV read_csv, to_csv labels Jan 5, 2017

jreback added this to the 0.20.0 milestone Jan 9, 2017

jorisvandenbossche reviewed Jan 9, 2017

View reviewed changes

gfyoung force-pushed the first-row-csv-incomplete branch from 852300e to bfe01c0 Compare January 10, 2017 05:38

BUG: Patch missing data handling with usecols

6bb558d

Closes pandas-devgh-6710. Closes pandas-devgh-8985.

gfyoung force-pushed the first-row-csv-incomplete branch from bfe01c0 to 6bb558d Compare January 11, 2017 04:52

jorisvandenbossche merged commit 6efb86c into pandas-dev:master Jan 11, 2017

gfyoung deleted the first-row-csv-incomplete branch January 11, 2017 08:02

gfyoung mentioned this pull request Jan 12, 2017

TST: Add another test for uneven CSV rows #15114

Merged

AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017

BUG: Patch missing data handling with usecols (pandas-dev#15066)

c014209

Closes pandas-devgh-6710. Closes pandas-devgh-8985.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Patch missing data handling with usecols #15066

BUG: Patch missing data handling with usecols #15066

gfyoung commented Jan 5, 2017

codecov-io commented Jan 5, 2017 •

edited

Loading

jreback commented Jan 9, 2017

jorisvandenbossche left a comment

jorisvandenbossche Jan 9, 2017

gfyoung Jan 10, 2017

jorisvandenbossche commented Jan 9, 2017

gfyoung commented Jan 10, 2017 •

edited

Loading

jorisvandenbossche commented Jan 10, 2017

gfyoung commented Jan 10, 2017

jorisvandenbossche commented Jan 10, 2017

jorisvandenbossche commented Jan 10, 2017

gfyoung commented Jan 10, 2017

jorisvandenbossche commented Jan 10, 2017 •

edited

Loading

gfyoung commented Jan 11, 2017

gfyoung commented Jan 11, 2017

jorisvandenbossche commented Jan 11, 2017

BUG: Patch missing data handling with usecols #15066

BUG: Patch missing data handling with usecols #15066

Conversation

gfyoung commented Jan 5, 2017

codecov-io commented Jan 5, 2017 • edited Loading

Current coverage is 84.75% (diff: 100%)

jreback commented Jan 9, 2017

jorisvandenbossche left a comment

Choose a reason for hiding this comment

jorisvandenbossche Jan 9, 2017

Choose a reason for hiding this comment

gfyoung Jan 10, 2017

Choose a reason for hiding this comment

jorisvandenbossche commented Jan 9, 2017

gfyoung commented Jan 10, 2017 • edited Loading

jorisvandenbossche commented Jan 10, 2017

gfyoung commented Jan 10, 2017

jorisvandenbossche commented Jan 10, 2017

jorisvandenbossche commented Jan 10, 2017

gfyoung commented Jan 10, 2017

jorisvandenbossche commented Jan 10, 2017 • edited Loading

gfyoung commented Jan 11, 2017

gfyoung commented Jan 11, 2017

jorisvandenbossche commented Jan 11, 2017

codecov-io commented Jan 5, 2017 •

edited

Loading

gfyoung commented Jan 10, 2017 •

edited

Loading

jorisvandenbossche commented Jan 10, 2017 •

edited

Loading