Skip to content

BUG: Patch missing data handling with usecols #15066

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Jan 5, 2017

Patch handling of cases when the first row of a CSV is incomplete and usecols is specified.

Closes #6710.
Closes #8985.

xref #14782.

@codecov-io
Copy link

codecov-io commented Jan 5, 2017

Current coverage is 84.75% (diff: 100%)

Merging #15066 into master will decrease coverage by <.01%

@@             master     #15066   diff @@
==========================================
  Files           145        145          
  Lines         51220      51220          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43415      43413     -2   
- Misses         7805       7807     +2   
  Partials          0          0          

Powered by Codecov. Last update c71f214...6bb558d

@sinhrks sinhrks added Bug IO CSV read_csv, to_csv labels Jan 5, 2017
@jreback jreback added this to the 0.20.0 milestone Jan 9, 2017
@jreback
Copy link
Contributor

jreback commented Jan 9, 2017

@jorisvandenbossche can you have a look.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, and no concern regarding fix/test for #6710.

For issue #8985, I am a bit less sure this is necessarily the right thing. In any case, it is nowhere mentioned that usecols can actually be used for this case. So if we fix this 'bug', I would rather regard it as an enhancement to read in malformed files, and also document this.

df = self.read_csv(StringIO(data), names=names, usecols=usecols)
tm.assert_frame_equal(df, expected)

# see gh-8985
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this in a separate test that is called differently? As AFAIK this one is not about an incomplete first row (multiple rows are shorter, so it is inferred that there are 3 columns) But it is rather about ignoring excessive values on certain columns when using usecols?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Done.

@jorisvandenbossche
Copy link
Member

Regarding the 'malformed' lines (rows with too many values), we eg also have error_bad_lines and warn_bad_lines. Although this will 'drop' the full bad line instead of dropping only the excessive values on the line.
Slightly related, there is also #9729 which asks for similar handling for bad lines with too few values. So maybe we should have a more general discussion about how to handle bad lines (where the options are raise/drop full line/'process' combined with a warning or not in case of not raising)

@gfyoung
Copy link
Member Author

gfyoung commented Jan 10, 2017

@jorisvandenbossche : Let me address your comments separately:

  1. csv_reader with limited number of columns should should completely disregard the unused fields #8985 is a bug, not an enhancement. The usecols parameter was 100% valid in the test case presented from the issue. I fail to see what new thing needs to be documented here.

  2. "Bad" lines with too few fields #9729 is tangentially related to this issue because it has to deal with an inconsistent number of fields across rows. usecols should not be involved in that discussion. I don't believe this discussion needs include this PR.

@gfyoung gfyoung force-pushed the first-row-csv-incomplete branch from 852300e to bfe01c0 Compare January 10, 2017 05:38
@jorisvandenbossche
Copy link
Member

#8985 is a bug, not an enhancement. The usecols parameter was 100% valid in the test case presented from the issue. I fail to see what new thing needs to be documented here.

Can you point me to documentation and/or tests that specifies this behaviour of usecols ?

@gfyoung
Copy link
Member Author

gfyoung commented Jan 10, 2017

The usecols parameter is specifying three columns. Every row has three columns, so the parameter is perfectly valid. We should be able to handle it just fine.

@jorisvandenbossche
Copy link
Member

We should be able to handle it just fine.

I am not saying that we necessarily shouldn't do this (the use case is certainly useful). My request above was just, if we fix this, to document this as a feature of usecols

@jorisvandenbossche
Copy link
Member

BTW, the example case for #8985 actually already seems to work on latest master?

In [21]: usecols = [0, 1, 2]

In [22]: data = '19,29,39\n' * 2 + '10,20,30,40'

In [25]: pd.read_csv(StringIO(data), header=None, usecols=usecols)
Out[25]: 
    0   1   2
0  19  29  39
1  19  29  39
2  10  20  30

In [26]: pd.read_csv(StringIO(data), header=None, usecols=usecols, engine='python')
Out[26]: 
    0   1   2
0  19  29  39
1  19  29  39
2  10  20  30

In [27]: pd.__version__
Out[27]: '0.19.0+317.g41170d4'

@gfyoung
Copy link
Member Author

gfyoung commented Jan 10, 2017

Handling missing data is independent of usecols IMO. That's why this isn't a "feature" for me.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 10, 2017

It's rather about too many fields rather than missing fields. But to the point :-), we have some documentation about how read_csv handles malformed files (http://pandas.pydata.org/pandas-docs/stable/io.html#handling-bad-lines + docstring on those keywords), which is: raising an error when there are lines with more fields than the number of columns (or skip the line altogether if you specify a keyword). So IMO it would be nice to explain there that you can also make use of usecols to ignore the excessive data (instead of raising or dropping full line).

BTW, it seems already worked in at least 0.18.1, so it's indeed not a bug 'fix' :-)
But a confirmation of the behaviour in tests (which is good!)

@gfyoung
Copy link
Member Author

gfyoung commented Jan 11, 2017

@jorisvandenbossche :

  1. Ah, indeed csv_reader with limited number of columns should should completely disregard the unused fields #8985 was already a non-issue, but 100% agree that confirmation is good.

  2. Fair enough regarding usecols. I'll add an example about that.

@gfyoung gfyoung force-pushed the first-row-csv-incomplete branch from bfe01c0 to 6bb558d Compare January 11, 2017 04:52
@gfyoung
Copy link
Member Author

gfyoung commented Jan 11, 2017

@jorisvandenbossche : Added example in the docs regarding usecols for handling malformed rows. Ready to merge if there are no other concerns.

@jorisvandenbossche jorisvandenbossche merged commit 6efb86c into pandas-dev:master Jan 11, 2017
@jorisvandenbossche
Copy link
Member

@gfyoung Thanks a lot for the addition!

@gfyoung gfyoung deleted the first-row-csv-incomplete branch January 11, 2017 08:02
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants