Skip to content

raise ValueError if usecols has same size as but doesn't exist in headers (#14671) #14674

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.19.2.txt
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ Bug Fixes



- Bug in pd.read_csv - catch missing columns if usecols and header lengths match (:issue:`14671`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to 0.20.0.txt?

Copy link
Member

@gfyoung gfyoung Jan 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. pd.read_csv() instead of pd.read_csv
  2. Let's generalize this a little. This PR is not actually about handling when header and usecols length match. It's about properly handling situations when usecols provides non-existent columns.




Expand Down
14 changes: 8 additions & 6 deletions pandas/io/parsers.py
Original file line number Diff line number Diff line change
Expand Up @@ -1425,12 +1425,14 @@ def __init__(self, src, **kwds):
self.orig_names = self.names[:]

if self.usecols:
if len(self.names) > len(self.usecols):
self.names = [n for i, n in enumerate(self.names)
if (i in self.usecols or n in self.usecols)]

if len(self.names) < len(self.usecols):
raise ValueError("Usecols do not match names.")
if self._reader.file_header is not None:
h = self._reader.file_header[0]
usecol_len = len(set(self.usecols) - set(h))
usecoli_len = len(set(self.usecols) - set(range(0, len(h))))
if usecol_len > 0 and usecoli_len > 0:
raise ValueError("Usecols do not match names.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those code block seems a bit convoluted. Can this be simplified?
Is there a reason you need _reader.file_header. Aren't those already in names?

Possibly this would be easier if this check is performed inside _filter_usecols.
Then we could check at the same time for out of bounds integer usecols (eg usecols=[0,2] if there are only 2 columns)

Copy link
Contributor Author

@GGordonGordon GGordonGordon Jan 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell the way the C parser works is that it will populate the names from the names value passed into the TextReader class. So if I wanted to read columns 'A' and 'C' from a CSV but rename them as 'Category' and 'Total' (names parameter) usecols compares against the names parameter

Another example (which happens with the pandas.io.tests.parser.test_parsers.TestCParserLowMemory test):

        s = """0,1,20140101,0900,4
        0,1,20140102,1000,4"""
        parse_dates = [[1, 2]]
        names = list('acd')

        df = self.read_csv(StringIO(s), names=names,
                           usecols=[0, 2, 3],
                           parse_dates=parse_dates)

Fails because index 3 doesn't exist in the names list passed in (it's only 0-2)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify my concern here is how should the usecols function if the names parameter is used? The documentation states:
All elements in this array must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s)

Is that to mean that if the names parameter is passed in then usecols filters on those names or does it mean that I should be able to pass in values from either the headers or the names parameter? If it's the former then perhaps I can update the documentation to state "name (if supplied)" under the usecols argument. If it's the latter then how are to know which list to filter based on an integer index? The header or the names?

The other question is should the usecols be applied / filtered first before the names are applied to the columns?


self.names = self._filter_usecols(self.names)

self._set_noconvert_columns()

Expand Down
4 changes: 4 additions & 0 deletions pandas/io/tests/parser/usecols.py
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,10 @@ def test_usecols(self):
expected.columns = ['foo', 'bar']
tm.assert_frame_equal(result, expected)

# same length but usecols column doesn't exist - see gh-14671
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this in a separate test (eg call it test_usecols_non_existing)? And also add an explicit test case for the 'normal' case where the length does not match the number of columns (which already raises), and tests for both cases but using integers.

self.assertRaises(ValueError, self.read_csv, StringIO(data),
usecols=['a', 'b', 'z'])

data = """\
1,2,3
4,5,6
Expand Down
8 changes: 7 additions & 1 deletion pandas/parser.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -290,7 +290,7 @@ cdef class TextReader:
object na_values
object memory_map
object as_recarray
object header, orig_header, names, header_start, header_end
object header, orig_header, names, header_start, header_end, file_header
object index_col
object low_memory
object skiprows
Expand Down Expand Up @@ -775,6 +775,12 @@ cdef class TextReader:
data_line = hr + 1
header.append(this_header)

self.file_header = header[:]

#if self.usecols is not None:
# if len(set(self.usecols) - set(header[0])) > 0 and len(set(self.usecols) - set(range(0,field_count))) > 0:
# raise ValueError("Usecols do not match names.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clean this up?


if self.names is not None:
header = [ self.names ]

Expand Down