-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
raise ValueError if usecols has same size as but doesn't exist in headers (#14671) #14674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1425,12 +1425,14 @@ def __init__(self, src, **kwds): | |
self.orig_names = self.names[:] | ||
|
||
if self.usecols: | ||
if len(self.names) > len(self.usecols): | ||
self.names = [n for i, n in enumerate(self.names) | ||
if (i in self.usecols or n in self.usecols)] | ||
|
||
if len(self.names) < len(self.usecols): | ||
raise ValueError("Usecols do not match names.") | ||
if self._reader.file_header is not None: | ||
h = self._reader.file_header[0] | ||
usecol_len = len(set(self.usecols) - set(h)) | ||
usecoli_len = len(set(self.usecols) - set(range(0, len(h)))) | ||
if usecol_len > 0 and usecoli_len > 0: | ||
raise ValueError("Usecols do not match names.") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Those code block seems a bit convoluted. Can this be simplified? Possibly this would be easier if this check is performed inside There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. From what I can tell the way the C parser works is that it will populate the names from the names value passed into the TextReader class. So if I wanted to read columns 'A' and 'C' from a CSV but rename them as 'Category' and 'Total' (names parameter) usecols compares against the names parameter Another example (which happens with the pandas.io.tests.parser.test_parsers.TestCParserLowMemory test): s = """0,1,20140101,0900,4
0,1,20140102,1000,4"""
parse_dates = [[1, 2]]
names = list('acd')
df = self.read_csv(StringIO(s), names=names,
usecols=[0, 2, 3],
parse_dates=parse_dates) Fails because index 3 doesn't exist in the names list passed in (it's only 0-2) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. cc @gfyoung There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To clarify my concern here is how should the usecols function if the names parameter is used? The documentation states: Is that to mean that if the names parameter is passed in then usecols filters on those names or does it mean that I should be able to pass in values from either the headers or the names parameter? If it's the former then perhaps I can update the documentation to state "name (if supplied)" under the usecols argument. If it's the latter then how are to know which list to filter based on an integer index? The header or the names? The other question is should the usecols be applied / filtered first before the names are applied to the columns? |
||
|
||
self.names = self._filter_usecols(self.names) | ||
|
||
self._set_noconvert_columns() | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -54,6 +54,10 @@ def test_usecols(self): | |
expected.columns = ['foo', 'bar'] | ||
tm.assert_frame_equal(result, expected) | ||
|
||
# same length but usecols column doesn't exist - see gh-14671 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you put this in a separate test (eg call it |
||
self.assertRaises(ValueError, self.read_csv, StringIO(data), | ||
usecols=['a', 'b', 'z']) | ||
|
||
data = """\ | ||
1,2,3 | ||
4,5,6 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -290,7 +290,7 @@ cdef class TextReader: | |
object na_values | ||
object memory_map | ||
object as_recarray | ||
object header, orig_header, names, header_start, header_end | ||
object header, orig_header, names, header_start, header_end, file_header | ||
object index_col | ||
object low_memory | ||
object skiprows | ||
|
@@ -775,6 +775,12 @@ cdef class TextReader: | |
data_line = hr + 1 | ||
header.append(this_header) | ||
|
||
self.file_header = header[:] | ||
|
||
#if self.usecols is not None: | ||
# if len(set(self.usecols) - set(header[0])) > 0 and len(set(self.usecols) - set(range(0,field_count))) > 0: | ||
# raise ValueError("Usecols do not match names.") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you clean this up? |
||
|
||
if self.names is not None: | ||
header = [ self.names ] | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this to 0.20.0.txt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pd.read_csv()
instead ofpd.read_csv
header
andusecols
length match. It's about properly handling situations whenusecols
provides non-existent columns.