-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Avoid flaky usecols set in C engine #14984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -987,24 +987,42 @@ def _evaluate_usecols(usecols, names): | |
|
||
def _validate_usecols_arg(usecols): | ||
""" | ||
Check whether or not the 'usecols' parameter | ||
contains all integers (column selection by index), | ||
strings (column by name) or is a callable. Raises | ||
a ValueError if that is not the case. | ||
Validate the 'usecols' parameter. | ||
|
||
Checks whether or not the 'usecols' parameter contains all integers | ||
(column selection by index), strings (column by name) or is a callable. | ||
Raises a ValueError if that is not the case. | ||
|
||
Parameters | ||
---------- | ||
usecols : array-like, callable, or None | ||
List of columns to use when parsing or a callable that can be used | ||
to filter a list of table columns. | ||
|
||
Returns | ||
------- | ||
usecols_tuple : tuple | ||
A tuple of (verified_usecols, usecols_dtype). | ||
|
||
'verified_usecols' is either a set if an array-like is passed in or | ||
'usecols' if a callable or None is passed in. | ||
|
||
'usecols_dtype` is the inferred dtype of 'usecols' if an array-like | ||
is passed in or None if a callable or None is passed in. | ||
""" | ||
msg = ("'usecols' must either be all strings, all unicode, " | ||
"all integers or a callable") | ||
|
||
if usecols is not None: | ||
if callable(usecols): | ||
return usecols | ||
return usecols, None | ||
usecols_dtype = lib.infer_dtype(usecols) | ||
if usecols_dtype not in ('empty', 'integer', | ||
'string', 'unicode'): | ||
raise ValueError(msg) | ||
|
||
return set(usecols) | ||
return usecols | ||
return set(usecols), usecols_dtype | ||
return usecols, None | ||
|
||
|
||
def _validate_parse_dates_arg(parse_dates): | ||
|
@@ -1473,7 +1491,8 @@ def __init__(self, src, **kwds): | |
self._reader = _parser.TextReader(src, **kwds) | ||
|
||
# XXX | ||
self.usecols = _validate_usecols_arg(self._reader.usecols) | ||
self.usecols, self.usecols_dtype = _validate_usecols_arg( | ||
self._reader.usecols) | ||
|
||
passed_names = self.names is None | ||
|
||
|
@@ -1549,12 +1568,29 @@ def close(self): | |
pass | ||
|
||
def _set_noconvert_columns(self): | ||
""" | ||
Set the columns that should not undergo dtype conversions. | ||
|
||
Currently, any column that is involved with date parsing will not | ||
undergo such conversions. | ||
""" | ||
names = self.orig_names | ||
usecols = self.usecols | ||
if self.usecols_dtype == 'integer': | ||
# A set of integers will be converted to a list in | ||
# the correct order every single time. | ||
usecols = list(self.usecols) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so this is very confusing here. You are now converting to a list. So There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add a short docstring for this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback : During @jorisvandenbossche : Added docstring. |
||
elif (callable(self.usecols) or | ||
self.usecols_dtype not in ('empty', None)): | ||
# The names attribute should have the correct columns | ||
# in the proper order for indexing with parse_dates. | ||
usecols = self.names[:] | ||
else: | ||
# Usecols is empty. | ||
usecols = None | ||
|
||
def _set(x): | ||
if usecols and is_integer(x): | ||
x = list(usecols)[x] | ||
if usecols is not None and is_integer(x): | ||
x = usecols[x] | ||
|
||
if not is_integer(x): | ||
x = names.index(x) | ||
|
@@ -1792,7 +1828,7 @@ def __init__(self, f, **kwds): | |
self.skipinitialspace = kwds['skipinitialspace'] | ||
self.lineterminator = kwds['lineterminator'] | ||
self.quoting = kwds['quoting'] | ||
self.usecols = _validate_usecols_arg(kwds['usecols']) | ||
self.usecols, _ = _validate_usecols_arg(kwds['usecols']) | ||
self.skip_blank_lines = kwds['skip_blank_lines'] | ||
|
||
self.names_passed = kwds['names'] or None | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this still have a
set
conversion here? is this for a different case? if so, then should happen in your validation below (to make it more consistent), IOW, this always returns tuple of (list of usecols, dtype of usecols). Please make the doc-string indicate that as well (t he doc-string for this function)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I fully follow your comment here. The reason I return
set
is for thePython
engine. I can clarify the doc-string to returnset
(orNone
) along thedtype
ofusecols
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more clear to have
_validate_use_cols
always return (list, dtype) (for all engines). It makes it very clear what it does. Then have it convert to aset
elsewhere (as you do in_set_noconvert_columns
. otherwise you have 2 places which do this conversion (from list -> set), depending on the engine. again confusing.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it's always
set
for both engines. This function returns either (independent of engine):set
None
callable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : I updated the documentation to illustrate the return type is independent of engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still, why are we returning a
set
here? why not a list? (for this same case)? then you don't have to rely on the ordering and convert to a list if its integer type?could just do this: http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-whilst-preserving-order
then make it into a set at the last moment.
I am nitpicking on this because converting this to a set is happening too early, and it is just confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Early conversion to
set
has been used for quite some time, especially with the C engine. Also, alist
is not necessarily going to help with ordering I might add:When I say "order," it is not about maintaining the order in which
usecols
is passed in but about orderingusecols
correctly so that the names or indices are in the same order as they are listed in the data columns. That is why converting at the last moment does not present any benefit AFAICT.