-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Avoid flaky usecols set in C engine #14984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Current coverage is 84.78% (diff: 100%)@@ master #14984 diff @@
==========================================
Files 145 145
Lines 51090 51094 +4
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43315 43319 +4
Misses 7775 7775
Partials 0 0
|
usecols_dtype = lib.infer_dtype(usecols) | ||
if usecols_dtype not in ('empty', 'integer', | ||
'string', 'unicode'): | ||
raise ValueError(msg) | ||
|
||
return set(usecols) | ||
return usecols | ||
return set(usecols), usecols_dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this still have a set
conversion here? is this for a different case? if so, then should happen in your validation below (to make it more consistent), IOW, this always returns tuple of (list of usecols, dtype of usecols). Please make the doc-string indicate that as well (t he doc-string for this function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I fully follow your comment here. The reason I return set
is for the Python
engine. I can clarify the doc-string to return set
(or None
) along the dtype
of usecols
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more clear to have _validate_use_cols
always return (list, dtype) (for all engines). It makes it very clear what it does. Then have it convert to a set
elsewhere (as you do in _set_noconvert_columns
. otherwise you have 2 places which do this conversion (from list -> set), depending on the engine. again confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it's always set
for both engines. This function returns either (independent of engine):
set
None
callable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : I updated the documentation to illustrate the return type is independent of engine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still, why are we returning a set
here? why not a list? (for this same case)? then you don't have to rely on the ordering and convert to a list if its integer type?
could just do this: http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-whilst-preserving-order
then make it into a set at the last moment.
I am nitpicking on this because converting this to a set is happening too early, and it is just confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Early conversion to set
has been used for quite some time, especially with the C engine. Also, a list
is not necessarily going to help with ordering I might add:
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = 'a,b,c\n1,2,3'
>>> read_csv(StringIO(data), usecols=[2, 1])
b c
0 2 3
When I say "order," it is not about maintaining the order in which usecols
is passed in but about ordering usecols
correctly so that the names or indices are in the same order as they are listed in the data columns. That is why converting at the last moment does not present any benefit AFAICT.
30bae02
to
6ac5814
Compare
if self.usecols_dtype == 'integer': | ||
# A set of integers will be converted to a list in | ||
# the correct order every single time. | ||
usecols = list(self.usecols) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is very confusing here. You are now converting to a list. So usecols
ends up as a list, None or a callable? when / how does it end up as a set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a short docstring for this _set_noconvert_columns
function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : During __init__
, both functions call _validate_usecols_arg
when setting self.usecols
, which returns either a callable, set
, or None
.
@jorisvandenbossche : Added docstring.
6ac5814
to
82cf55b
Compare
@jreback , @jorisvandenbossche : Addressed comments, and everything is green. |
ok thanks @gfyoung |
Explanation of the bug can be found here.
Closes #14792.