-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: usecols kwarg accepts string when it should only allow list-like or callable. #20558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
can you add a whatsnew note. @gfyoung can you review |
@@ -97,11 +97,11 @@ | |||
MultiIndex is used. If you have a malformed file with delimiters at the end | |||
of each line, you might consider index_col=False to force pandas to _not_ | |||
use the first column as the index (row names) | |||
usecols : array-like or callable, default None | |||
Return a subset of the columns. If array-like, all elements must either | |||
usecols : list-like or callable, default None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leave this alone actually. array-like
will suffice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. but below is_array_like will evaluate ['foo' ,'bar'] to False. so it's really a list-like IMHO.
pandas/pandas/core/dtypes/inference.py
Line 287 in 0a00365
def is_array_like(obj): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm...that's a fair point. Actually, leave that alone then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if you're going to change it here, make sure to do it in io.rst
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. it's done.
pandas/_libs/parsers.pyx
Outdated
self.usecols = usecols | ||
else: | ||
self.usecols = set(usecols) | ||
# GH20529, validate usecols at higher level. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's expand the comment to this:
# gh-20558: we will validate and clean "usecols" before getting here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
This patch looks pretty good! Save a few comments and the |
Codecov Report
@@ Coverage Diff @@
## master #20558 +/- ##
==========================================
- Coverage 91.83% 91.82% -0.02%
==========================================
Files 152 152
Lines 49260 49273 +13
==========================================
+ Hits 45240 45243 +3
- Misses 4020 4030 +10
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is the test for a single string?
@jreback added test. |
thanks @minggli keep em coming! |
read_csv
andusecols
#20529git diff upstream/master -u -- "*.py" | flake8 --diff
Since string is iterable, when passed into usecols in read_csv or read_table, it is currently treated as array of characters instead of being caught properly. For example, when usecols='bar', it is interpreted as ['b', 'a' ,'r'] in TextReader, and raising a column not found error in _validate_usecols_names when it really should raise ValueError as invalid value from _validate_usecols_arg, before the param being passed into TextReader.
It's a bug in _validate_usecols_arg and TextReader which lack handling of string-type iterable.
Added is_list_like check in _validate_usecols_arg to ensure passed value is list-like but not string. Also, array_like is defined as list_like with dtype attribute so update docstring to list-like from array-like.