Skip to content

BUG: usecols kwarg accepts string when it should only allow list-like or callable. #20558

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 1, 2018

Conversation

minggli
Copy link
Contributor

@minggli minggli commented Mar 30, 2018

Since string is iterable, when passed into usecols in read_csv or read_table, it is currently treated as array of characters instead of being caught properly. For example, when usecols='bar', it is interpreted as ['b', 'a' ,'r'] in TextReader, and raising a column not found error in _validate_usecols_names when it really should raise ValueError as invalid value from _validate_usecols_arg, before the param being passed into TextReader.

It's a bug in _validate_usecols_arg and TextReader which lack handling of string-type iterable.

>>> import os
>>> from itertools import repeat
>>> from pandas import *
>>>
>>> dummy = DataFrame({"foo": range(0, 5),
...                    "bar": range(10, 15),
...                    "b": [1, 2, 3, 5, 5],
...                    "a": list(repeat(3, 5)),
...                    "r": list(repeat(8, 5))})
>>>
>>> dummy.to_csv('dummy.csv', index=False)
>>> read_csv('dummy.csv', usecols=['foo'])
   foo
0    0
1    1
2    2
3    3
4    4
>>> read_csv('dummy.csv', usecols='bar')
   b  a  r
0  1  3  8
1  2  3  8
2  3  3  8
3  5  3  8
4  5  3  8
>>>
>>> os.remove('dummy.csv')

Added is_list_like check in _validate_usecols_arg to ensure passed value is list-like but not string. Also, array_like is defined as list_like with dtype attribute so update docstring to list-like from array-like.

@jreback jreback added Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv labels Mar 30, 2018
@jreback
Copy link
Contributor

jreback commented Mar 30, 2018

can you add a whatsnew note.

@gfyoung can you review

@@ -97,11 +97,11 @@
MultiIndex is used. If you have a malformed file with delimiters at the end
of each line, you might consider index_col=False to force pandas to _not_
use the first column as the index (row names)
usecols : array-like or callable, default None
Return a subset of the columns. If array-like, all elements must either
usecols : list-like or callable, default None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave this alone actually. array-like will suffice.

Copy link
Contributor Author

@minggli minggli Mar 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. but below is_array_like will evaluate ['foo' ,'bar'] to False. so it's really a list-like IMHO.

def is_array_like(obj):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm...that's a fair point. Actually, leave that alone then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if you're going to change it here, make sure to do it in io.rst as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. it's done.

self.usecols = usecols
else:
self.usecols = set(usecols)
# GH20529, validate usecols at higher level.
Copy link
Member

@gfyoung gfyoung Mar 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's expand the comment to this:

# gh-20558: we will validate and clean "usecols" before getting here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@gfyoung
Copy link
Member

gfyoung commented Mar 30, 2018

This patch looks pretty good! Save a few comments and the whatsnew, this should be merge-able.

@codecov
Copy link

codecov bot commented Mar 30, 2018

Codecov Report

Merging #20558 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20558      +/-   ##
==========================================
- Coverage   91.83%   91.82%   -0.02%     
==========================================
  Files         152      152              
  Lines       49260    49273      +13     
==========================================
+ Hits        45240    45243       +3     
- Misses       4020     4030      +10
Flag Coverage Δ
#multiple 90.2% <100%> (-0.02%) ⬇️
#single 41.9% <42.85%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.46% <100%> (ø) ⬆️
pandas/plotting/_converter.py 65.07% <0%> (-1.74%) ⬇️
pandas/io/clipboard/clipboards.py 30.58% <0%> (-1.6%) ⬇️
pandas/core/internals.py 95.53% <0%> (ø) ⬆️
pandas/io/formats/terminal.py 20.98% <0%> (ø) ⬆️
pandas/core/frame.py 97.19% <0%> (ø) ⬆️
pandas/core/series.py 93.88% <0%> (+0.02%) ⬆️
pandas/util/testing.py 84.73% <0%> (+0.2%) ⬆️
pandas/compat/__init__.py 58.43% <0%> (+0.69%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a00365...dafc442. Read the comment docs.

@jreback jreback added this to the 0.23.0 milestone Mar 31, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is the test for a single string?

@minggli
Copy link
Contributor Author

minggli commented Mar 31, 2018

@jreback added test.

@jreback jreback merged commit 6a22cf7 into pandas-dev:master Apr 1, 2018
@jreback
Copy link
Contributor

jreback commented Apr 1, 2018

thanks @minggli keep em coming!

@minggli
Copy link
Contributor Author

minggli commented Apr 1, 2018

happy to contribute. thanks for reviewing @jreback and @gfyoung !

@minggli minggli deleted the bugfix/usecols branch April 1, 2018 13:50
kornilova203 pushed a commit to kornilova203/pandas that referenced this pull request Apr 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unhelpful error message when loading a single column with read_csv and usecols
3 participants