
BUG: GH13219 Fixed. Allow unicode values in usecol #13233

Closed
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.18.2.txt
@@ -372,3 +372,4 @@ Bug Fixes


 - Bug in ``Categorical.remove_unused_categories()`` changes ``.codes`` dtype to platform int (:issue:`13261`)
+- Bug in ``pd.read_csv()`` that prevents ``usecol`` kwarg from accepting single-byte unicode strings (:issue:`13219`)
9 changes: 5 additions & 4 deletions pandas/io/parsers.py
@@ -882,12 +882,13 @@ def _validate_usecols_arg(usecols):
     or strings (column by name). Raises a ValueError
     if that is not the case.
     """
+    msg = ("The elements of 'usecols' must "
+           "either be all strings, all unicode, or all integers")
+
     if usecols is not None:
         usecols_dtype = lib.infer_dtype(usecols)
-        if usecols_dtype not in ('integer', 'string'):
-            raise ValueError(("The elements of 'usecols' "
-                              "must either be all strings "
-                              "or all integers"))
+        if usecols_dtype not in ('integer', 'string', 'unicode'):
+            raise ValueError(msg)

     return usecols
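
For readers without a pandas checkout, the check in this hunk can be approximated in plain Python 3. This is only a sketch under stated assumptions: `infer_label` is a hypothetical stand-in for `lib.infer_dtype` that covers just three cases, and it assumes Python 3 `bytes` plays the role of the Python 2 'string' label while `str` plays the role of 'unicode'. It is not the pandas implementation.

```python
def infer_label(values):
    """Hypothetical stand-in for lib.infer_dtype (three cases only).

    Assumption: bytes maps to 'string' and str maps to 'unicode',
    mirroring the Python 2 labels discussed in this PR.
    """
    kinds = set()
    for value in values:
        if isinstance(value, bool):
            # bool is a subclass of int; do not count it as an integer here.
            kinds.add('other')
        elif isinstance(value, int):
            kinds.add('integer')
        elif isinstance(value, bytes):
            kinds.add('string')
        elif isinstance(value, str):
            kinds.add('unicode')
        else:
            kinds.add('other')
    # A single recognised kind yields its label; anything else is 'mixed'.
    if len(kinds) == 1 and 'other' not in kinds:
        return kinds.pop()
    return 'mixed'


def validate_usecols(usecols):
    """Sketch of the validation in the hunk, built on infer_label."""
    msg = ("The elements of 'usecols' must "
           "either be all strings, all unicode, or all integers")
    if usecols is not None:
        if infer_label(usecols) not in ('integer', 'string', 'unicode'):
            raise ValueError(msg)
    return usecols


print(validate_usecols([u'AAA', u'BBB']))  # accepted: all text
print(validate_usecols([0, 2]))            # accepted: all integers
try:
    validate_usecols([u'AAA', b'BBB'])     # text mixed with bytes
except ValueError as exc:
    print('rejected:', exc)
```

Under these assumptions a homogeneous list passes through unchanged, while a list mixing text and bytes is rejected with the new message, which is the behaviour the added tests in this PR exercise.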

106 changes: 103 additions & 3 deletions pandas/io/tests/parser/usecols.py
@@ -6,6 +6,7 @@
 """

 from datetime import datetime
+import nose

 import pandas.util.testing as tm

@@ -22,9 +23,8 @@ def test_raise_on_mixed_dtype_usecols(self):
 1000,2000,3000
 4000,5000,6000
 """
-        msg = ("The elements of \'usecols\' "
-               "must either be all strings "
-               "or all integers")
+        msg = ("The elements of 'usecols' must "
+               "either be all strings, all unicode, or all integers")
         usecols = [0, 'b', 2]

         with tm.assertRaisesRegexp(ValueError, msg):
@@ -254,3 +254,103 @@ def test_usecols_with_parse_dates_and_usecol_names(self):
                            usecols=[3, 0, 2],
                            parse_dates=parse_dates)
         tm.assert_frame_equal(df, expected)
+
+    def test_usecols_with_unicode_strings(self):
+        # see gh-13219
+
+        s = '''AAA,BBB,CCC,DDD
+0.056674973,8,True,a
+2.613230982,2,False,b
+3.568935038,7,False,a
+'''
+
+        data = {
+            'AAA': {
+                0: 0.056674972999999997,
+                1: 2.6132309819999997,
+                2: 3.5689350380000002
+            },
+            'BBB': {0: 8, 1: 2, 2: 7}
+        }
+        expected = DataFrame(data)
+
+        df = self.read_csv(StringIO(s), usecols=[u'AAA', u'BBB'])
Member

Can you also test mixed string types, like usecols=[u'AAA', 'BBB']?

Contributor Author

Good idea. Will update.

Contributor Author

Should mixed strings like usecols=[u'AAA', 'BBB'] succeed? Or should it raise a more descriptive error, like ValueError: The elements of 'usecols' must all be of the same type?

Contributor

That should already raise, because:

In [2]: pd.lib.infer_dtype([u'AA', 'BB'])
Out[2]: 'mixed'

(certainly add a test for it)

Member

Ideally it should succeed. Mixing the two is quite common in countries that use two-byte characters. I don't think checking the type of every element in the "mixed" case can be a bottleneck.

Member

I was thinking of something like:

if usecols_dtype == "mixed" and not all(isinstance(x, compat.string_types) for x in usecols):
    raise...

The check is affordable because usecols is unlikely to be very long.
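
The element-wise fallback suggested in this comment could look roughly like the following. This is only a sketch: `STRING_TYPES` stands in for `compat.string_types` (an assumption; `(str, bytes)` is used so the mixed text/bytes case can be shown on Python 3), and `accepts_mixed_strings` is a hypothetical helper, not part of pandas.

```python
# Assumption: stands in for compat.string_types. (str, bytes) is used here
# so that the mixed text/bytes case can be demonstrated on Python 3.
STRING_TYPES = (str, bytes)


def accepts_mixed_strings(usecols):
    """Hypothetical helper: True if every element is string-like."""
    return all(isinstance(x, STRING_TYPES) for x in usecols)


print(accepts_mixed_strings([u'AAA', b'BBB']))  # True: both are string-like
print(accepts_mixed_strings([0, 'b', 2]))       # False: integers are mixed in
```

The scan is linear in the length of `usecols`, which is the basis for the claim that it would not become a bottleneck.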

Member

@gfyoung May 21, 2016

@jreback: given what you said here, what @sinhrks is suggesting above should not have been implemented at all.

To be fair, @sinhrks, I don't quite understand the rationale you give for accepting a mixture of unicode and string, since this iterative checking does seem cumbersome.

Contributor

This is working as expected, so something else is going on.

In [4]: pd.lib.infer_dtype(u'ああ,いい,ううう,ええええ'.split(','))
Out[4]: 'unicode'

In [5]: pd.lib.infer_dtype([u'A',u'B'])
Out[5]: 'unicode'

In [6]: pd.lib.infer_dtype([u'A','B'])
Out[6]: 'mixed'

In [7]: pd.lib.infer_dtype(['A','B'])
Out[7]: 'string'

Member

@jreback: I'm confused... what exactly were you checking here? I'm not sure it is related to the discussion about testing mixed string-type usecols (see the first comment by @sinhrks).

Contributor

Look at the code, this is EXACTLY what it does. It already checks string and integer; just add unicode and it should work. If it doesn't (and I guess the multi-byte case is failing it), then something else is causing the failure and needs to be looked at.

@hassanshamim please step through and see where it fails. This PR is getting way overcommented for a really simple thing.

+        tm.assert_frame_equal(df, expected)
+
+    def test_usecols_with_single_byte_unicode_strings(self):
+        # see gh-13219
+
+        s = '''A,B,C,D
+0.056674973,8,True,a
+2.613230982,2,False,b
+3.568935038,7,False,a
+'''
+
+        data = {
+            'A': {
+                0: 0.056674972999999997,
+                1: 2.6132309819999997,
+                2: 3.5689350380000002
+            },
+            'B': {0: 8, 1: 2, 2: 7}
+        }
+        expected = DataFrame(data)
+
+        df = self.read_csv(StringIO(s), usecols=[u'A', u'B'])
+        tm.assert_frame_equal(df, expected)
+
+    def test_usecols_with_mixed_encoding_strings(self):
+        s = '''AAA,BBB,CCC,DDD
+0.056674973,8,True,a
+2.613230982,2,False,b
+3.568935038,7,False,a
+'''
+
+        msg = ("The elements of 'usecols' must "
+               "either be all strings, all unicode, or all integers")
+
+        with tm.assertRaisesRegexp(ValueError, msg):
+            self.read_csv(StringIO(s), usecols=[u'AAA', b'BBB'])
+
+        with tm.assertRaisesRegexp(ValueError, msg):
+            self.read_csv(StringIO(s), usecols=[b'AAA', u'BBB'])
+
+    def test_usecols_with_multibyte_characters(self):
+        s = '''あああ,いい,ううう,ええええ
+0.056674973,8,True,a
+2.613230982,2,False,b
+3.568935038,7,False,a
+'''
+        data = {
+            'あああ': {
+                0: 0.056674972999999997,
+                1: 2.6132309819999997,
+                2: 3.5689350380000002
+            },
+            'いい': {0: 8, 1: 2, 2: 7}
+        }
+        expected = DataFrame(data)
+
+        df = self.read_csv(StringIO(s), usecols=['あああ', 'いい'])
+        tm.assert_frame_equal(df, expected)
+
+    def test_usecols_with_multibyte_unicode_characters(self):
+        raise nose.SkipTest('TODO: see gh-13253')
+
+        s = '''あああ,いい,ううう,ええええ
+0.056674973,8,True,a
+2.613230982,2,False,b
+3.568935038,7,False,a
+'''
+        data = {
+            'あああ': {
+                0: 0.056674972999999997,
+                1: 2.6132309819999997,
+                2: 3.5689350380000002
+            },
+            'いい': {0: 8, 1: 2, 2: 7}
+        }
+        expected = DataFrame(data)
+
+        df = self.read_csv(StringIO(s), usecols=[u'あああ', u'いい'])
+        tm.assert_frame_equal(df, expected)