BUG: GH13219 Fixed. Allow unicode values in usecol #13233

hassanshamim · 2016-05-19T21:27:26Z

closes Unicode not acceptable input for usecols kwarg in read_csv() #13219
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

This is my first time contributing to this project. Let me know what needs to be improved, especially with regard to the test and the bugfix. It was simpler than I imagined, which makes me suspect I did it wrong.

sinhrks · 2016-05-19T21:44:55Z

pandas/io/tests/parser/usecols.py

+        }
+        expected = DataFrame(data)
+
+        df = self.read_csv(StringIO(s), usecols=[u'AAA', u'BBB'])


Can you also test mixed str, like usecols=[u'AAA', 'BBB']?

Good idea. Will update.

Should mixed strings like usecols=[u'AAA', 'BBB'] be successful? Or should it raise a more descriptive error, like ValueError: The elements of 'usecols' must all be of the same type

that already should raise because

In [2]: pd.lib.infer_dtype([u'AA', 'BB'])
Out[2]: 'mixed'

(certainly add a test for it)

Ideally it should succeed. Mixing str and unicode is quite common in countries that use 2-byte characters. I don't think checking every element when the dtype is "mixed" would be a bottleneck.

I thought something like:

if usecols_dtype == "mixed" and not all(isinstance(x, compat.string_types) for x in usecols):
    raise ...

This should be cheap because usecols is unlikely to be very long.
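A minimal sketch of the permissive check proposed above, written against plain Python rather than pandas. The function name `check_usecols_permissive` is hypothetical, and Python 3 `bytes` and `str` are used as stand-ins for Python 2 `str` and `unicode`; both are assumptions for illustration only.

```python
# Sketch only: not the pandas implementation.
# Assumption: bytes ~ Python 2 `str`, str ~ Python 2 `unicode`.
STRING_TYPES = (str, bytes)

MSG = ("The elements of 'usecols' must "
       "either be all strings or all integers")


def check_usecols_permissive(usecols):
    """Accept all-integer or all-string-like usecols; reject anything else."""
    # bool is a subclass of int, so exclude it explicitly.
    if all(isinstance(x, int) and not isinstance(x, bool) for x in usecols):
        return usecols
    # A single linear pass; byte strings and text strings may be mixed.
    if all(isinstance(x, STRING_TYPES) for x in usecols):
        return usecols
    raise ValueError(MSG)


print(check_usecols_permissive(["AAA", b"BBB"]))  # mixed string kinds pass
try:
    check_usecols_permissive(["AAA", 1])
except ValueError as exc:
    print(exc)  # strings mixed with integers are still rejected
```

The cost is one pass over `usecols`, which is why the list length matters for the bottleneck question.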

@jreback : in reference to what you said here, then what @sinhrks is suggesting above should not have been implemented at all.

To be fair, @sinhrks I don't quite understand the rationale you give for accepting a mixture of unicode and string since this iterative checking does seem cumbersome.

This is working as expected, so something else is going on.

In [4]: pd.lib.infer_dtype(u'ああ,いい,ううう,ええええ'.split(','))
Out[4]: 'unicode'

In [5]: pd.lib.infer_dtype([u'A',u'B'])
Out[5]: 'unicode'

In [6]: pd.lib.infer_dtype([u'A','B'])
Out[6]: 'mixed'

In [7]: pd.lib.infer_dtype(['A','B'])
Out[7]: 'string'
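To make the four results above easier to reason about, here is a toy classifier that reproduces the same labels. It is not how pandas infers dtypes; `infer_kind` is a hypothetical helper, and Python 3 `bytes` and `str` stand in for Python 2 `str` and `unicode`.

```python
# Toy stand-in for the inference shown above; NOT pandas' implementation.
# Assumption: bytes ~ Python 2 `str`, str ~ Python 2 `unicode`.

def infer_kind(values):
    """Return a coarse label for the element types found in `values`."""
    kinds = set()
    for v in values:
        if isinstance(v, bool):
            kinds.add("other")      # bool subclasses int; keep it separate
        elif isinstance(v, int):
            kinds.add("integer")
        elif isinstance(v, str):
            kinds.add("unicode")
        elif isinstance(v, bytes):
            kinds.add("string")
        else:
            kinds.add("other")
    if not kinds:
        return "empty"
    if len(kinds) == 1:
        return kinds.pop()
    # Any mixture that includes integers gets its own label.
    return "mixed-integer" if "integer" in kinds else "mixed"


print(infer_kind("ああ,いい,ううう,ええええ".split(",")))  # unicode
print(infer_kind(["A", "B"]))                              # unicode
print(infer_kind(["A", b"B"]))                             # mixed
print(infer_kind([b"A", b"B"]))                            # string
```

Under this model the multibyte list is classified the same way as the single-byte one, which matches the observation that inference itself is working as expected.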

@jreback : I'm confused... what exactly were you checking here? I'm not sure it is related to this discussion about testing mixed string-type usecols (see the first comment by @sinhrks).

look at the code, this is EXACTLY what it does. it already checks for string and integer, so just add unicode and it should work. if it doesn't (and I guess the multi-byte case is what fails), then something else is causing the failure and needs to be looked at.

@hassanshamim pls step thru and see where it fails. This PR is getting way overcommented with a really simple thing.

hassanshamim · 2016-05-20T04:32:37Z

Currently the multibyte usecol tests are throwing errors for the test parsers. Is this expected? If so I'll adjust the tests.


======================================================================
ERROR: test_usecols_with_multibyte_unicode_characters (pandas.io.tests.parser.test_parsers.TestCParserHighMemory)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/tests/parser/usecols.py", line 309, in test_usecols_with_multibyte_unicode_characters
    df = self.read_csv(StringIO(s), usecols=[u'あああ', u'いい'])
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/tests/parser/test_parsers.py", line 57, in read_csv
    return read_csv(*args, **kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 799, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 1257, in __init__
    raise ValueError("Usecols do not match names.")
ValueError: Usecols do not match names.

======================================================================
ERROR: test_usecols_with_multibyte_unicode_characters (pandas.io.tests.parser.test_parsers.TestCParserLowMemory)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/tests/parser/usecols.py", line 309, in test_usecols_with_multibyte_unicode_characters
    df = self.read_csv(StringIO(s), usecols=[u'あああ', u'いい'])
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/tests/parser/test_parsers.py", line 76, in read_csv
    return read_csv(*args, **kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 799, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 1257, in __init__
    raise ValueError("Usecols do not match names.")
ValueError: Usecols do not match names.

======================================================================
ERROR: test_usecols_with_multibyte_unicode_characters (pandas.io.tests.parser.test_parsers.TestPythonParser)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/tests/parser/usecols.py", line 309, in test_usecols_with_multibyte_unicode_characters
    df = self.read_csv(StringIO(s), usecols=[u'あああ', u'いい'])
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/tests/parser/test_parsers.py", line 100, in read_csv
    return read_csv(*args, **kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 805, in _make_engine
    self._engine = klass(self.f, **self.options)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 1608, in __init__
    self.columns, self.num_original_columns = self._infer_columns()
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 1823, in _infer_columns
    line = self._buffered_line()
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 1975, in _buffered_line
    return self._next_line()
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 2006, in _next_line
    orig_line = next(self.data)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

----------------------------------------------------------------------
Ran 587 tests in 111.477s

FAILED (SKIP=21, errors=3)
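The third traceback ends in a UnicodeEncodeError from the ASCII codec. The snippet below reproduces that error class in isolation; it does not diagnose where in the parser the implicit encode happens. It uses Python 3 syntax, and `try_ascii` is a hypothetical helper written only for this illustration.

```python
def try_ascii(text):
    """Return the error message raised when encoding `text` as ASCII, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


header = "あああ"
# Three non-ASCII characters, so positions 0-2 are reported.
print(try_ascii(header))
# Pure ASCII column names encode without error.
print(try_ascii("AAA"))
# Each of these characters takes three bytes in UTF-8.
print(len(header.encode("utf-8")))
```

Encoding with an explicit codec such as UTF-8 succeeds, so the failure comes from falling back to ASCII rather than from the characters themselves.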

sinhrks · 2016-05-20T22:03:37Z

doc/source/whatsnew/v0.18.2.txt

@@ -154,3 +154,4 @@ Bug Fixes
 - Bug in ``Period`` addition raises ``TypeError`` if ``Period`` is on right hand side (:issue:`13069`)
 - Bug in ``Peirod`` and ``Series`` or ``Index`` comparison raises ``TypeError`` (:issue:`13200`)
 - Bug in ``pd.set_eng_float_format()`` that would prevent NaN's from formatting (:issue:`11981`)
+- Bug in ``pd.read_csv()`` that prevents ``usecol`` kwarg from accepting unicode (:issue:`13219`)


So we can't say "accepting unicode" in a broad sense, because 2-byte characters still do not work.

sinhrks · 2016-05-20T22:07:42Z

This should work, but currently it does not. Can we keep #13219 open to handle 2-byte characters?

jreback · 2016-05-20T22:11:01Z

how about we comment out these 2-byte tests for now (and make a new issue)?

@hassanshamim can you do that & rebase?

hassanshamim · 2016-05-21T02:02:21Z

Can do!

jreback · 2016-05-21T14:22:59Z

pandas/io/parsers.py

-                              "must either be all strings "
-                              "or all integers"))
+
+        if usecols_dtype == 'mixed':


I don't think this is necessary, only the bottom check.

Do you mean the usecols_dtype not in ('integer', 'string', 'unicode'): check? I took @sinhrks's advice: checking whether the type is mixed and ensuring the elements are all string_types allows usecols of mixed-encoding strings, and test_usecols_with_mixed_encoding_strings passes.

Or if you mean just have the if not all(map(lambda x: isinstance(x, string_types), usecols)) line, then just having that would fail with mixed integer and strings, as those are mixed-integer dtype rather than just mixed.

only valid types are string, integer, Unicode

mixed is not allowed (your check is duplicative)

@jreback : that's in conflict with what @sinhrks suggested and with what @hassanshamim did. Perhaps that should be sorted out?

i c - then I think the type inference is failing on the multi-byte Unicode check (which is what should be fixed)

I can't comment on the errors that @hassanshamim is getting when running the multibyte, but what I was suggesting was that there should be a nice helper function to check for a mixed array of unicode and string in inference.pyx to avoid this cumbersome iteration checking (we might be talking about different things here?)

mixed strings and Unicode are an error full stop

@jreback : that's where @sinhrks disagrees, hence my comment previously. I thought you had understood that when you said "i c"?

I wasn't referring to the multi-byte issue at all. I'm not sure why that is failing, especially since there are no issues when those usecols strings are not unicode, as I mentioned in another comment.

no, strings are either Unicode or not

So should I remove the first check and revert to the original, where passing mixed string and unicode i.e. usecols=[u'AA', 'BB'] fails?

hassanshamim · 2016-05-21T23:25:32Z

Created Issue BUG: pd.read_csv() fails when usecols contains multibyte unicode values #13253
Converted data strings from unicode to str in tests
Added test for multibyte characters in usecol (not unicode)
Reverted change that allowed mixed dtype usecol values
Rebased onto upstream/master

I think that's everything. I've been squashing my new commits and just force pushing to update this branch. Is that the normal process for incorporating feedback on PRs? Or should I have just left multiple commits then squashed once this was all ready?

gfyoung · 2016-05-21T23:34:42Z

pandas/io/tests/parser/usecols.py

+                          usecols=['AAA', u'BBB'])
+        self.assertRaises(ValueError, self.read_csv, StringIO(s),
+                          usecols=[u'AAA', 'BBB'])
+


Use assertRaisesRegexp from pandas.util.testing to test that you get the right error message. That's stronger than assertRaises.
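A minimal sketch of why matching the message is stronger than matching only the exception type. It uses the standard library's `unittest.TestCase.assertRaisesRegex` (the Python 3 spelling) rather than the pandas testing helper, and `validate` is a hypothetical stand-in for the function under test.

```python
import unittest

MSG = ("The elements of 'usecols' must "
       "either be all strings or all integers")


def validate(usecols):
    """Hypothetical stand-in that rejects mixed element types."""
    if len({type(x) for x in usecols}) > 1:
        raise ValueError(MSG)
    return usecols


class TestUsecolsMessage(unittest.TestCase):
    def test_mixed_raises_with_message(self):
        # Fails if a ValueError with a different message is raised.
        with self.assertRaisesRegex(ValueError, "must either be all strings"):
            validate(["AAA", b"BBB"])

    def test_homogeneous_passes(self):
        self.assertEqual(validate(["AAA", "BBB"]), ["AAA", "BBB"])


def run():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestUsecolsMessage)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run()
```

With a bare assertRaises, an unrelated ValueError raised earlier in the call would also make the test pass.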

gfyoung · 2016-05-21T23:37:23Z

@hassanshamim :

You forgot to add a test for single-byte unicode usecols (i.e. u'A')
Regarding squashing, I think that in this case, it is okay to squash because this really should be one commit. In cases where there might be a lot more changing, multiple commits would be preferred.

jreback · 2016-05-31T20:32:48Z

this looks good. @gfyoung any further comments?

gfyoung · 2016-05-31T20:43:41Z

pandas/io/parsers.py

@@ -882,12 +882,13 @@ def _validate_usecols_arg(usecols):
    or strings (column by name). Raises a ValueError
    if that is not the case.
    """
+    msg = ("The elements of 'usecols' must "
+           "either be all strings or all integers")
+


If we're going to raise on a mixture of unicode and string, the current error message is not very useful in that case. Either add to this error to account for that case, OR have a different error message.
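One way to act on this, sketched in plain Python: a strict check whose message names every accepted kind, using the wording adopted later in this thread. `validate_usecols_strict` is a hypothetical name, not the pandas function, and Python 3 `bytes` and `str` stand in for Python 2 `str` and `unicode`.

```python
# Sketch only: not the pandas implementation.
# Assumption: bytes ~ Python 2 `str`, str ~ Python 2 `unicode`.
ALLOWED = {int, str, bytes}

MSG = ("The elements of 'usecols' must "
       "either be all strings, all unicode, or all integers")


def validate_usecols_strict(usecols):
    """Require every element to share one allowed type."""
    types = {type(x) for x in usecols}
    # More than one type, or any type outside ALLOWED, is an error.
    if len(types) > 1 or not types <= ALLOWED:
        raise ValueError(MSG)
    return usecols


for candidate in (["AAA", "BBB"], [b"AAA", b"BBB"], [0, 1], ["AAA", b"BBB"]):
    try:
        validate_usecols_strict(candidate)
        print(candidate, "accepted")
    except ValueError as exc:
        print(candidate, "rejected:", exc)
```

Because the message lists all three accepted kinds, a user who mixes byte and text strings can tell which rule was violated.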

gfyoung · 2016-05-31T20:45:35Z

@hassanshamim : just a few comments, but almost there!

hassanshamim · 2016-05-31T21:56:12Z

@gfyoung updated with your recommendations. Are these formats okay?

def test_usecols_with_multibyte_unicode_characters(self):
    raise nose.SkipTest('TODO: see gh-13253')
    # actual test code...
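For comparison, the same skip-with-a-reference pattern using only the standard library. This is an illustrative sketch: `unittest.SkipTest` replaces `nose.SkipTest`, and the class name `TestUsecols` and the `run` helper are hypothetical.

```python
import unittest


class TestUsecols(unittest.TestCase):
    def test_usecols_with_multibyte_unicode_characters(self):
        # Skip until the follow-up issue is resolved; the reason is
        # recorded in the test report.
        raise unittest.SkipTest("TODO: see gh-13253")
        # actual test code would follow here


def run():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestUsecols)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    result = run()
    print(len(result.skipped), "skipped")
```

A skipped test is counted separately from failures and errors, so the suite still reports success while the skip reason stays visible.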

New error message:

msg = ("The elements of 'usecols' must "
       "either be all strings, all unicode, or all integers")

gfyoung · 2016-05-31T21:57:31Z

@hassanshamim : yep, that looks good, so LGTM now. Just ping when tests pass!

hassanshamim · 2016-06-01T00:29:17Z

@gfyoung tests have gone through.

jreback · 2016-06-01T11:13:44Z

thanks!

sinhrks added Unicode Unicode strings IO CSV read_csv, to_csv labels May 19, 2016

sinhrks reviewed May 19, 2016

hassanshamim force-pushed the bug-13219 branch from 367b905 to 5d1e6ec Compare May 20, 2016 04:24

sinhrks reviewed May 20, 2016

hassanshamim force-pushed the bug-13219 branch from 5d1e6ec to 71f5139 Compare May 21, 2016 07:17

jreback reviewed May 21, 2016

hassanshamim mentioned this pull request May 21, 2016

BUG: pd.read_csv() fails when usecols contains multibyte unicode values #13253

Closed

hassanshamim force-pushed the bug-13219 branch from 71f5139 to 2b4d907 Compare May 21, 2016 23:17

gfyoung reviewed May 21, 2016

hassanshamim force-pushed the bug-13219 branch from 2b4d907 to 17124ab Compare May 31, 2016 19:32

jreback added this to the 0.18.2 milestone May 31, 2016

gfyoung reviewed May 31, 2016

BUG: GH13219 Fixed. Allow unicode values in usecol

c30eeb5

hassanshamim force-pushed the bug-13219 branch from 17124ab to c30eeb5 Compare May 31, 2016 21:52

jreback closed this in fcd73ad Jun 1, 2016

hassanshamim deleted the bug-13219 branch June 1, 2016 23:39
