Skip to content

BUG: GH13219 Fixed. Allow unicode values in usecol #13233

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

hassanshamim
Copy link
Contributor

@hassanshamim hassanshamim commented May 19, 2016

This is my first time contributing to this project. Let me know what needs to be improved, especially with regards to the test and the bugfix. It was simpler than I imagined which makes me want to say I did it wrong.

@sinhrks sinhrks added Unicode Unicode strings IO CSV read_csv, to_csv labels May 19, 2016
}
expected = DataFrame(data)

df = self.read_csv(StringIO(s), usecols=[u'AAA', u'BBB'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also test mixed str, like usecols=[u'AAA', 'BBB']

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Will update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should mix strings like usecols=[u'AAA', 'BBB'] be successful? Or should it throw out a more descriptive error, like ValueError: The elements of 'usecols' must all be of the same type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that already should raise because

In [2]: pd.lib.infer_dtype([u'AA', 'BB'])
Out[2]: 'mixed'

(certainly add a test for it)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally it should success. It is quite popular in countries using 2bytes. I don't think checking all dtypes in "mixed" can be a bottleneck.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought something like:

if usecols_dtype == "mixed" and not all(isinstance(x, compat.string_types)):
    raise...

Because usecols length is less likely to be too long.

Copy link
Member

@gfyoung gfyoung May 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : in reference to what you said here, then what @sinhrks is suggesting above should not have been implemented at all.

To be fair, @sinhrks I don't quite understand the rationale you give for accepting a mixture of unicode and string since this iterative checking does seem cumbersome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is working as expected, so something else is going on.

In [4]: pd.lib.infer_dtype(u'ああ,いい,ううう,ええええ'.split(','))
Out[4]: 'unicode'

In [5]: pd.lib.infer_dtype([u'A',u'B'])
Out[5]: 'unicode'

In [6]: pd.lib.infer_dtype([u'A','B'])
Out[6]: 'mixed'

In [7]: pd.lib.infer_dtype(['A','B'])
Out[7]: 'string'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : I'm confused...what exactly were you checking here? Not sure if it is related to this discussion here with testing with mixed string-type usecols (see first comment by @sinhrks )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look at the code, this is EXACTLY what it does. it already check string and integer, just add unicode and it should work. if it doesn't (and I guess the multi-byte is failing it), then there is something else to look at that is causing a failure.

@hassanshamim pls step thru and see where it fails. This PR is getting way overcommented with a really simple thing.

@hassanshamim
Copy link
Contributor Author

Currently the multibyte usecol tests are throwing errors for the test parsers. Is this expected? If so I'll adjust the tests.


======================================================================
ERROR: test_usecols_with_multibyte_unicode_characters (pandas.io.tests.parser.test_parsers.TestCParserHighMemory)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/tests/parser/usecols.py", line 309, in test_usecols_with_multibyte_unicode_characters
    df = self.read_csv(StringIO(s), usecols=[u'あああ', u'いい'])
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/tests/parser/test_parsers.py", line 57, in read_csv
    return read_csv(*args, **kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 799, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 1257, in __init__
    raise ValueError("Usecols do not match names.")
ValueError: Usecols do not match names.

======================================================================
ERROR: test_usecols_with_multibyte_unicode_characters (pandas.io.tests.parser.test_parsers.TestCParserLowMemory)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/tests/parser/usecols.py", line 309, in test_usecols_with_multibyte_unicode_characters
    df = self.read_csv(StringIO(s), usecols=[u'あああ', u'いい'])
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/tests/parser/test_parsers.py", line 76, in read_csv
    return read_csv(*args, **kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 799, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 1257, in __init__
    raise ValueError("Usecols do not match names.")
ValueError: Usecols do not match names.

======================================================================
ERROR: test_usecols_with_multibyte_unicode_characters (pandas.io.tests.parser.test_parsers.TestPythonParser)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/tests/parser/usecols.py", line 309, in test_usecols_with_multibyte_unicode_characters
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/tests/parser/test_parsers.py", line 100, in read_csv
    return read_csv(*args, **kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 805, in _make_engine
    self._engine = klass(self.f, **self.options)
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 1608, in __init__
    self.columns, self.num_original_columns = self._infer_columns()
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 1823, in _infer_columns
    line = self._buffered_line()
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 1975, in _buffered_line
    return self._next_line()
  File "/Users/avendesora/Dropbox/code/python/pandas-hassan/pandas/io/parsers.py", line 2006, in _next_line
    orig_line = next(self.data)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

----------------------------------------------------------------------
Ran 587 tests in 111.477s

FAILED (SKIP=21, errors=3)

@@ -154,3 +154,4 @@ Bug Fixes
- Bug in ``Period`` addition raises ``TypeError`` if ``Period`` is on right hand side (:issue:`13069`)
- Bug in ``Peirod`` and ``Series`` or ``Index`` comparison raises ``TypeError`` (:issue:`13200`)
- Bug in ``pd.set_eng_float_format()`` that would prevent NaN's from formatting (:issue:`11981`)
- Bug in ``pd.read_csv()`` that prevents ``usecol`` kwarg from accepting unicode (:issue:`13219`)
Copy link
Member

@sinhrks sinhrks May 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we can't say "accepting unicode" in a broad sense, as there is a limitation for 2bytes.

@sinhrks
Copy link
Member

sinhrks commented May 20, 2016

This should work, but currently not. Can we keep open #13219 to handle 2 bytes?

@jreback
Copy link
Contributor

jreback commented May 20, 2016

how about we comment out these 2 byte tests for now (and make a new issue)?

@hassanshamim can you do that & rebase?

@hassanshamim
Copy link
Contributor Author

Can do!

"must either be all strings "
"or all integers"))

if usecols_dtype == 'mixed':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is necessary, only the bottom check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the usecols_dtype not in ('integer', 'string', 'unicode'):? I took @sinhrks advice and by checking if the type is mixed and ensuring they are all string_type, it allows usecols of mixed encoding strings, and test_usecols_with_mixed_encoding_strings passes.

Or if you mean just have the if not all(map(lambda x: isinstance(x, string_types), usecols)) line, then just having that would fail with mixed integer and strings, as those are mixed-integer dtype rather than just mixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only valid types are string, integer, Unicode

mixed is not allowed (your check is duplicative)

Copy link
Member

@gfyoung gfyoung May 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : that's in conflict with what @sinhrks is suggested below and with @hassanshamim did. Perhaps that should be sorted out?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i c - I think then that the infer types is failing on the multi byte Unicode check (which is what should be fixed)

Copy link
Member

@gfyoung gfyoung May 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't comment on the errors that @hassanshamim is getting when running the multibyte, but what I was suggesting was that there should be a nice helper function to check for a mixed array of unicode and string in inference.pyx to avoid this cumbersome iteration checking (we might be talking about different things here?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mixed strings and Unicode are an error full stop

Copy link
Member

@gfyoung gfyoung May 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback : that's where @sinhrks disagrees, hence my comment previously. I thought you had understood that when you said "i c"?

I wasn't referring anything at all to the multi-byte issue. Not sure why that is failing, especially since there are no issues when those usecols strings are not unicode as I mentioned in another comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no strings are either Unicode or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should I remove the first check and revert to the original, where passing mixed string and unicode i.e. usecols=[u'AA', 'BB'] fails?

@hassanshamim
Copy link
Contributor Author

hassanshamim commented May 21, 2016

I think that's everything. I've been squashing my new commits and just force pushing to update this branch. Is that the normal process for incorporating feedback on PRs? Or should I have just left multiple commits then squashed once this was all ready?

usecols=['AAA', u'BBB'])
self.assertRaises(ValueError, self.read_csv, StringIO(s),
usecols=[u'AAA', 'BBB'])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assertRaisesRegexp from pandas.util.testing to test that you get the right error message. That's stronger than assertRaises.

@gfyoung
Copy link
Member

gfyoung commented May 21, 2016

@hassanshamim :

  1. You forgot to add test for single-byte unicode usecols (i.e. u'A')
  2. Regarding squashing, I think that in this case, it is okay to squash because this really should be one commit. In cases where there might be a lot more changing, multiple commits would be preferred.

@jreback jreback added this to the 0.18.2 milestone May 31, 2016
@jreback
Copy link
Contributor

jreback commented May 31, 2016

this looks good. @gfyoung any further comments?

@@ -882,12 +882,13 @@ def _validate_usecols_arg(usecols):
or strings (column by name). Raises a ValueError
if that is not the case.
"""
msg = ("The elements of 'usecols' must "
"either be all strings or all integers")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to raise on a mixture of unicode and string, the current error message is not very useful in that case. Either add to this error to account for that case, OR have a different error message.

@gfyoung
Copy link
Member

gfyoung commented May 31, 2016

@hassanshamim : just a few comments, but almost there!

@hassanshamim
Copy link
Contributor Author

@gfyoung updated with your recommendations. Are these formats okay?

def test_usecols_with_multibyte_unicode_characters(self):
    raise nose.SkipTest('TODO: see gh-13253')
    # actual test code...

New error message:

msg = ("The elements of 'usecols' must "
            "either be all strings, all unicode, or all integers")

@gfyoung
Copy link
Member

gfyoung commented May 31, 2016

@hassanshamim : yep, that looks good, so LGTM now. Just ping when tests pass!

@hassanshamim
Copy link
Contributor Author

@gfyoung tests have gone through.

@jreback jreback closed this in fcd73ad Jun 1, 2016
@jreback
Copy link
Contributor

jreback commented Jun 1, 2016

thanks!

@hassanshamim hassanshamim deleted the bug-13219 branch June 1, 2016 23:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Unicode Unicode strings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unicode not acceptable input for usecols kwarg in read_csv()
4 participants