-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: GH13219 Fixed. Allow unicode values in usecol #13233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
} | ||
expected = DataFrame(data) | ||
|
||
df = self.read_csv(StringIO(s), usecols=[u'AAA', u'BBB']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also test mixed str, like usecols=[u'AAA', 'BBB']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should mix strings like usecols=[u'AAA', 'BBB']
be successful? Or should it throw out a more descriptive error, like ValueError: The elements of 'usecols' must all be of the same type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that already should raise because
In [2]: pd.lib.infer_dtype([u'AA', 'BB'])
Out[2]: 'mixed'
(certainly add a test for it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally it should success. It is quite popular in countries using 2bytes. I don't think checking all dtypes in "mixed" can be a bottleneck.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought something like:
if usecols_dtype == "mixed" and not all(isinstance(x, compat.string_types)):
raise...
Because usecols
length is less likely to be too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is working as expected, so something else is going on.
In [4]: pd.lib.infer_dtype(u'ああ,いい,ううう,ええええ'.split(','))
Out[4]: 'unicode'
In [5]: pd.lib.infer_dtype([u'A',u'B'])
Out[5]: 'unicode'
In [6]: pd.lib.infer_dtype([u'A','B'])
Out[6]: 'mixed'
In [7]: pd.lib.infer_dtype(['A','B'])
Out[7]: 'string'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
look at the code, this is EXACTLY what it does. it already check string
and integer
, just add unicode
and it should work. if it doesn't (and I guess the multi-byte is failing it), then there is something else to look at that is causing a failure.
@hassanshamim pls step thru and see where it fails. This PR is getting way overcommented with a really simple thing.
Currently the multibyte usecol tests are throwing errors for the test parsers. Is this expected? If so I'll adjust the tests.
|
@@ -154,3 +154,4 @@ Bug Fixes | |||
- Bug in ``Period`` addition raises ``TypeError`` if ``Period`` is on right hand side (:issue:`13069`) | |||
- Bug in ``Peirod`` and ``Series`` or ``Index`` comparison raises ``TypeError`` (:issue:`13200`) | |||
- Bug in ``pd.set_eng_float_format()`` that would prevent NaN's from formatting (:issue:`11981`) | |||
- Bug in ``pd.read_csv()`` that prevents ``usecol`` kwarg from accepting unicode (:issue:`13219`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we can't say "accepting unicode" in a broad sense, as there is a limitation for 2bytes.
This should work, but currently not. Can we keep open #13219 to handle 2 bytes? |
how about we comment out these 2 byte tests for now (and make a new issue)? @hassanshamim can you do that & rebase? |
Can do! |
"must either be all strings " | ||
"or all integers")) | ||
|
||
if usecols_dtype == 'mixed': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is necessary, only the bottom check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the usecols_dtype not in ('integer', 'string', 'unicode'):
? I took @sinhrks advice and by checking if the type is mixed and ensuring they are all string_type, it allows usecols of mixed encoding strings, and test_usecols_with_mixed_encoding_strings
passes.
Or if you mean just have the if not all(map(lambda x: isinstance(x, string_types), usecols))
line, then just having that would fail with mixed integer and strings, as those are mixed-integer
dtype rather than just mixed
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only valid types are string, integer, Unicode
mixed is not allowed (your check is duplicative)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : that's in conflict with what @sinhrks is suggested below and with @hassanshamim did. Perhaps that should be sorted out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i c - I think then that the infer types is failing on the multi byte Unicode check (which is what should be fixed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't comment on the errors that @hassanshamim is getting when running the multibyte, but what I was suggesting was that there should be a nice helper function to check for a mixed array of unicode
and string
in inference.pyx
to avoid this cumbersome iteration checking (we might be talking about different things here?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mixed strings and Unicode are an error full stop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : that's where @sinhrks disagrees, hence my comment previously. I thought you had understood that when you said "i c"?
I wasn't referring anything at all to the multi-byte issue. Not sure why that is failing, especially since there are no issues when those usecols
strings are not unicode
as I mentioned in another comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no strings are either Unicode or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should I remove the first check and revert to the original, where passing mixed string and unicode i.e. usecols=[u'AA', 'BB']
fails?
I think that's everything. I've been squashing my new commits and just force pushing to update this branch. Is that the normal process for incorporating feedback on PRs? Or should I have just left multiple commits then squashed once this was all ready? |
usecols=['AAA', u'BBB']) | ||
self.assertRaises(ValueError, self.read_csv, StringIO(s), | ||
usecols=[u'AAA', 'BBB']) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use assertRaisesRegexp
from pandas.util.testing
to test that you get the right error message. That's stronger than assertRaises
.
|
this looks good. @gfyoung any further comments? |
@@ -882,12 +882,13 @@ def _validate_usecols_arg(usecols): | |||
or strings (column by name). Raises a ValueError | |||
if that is not the case. | |||
""" | |||
msg = ("The elements of 'usecols' must " | |||
"either be all strings or all integers") | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to raise on a mixture of unicode
and string
, the current error message is not very useful in that case. Either add to this error to account for that case, OR have a different error message.
@hassanshamim : just a few comments, but almost there! |
@gfyoung updated with your recommendations. Are these formats okay?
New error message:
|
@hassanshamim : yep, that looks good, so LGTM now. Just ping when tests pass! |
@gfyoung tests have gone through. |
thanks! |
usecols
kwarg inread_csv()
#13219git diff upstream/master | flake8 --diff
This is my first time contributing to this project. Let me know what needs to be improved, especially with regards to the test and the bugfix. It was simpler than I imagined which makes me want to say I did it wrong.