-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: read_csv throws UnicodeDecodeError with unicode aliases #13571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: read_csv throws UnicodeDecodeError with unicode aliases #13571
Conversation
see issue pandas-dev#13549 read_csv with engine=c throws error when encoding=UTF_16 or when encoding has _ or caps
@@ -0,0 +1,32 @@ | |||
import pandas, os, nose | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u move tests to pandas/io/tests/parser/common.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might as well test for variants of UTF8 as well (eg with - and _)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I added utf-8. How exactly do I run all the tests once I have it in pandas/io/tests/parser/common.py? I tried nosetests pandas/io/tests/parser/common.py
from root and it ran 0 tests. When I run it with nosetests pandas/tests/io/test_encoding_aliases.py
it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test suite is "short enough" that I believe you can run pd.test(raise_warnings=())
and cover everything. The way the test suite works is as follows:
The main test suite is parsers.py
, which imports from common.py
and a WHOLE TON of other modules in that same directory. It is these imports that make it tricky to run the suite individually. If you look at the test classes, you see that they inherit from test classes like those in common.py
to compose the test suites for each engine ('c'
and 'python'
).
improved testing, added utf-8 to testing, moved testing to pandas/io/tests/parser/common.py see issue # 13549
…tf-aliases master got one commit ahead before I noticed
|
||
for encoding in test_encodings: | ||
for engine in engines: | ||
out = pd.io.parsers.read_csv( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use self.read_csv
as test util automatically switches all engines (as below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, wasn't sure if that would work
read_csv with engine=c throws error when encoding=UTF_16 or when encoding has _ or uppercase improved testing loops and added multibyte testing see issue pandas-dev#13549
removed `pd` from `pd.DataFrame` see issue pandas-dev#13549
fixed pep8 formatting issue see issue pandas-dev#13549
Current coverage is 84.38% (diff: 100%)
|
expected.to_csv(path, encoding='utf-' + str(byte), index=False) | ||
for fmt in ['utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}']: | ||
encoding = fmt.format(byte) | ||
for engine in ['c', 'python', None]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't iterate thru the engines, this is what self.read_csv
does automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should I just write it like
for byte in [8, 16]:
expected.to_csv(path, encoding='utf-' + str(byte), index=False)
for fmt in ['utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}']:
encoding = fmt.format(byte)
result = self.read_csv(
path,
encoding=encoding)
tm.assert_frame_equal(result, expected)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's correct (I didn't see this before because it was hidden away), see below.
read_csv
throws UnicodeDecodeError with unicode aliasesdef test_read_csv_utf_aliases(self): | ||
# see gh issue 13549 | ||
path = 'test.csv' | ||
expected = DataFrame({'A': [0, 1], 'B': [2, 3], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
We like to have tests that are as compact as possible. Do we really need to have this many rows for this test? Can we get away with just one? This becomes pertinent for my next point:
-
To make these tests as unit-like as possible, we would prefer NOT to use
to_csv
(if possible) and follow theStringIO(data)
paradigm. I believe that is possible here because you can encode strings asutf-8
orutf-16
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we could do one row as
expected = pd.DataFrame({'mb_num': [4.8], 'multibyte': ['test']})
I used BytesIO because I don't think StringIO can support different encodings (I tried and wasn't able to get StringIO to work).
-use BytesIO instead of reading & writing file -shorten expected DataFrame
…tf-aliases Merge latest doc commit from master.
Ok, I think it's good now. I was having trouble with |
@nateGeorge you picked some unneeded changes in v0.19.0.
should work fine. What didn't work? |
Hmm, I was following this guide, which said to do:
which gives a hash for the common base node of my branch and the master, then
using that hash. This gave me the error:
Then I looked at the file it couldn't apply some commit to (I think v0.19.0.txt), and it had dozens of merge conflicts. It seemed like I would have to go look up each commit to see which part of the merge to use. I fixed one file, and then did So I guess next time I should just do
and forget the whole hash thing. |
You will still have to clean up the commits now (many unrelated here, and the diff is also not fully correct) |
Yes. I think I'll get to it tomorrow, it looks like a giant mess now 💥 |
…as into fix/read_csv-utf-aliases
@nateGeorge It seems it still didn't work well. I would do the following to fix it:
|
But possibly you will have to reset the last few commits, because now there is no diff anymore. |
What I just did was delete everything locally and replace with the current upstream/master from a zip. I couldn't figure out another way to do it for over an hour. Thanks for the tip, maybe I'll use it in the future. I'm working on replacing what I had in there now. |
change encoding to lowercase sub - for _ see pandas-dev#13549
I think it should be ready now. I tested the test again (still haven't figured out an easy way to do that) and it passed. |
@nateGeorge There are still a lot of commits here. If you use the approach I outlined above (#13571 (comment)), I think that should work. |
Ok, I'll try what you advised and see what happens. |
Hmm well it wasn't letting me overwrite the branch so I deleted it and re-created it with |
Normally not, as long as the branch you want to push has eventually the same name (after renaming), you can just push and the PR will be updated. |
The 'reopen and comment' button is greyed out and says 'The fix/read_csv-utf-aliases branch was force-pushed or recreated.' Start a new PR I guess? |
yep |
Alright, reopened as #14060 |
Rebased as PR #14060
utf-16
aliases do not work in read_csv with c engine #13549git diff upstream/master | flake8 --diff
-read_csv with engine=c throws error when encoding=UTF_16 (anything other than utf-16)
-improved nosetests and moved to in pandas/io/tests/common.py
-passes
pep8radius upstream/master --diff
andgit diff upstream/master | flake8 --diff
-put what's new entry in 0.19.0 in accordance with milestone posted on issue