BUG: read_csv throws UnicodeDecodeError with unicode aliases #13571

nateGeorge · 2016-07-05T21:32:40Z

Rebased as PR #14060

closes Codec utf-16 aliases do not work in read_csv with c engine #13549
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

-read_csv with engine=c throws error when encoding=UTF_16 (anything other than utf-16)
-improved nosetests and moved them to pandas/io/tests/common.py
-passes pep8radius upstream/master --diff and git diff upstream/master | flake8 --diff
-put what's new entry in 0.19.0 in accordance with milestone posted on issue

see issue pandas-dev#13549 read_csv with engine=c throws error when encoding=UTF_16 or when encoding has _ or caps
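For context, a minimal sketch of why these spellings are expected to behave the same: Python's own codec registry resolves all of them to one codec. This uses only the standard library `codecs` module and does not touch the pandas C parser; the `aliases` and `canonical` names are just illustrative.

```python
import codecs

# The four spellings exercised in this PR, all meant to name the same codec.
aliases = ["utf-16", "utf_16", "UTF-16", "UTF_16"]

# codecs.lookup() normalizes case and underscores before resolving the codec,
# so every alias maps to the same canonical codec name.
canonical = {alias: codecs.lookup(alias).name for alias in aliases}

# Bytes written with one spelling can be decoded with any other spelling.
payload = u"A,B\n1,2\n".encode("utf-16")
decoded = [payload.decode(alias) for alias in aliases]
```

Since plain Python accepts every alias, a parser that only recognizes the literal string `utf-16` is stricter than the codec machinery it sits on.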

sinhrks · 2016-07-05T22:52:16Z

pandas/tests/io/test_encoding_aliases.py

@@ -0,0 +1,32 @@
+import pandas, os, nose
+


Can you move the tests to pandas/io/tests/parser/common.py?

might as well test for variants of UTF8 as well (eg with - and _)

Ok, I added utf-8. How exactly do I run all the tests once I have it in pandas/io/tests/parser/common.py? I tried nosetests pandas/io/tests/parser/common.py from root and it ran 0 tests. When I run it with nosetests pandas/tests/io/test_encoding_aliases.py it works.

The test suite is "short enough" that I believe you can run pd.test(raise_warnings=()) and cover everything. The way the test suite works is as follows:

The main test suite is parsers.py, which imports from common.py and a WHOLE TON of other modules in that same directory. It is these imports that make it tricky to run the suite individually. If you look at the test classes, you see that they inherit from test classes like those in common.py to compose the test suites for each engine ('c' and 'python').
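A rough sketch of that composition pattern, using only the standard library. The `parse_csv` function and the class names are hypothetical stand-ins, not the pandas test classes; the point is that the shared class holds the tests but is not a `TestCase` itself, which is one reason a runner pointed at the shared module alone can collect 0 tests.

```python
import csv
import io
import unittest


def parse_csv(text, engine):
    """Hypothetical stand-in for a parser with two interchangeable engines."""
    if engine == "reader":
        rows = list(csv.reader(io.StringIO(text)))
        header, body = rows[0], rows[1:]
        return [dict(zip(header, row)) for row in body]
    if engine == "dictreader":
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]
    raise ValueError("unknown engine: %r" % engine)


class CommonParserTests(object):
    """Shared tests. Not a TestCase, so it is never collected on its own."""

    def test_single_row(self):
        # self.read_csv is supplied by the engine-specific subclass.
        result = self.read_csv(u"A,B\n1,2\n")
        self.assertEqual(result, [{"A": "1", "B": "2"}])


class TestReaderEngine(CommonParserTests, unittest.TestCase):
    def read_csv(self, text):
        return parse_csv(text, engine="reader")


class TestDictReaderEngine(CommonParserTests, unittest.TestCase):
    def read_csv(self, text):
        return parse_csv(text, engine="dictreader")


def run_suite():
    """Load both engine suites and run them, returning the TestResult."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for case in (TestReaderEngine, TestDictReaderEngine):
        suite.addTests(loader.loadTestsFromTestCase(case))
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Each shared test runs once per engine, so a test written once against `self.read_csv` covers every engine without an explicit loop.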

improved testing, added utf-8 to testing, moved testing to pandas/io/tests/parser/common.py see issue # 13549

…tf-aliases master got one commit ahead before I noticed

sinhrks · 2016-07-06T03:37:23Z

pandas/io/tests/parser/common.py

+
+        for encoding in test_encodings:
+            for engine in engines:
+                out = pd.io.parsers.read_csv(


Use self.read_csv; the test utility automatically switches between all engines (as below).

https://github.com/pydata/pandas/blob/master/pandas/io/tests/parser/test_parsers.py#L72

Thanks, wasn't sure if that would work

read_csv with engine=c throws error when encoding=UTF_16 or when encoding has _ or uppercase improved testing loops and added multibyte testing see issue pandas-dev#13549

removed `pd` from `pd.DataFrame` see issue pandas-dev#13549

fixed pep8 formatting issue see issue pandas-dev#13549

codecov-io · 2016-07-06T20:29:22Z

Current coverage is 84.38% (diff: 100%)

No coverage report found for master at 453bc26.

Powered by Codecov. Last update 453bc26...eeb7011

jreback · 2016-07-06T21:28:09Z

pandas/io/tests/parser/common.py

+            expected.to_csv(path, encoding='utf-' + str(byte), index=False)
+            for fmt in ['utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}']:
+                encoding = fmt.format(byte)
+                for engine in ['c', 'python', None]:


You don't need to iterate through the engines; self.read_csv does that automatically.

So should I just write it like

    for byte in [8, 16]:
        expected.to_csv(path, encoding='utf-' + str(byte), index=False)
        for fmt in ['utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}']:
            encoding = fmt.format(byte)
            result = self.read_csv(path, encoding=encoding)
            tm.assert_frame_equal(result, expected)

Yes, that's correct (I didn't see this before because it was hidden away), see below.

gfyoung · 2016-07-12T03:05:10Z

pandas/io/tests/parser/common.py

+    def test_read_csv_utf_aliases(self):
+        # see gh issue 13549
+        path = 'test.csv'
+        expected = DataFrame({'A': [0, 1], 'B': [2, 3],


We like to have tests that are as compact as possible. Do we really need to have this many rows for this test? Can we get away with just one? This becomes pertinent for my next point:

To make these tests as unit-like as possible, we would prefer NOT to use to_csv (if possible) and follow the StringIO(data) paradigm. I believe that is possible here because you can encode strings as utf-8 or utf-16.

I suppose we could do one row as

expected = pd.DataFrame({'mb_num': [4.8], 'multibyte': ['test']})

I used BytesIO because I don't think StringIO can support different encodings (I tried and wasn't able to get StringIO to work).
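A small standard-library sketch of the BytesIO approach, assuming a one-row CSV like the one proposed above. It does not call pandas; `read_rows` is a hypothetical helper that stands in for the reader, and in the real test the buffer would presumably be passed to self.read_csv with the same `encoding` argument.

```python
import csv
import io

# One-row CSV text matching the shortened expected frame.
text = u"mb_num,multibyte\n4.8,test\n"
expected = [["mb_num", "multibyte"], ["4.8", "test"]]


def read_rows(data, encoding):
    """Hypothetical helper: decode an in-memory byte buffer and parse it."""
    wrapper = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    return list(csv.reader(wrapper))


results = {}
for byte in [8, 16]:
    # Encode once with the canonical name; no file is written to disk.
    data = text.encode("utf-{0}".format(byte))
    for fmt in ["utf-{0}", "utf_{0}", "UTF-{0}", "UTF_{0}"]:
        encoding = fmt.format(byte)
        results[encoding] = read_rows(data, encoding)
```

The bytes have to be produced with an explicit encoding, which is why a byte buffer fits here where a text buffer does not.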

-use BytesIO instead of reading & writing file
-shorten expected DataFrame

…tf-aliases Merge latest doc commit from master.

see pandas-dev#13549

nateGeorge · 2016-08-15T20:43:40Z

Ok, I think it's good now. I was having trouble with git rebase, so I did a git merge instead.

jorisvandenbossche · 2016-08-15T21:00:36Z

@nateGeorge you picked up some unneeded changes in v0.19.0.
Normally, rebasing like:

git fetch upstream
git rebase upstream/master
git push -f origin fix/read_csv-utf-aliases

should work fine. What didn't work?

nateGeorge · 2016-08-15T21:11:43Z

Hmm, I was following this guide, which said to do:

git fetch upstream
git merge-base fix/read_csv-utf-aliases upstream/master

which gives a hash for the common base node of my branch and the master, then

git rebase -i ${HASH}

using that hash. This gave me the error:

error: could not apply ${some commit}

Then I looked at the file it couldn't apply some commit to (I think v0.19.0.txt), and it had dozens of merge conflicts. It seemed like I would have to go look up each commit to see which part of the merge to use. I fixed one file, and then did git rebase --continue, and the same error came up again on another file, and I think on the same file again. The main culprit was the whatsnew/v0.19.0.txt file.

So I guess next time I should just do

git fetch upstream
git rebase upstream/master
git push -f origin fix/read_csv-utf-aliases

and forget the whole hash thing.

jorisvandenbossche · 2016-08-15T21:42:35Z

You will still have to clean up the commits now (many unrelated here, and the diff is also not fully correct)

nateGeorge · 2016-08-15T23:01:54Z

Yes. I think I'll get to it tomorrow, it looks like a giant mess now 💥

…as into fix/read_csv-utf-aliases

…tf-aliases

jorisvandenbossche · 2016-08-19T11:23:20Z

@nateGeorge It seems it still didn't work well. I would do the following to fix it:

git checkout master
git checkout -b temp   # create a temporary branch to fix things
git merge --squash fix/read_csv-utf-aliases  ## squashes all commits of that branch that are ahead of master down into one
git checkout master
## first check that the temp branch is OK
git branch -m temp fix/read_csv-utf-aliases   ## rename to original branch

jorisvandenbossche · 2016-08-19T11:24:09Z

But possibly you will have to reset the last few commits, because now there is no diff anymore.

nateGeorge · 2016-08-19T11:28:13Z

What I just did was delete everything locally and replace with the current upstream/master from a zip. I couldn't figure out another way to do it for over an hour. Thanks for the tip, maybe I'll use it in the future. I'm working on replacing what I had in there now.

change encoding to lowercase sub - for _ see pandas-dev#13549

see pandas-dev#13549
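As a rough illustration of the normalization that commit message describes (lowercase the name and substitute - for _), here is a sketch. `normalize_encoding` is a hypothetical helper written for this note and is not the code from the patch.

```python
def normalize_encoding(encoding):
    """Hypothetical helper: lowercase the name and replace '_' with '-'."""
    if encoding is None:
        # No encoding given: leave it for the caller's default handling.
        return None
    return encoding.replace("_", "-").lower()


# Every spelling from the test collapses to the same canonical name.
normalized = {
    name: normalize_encoding(name)
    for name in ["utf-16", "utf_16", "UTF-16", "UTF_16"]
}
```

Applying this before the encoding reaches the parser means only one spelling ever has to be recognized downstream.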

nateGeorge · 2016-08-19T12:12:26Z

I think it should be ready now. I tested the test again (still haven't figured out an easy way to do that) and it passed.

jorisvandenbossche · 2016-08-19T20:58:51Z

@nateGeorge There are still a lot of commits here. If you use the approach I outlined above (#13571 (comment)), I think that should work.
I can also clean it up when merging if you want.

nateGeorge · 2016-08-19T23:00:47Z

Ok, I'll try what you advised and see what happens.

nateGeorge · 2016-08-19T23:10:55Z

Hmm well it wasn't letting me overwrite the branch so I deleted it and re-created it with git branch -m temp fix/read_csv-utf-aliases. Do I have to open a new PR now?

jorisvandenbossche · 2016-08-21T13:24:12Z

Normally not. As long as the branch you push ends up with the same name (after renaming), you can just push and the PR will be updated.
However, if you close the PR and push afterwards (or the branch on GitHub is changed in some way), GitHub does not allow you to reopen it. So see if you can reopen and then push; otherwise open a new PR.

nateGeorge · 2016-08-21T20:03:38Z

The 'reopen and comment' button is greyed out and says 'The fix/read_csv-utf-aliases branch was force-pushed or recreated.' Start a new PR I guess?

jorisvandenbossche · 2016-08-21T20:07:44Z

yep

nateGeorge · 2016-08-21T20:45:53Z

Alright, reopened as #14060

BUG: read_csv throws UnicodeDecodeError with unicode aliases

d485c4a

see issue pandas-dev#13549 read_csv with engine=c throws error when encoding=UTF_16 or when encoding has _ or caps

sinhrks added the Bug, Unicode (Unicode strings), and IO CSV (read_csv, to_csv) labels Jul 5, 2016

sinhrks reviewed Jul 5, 2016

nateGeorge added 2 commits July 5, 2016 20:43

BUG: read_csv throws UnicodeDecodeError with unicode

ae62350

improved testing, added utf-8 to testing, moved testing to pandas/io/tests/parser/common.py see issue # 13549

Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…

36bcdd8

…tf-aliases master got one commit ahead before I noticed

sinhrks reviewed Jul 6, 2016

nateGeorge added 3 commits July 6, 2016 10:30

BUG: read_csv throws UnicodeDecodeError with unicode aliases

285ccf9

read_csv with engine=c throws error when encoding=UTF_16 or when encoding has _ or uppercase improved testing loops and added multibyte testing see issue pandas-dev#13549

BUG: read_csv throws UnicodeDecodeError with unicode aliases

173c38b

removed `pd` from `pd.DataFrame` see issue pandas-dev#13549

BUG: read_csv throws UnicodeDecodeError with unicode aliases

78d46d6

fixed pep8 formatting issue see issue pandas-dev#13549

jreback reviewed Jul 6, 2016

jreback changed the title ~~BUG: read_csv throws UnicodeDecodeError with unicode aliases~~ BUG: read_csv throws UnicodeDecodeError with unicode aliases Jul 6, 2016

nateGeorge added 4 commits July 11, 2016 18:04

chore: matched master

35dfb13

DOC: add pd.read_csv bug pandas-dev#13549

71f084e

TST: out-> result and tm.ensure_clean

da8fce4

TST: conform to PEP8

1825486

gfyoung reviewed Jul 12, 2016

nateGeorge added 2 commits July 12, 2016 02:45

TST: condense test_read_utf_aliases test

1d30333

-use BytesIO instead of reading & writing file
-shorten expected DataFrame

Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…

4f680d7

…tf-aliases Merge latest doc commit from master.

nateGeorge added 3 commits August 15, 2016 14:14

docs: add note about read_csv() bug

9463dee

see pandas-dev#13549

cln: trying to merge with master

5198179

CLN: merge with master

3c30cd0

nateGeorge added 4 commits August 19, 2016 05:01

Merge branch 'fix/read_csv-utf-aliases' of github.com:nateGeorge/pand…

e77ac2d

…as into fix/read_csv-utf-aliases

CLN: reset to master branch

69ab536

Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…

1eb478d

…tf-aliases

CLN: fix small diff from upstream/master

a2f178f

jorisvandenbossche added this to the 0.19.0 milestone Aug 19, 2016

nateGeorge added 4 commits August 19, 2016 06:03

BUG: _read encoding fix

8e05f7e

change encoding to lowercase sub - for _ see pandas-dev#13549

DOC: add note on read_csv bug

ab153d5

see pandas-dev#13549

TST: add test for read_csv with unicode bug

0c1de9f

see pandas-dev#13549

CLN: fix indents and spacings

77ec966

nateGeorge closed this Aug 19, 2016

nateGeorge deleted the fix/read_csv-utf-aliases branch August 19, 2016 23:07
