Skip to content

BUG: read_csv throws UnicodeDecodeError with unicode aliases #13571

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

nateGeorge
Copy link
Contributor

@nateGeorge nateGeorge commented Jul 5, 2016

Rebased as PR #14060


-read_csv with engine=c throws error when encoding=UTF_16 (anything other than utf-16)
-improved nosetests and moved to in pandas/io/tests/common.py
-passes pep8radius upstream/master --diff and git diff upstream/master | flake8 --diff
-put what's new entry in 0.19.0 in accordance with milestone posted on issue

see issue pandas-dev#13549
read_csv with engine=c throws error when encoding=UTF_16
or when encoding has _ or caps
@sinhrks sinhrks added Bug Unicode Unicode strings IO CSV read_csv, to_csv labels Jul 5, 2016
@@ -0,0 +1,32 @@
import pandas, os, nose

Copy link
Member

@sinhrks sinhrks Jul 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u move tests to pandas/io/tests/parser/common.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might as well test for variants of UTF8 as well (eg with - and _)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I added utf-8. How exactly do I run all the tests once I have it in pandas/io/tests/parser/common.py? I tried nosetests pandas/io/tests/parser/common.py from root and it ran 0 tests. When I run it with nosetests pandas/tests/io/test_encoding_aliases.py it works.

Copy link
Member

@gfyoung gfyoung Jul 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test suite is "short enough" that I believe you can run pd.test(raise_warnings=()) and cover everything. The way the test suite works is as follows:

The main test suite is parsers.py, which imports from common.py and a WHOLE TON of other modules in that same directory. It is these imports that make it tricky to run the suite individually. If you look at the test classes, you see that they inherit from test classes like those in common.py to compose the test suites for each engine ('c' and 'python').

improved testing, added utf-8 to testing,
moved testing to pandas/io/tests/parser/common.py
see issue # 13549
…tf-aliases

master got one commit ahead before I noticed

for encoding in test_encodings:
for engine in engines:
out = pd.io.parsers.read_csv(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use self.read_csv as test util automatically switches all engines (as below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, wasn't sure if that would work

read_csv with engine=c throws error when encoding=UTF_16
or when encoding has _ or uppercase
improved testing loops and added multibyte testing
see issue pandas-dev#13549
@codecov-io
Copy link

codecov-io commented Jul 6, 2016

Current coverage is 84.38% (diff: 100%)

No coverage report found for master at 453bc26.

Powered by Codecov. Last update 453bc26...eeb7011

expected.to_csv(path, encoding='utf-' + str(byte), index=False)
for fmt in ['utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}']:
encoding = fmt.format(byte)
for engine in ['c', 'python', None]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't iterate thru the engines, this is what self.read_csv does automatically.

Copy link
Contributor Author

@nateGeorge nateGeorge Jul 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should I just write it like

for byte in [8, 16]:
    expected.to_csv(path, encoding='utf-' + str(byte), index=False)
    for fmt in ['utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}']:
        encoding = fmt.format(byte)
        result = self.read_csv(
            path,
            encoding=encoding)
        tm.assert_frame_equal(result, expected)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct (I didn't see this before because it was hidden away), see below.

@jreback jreback changed the title BUG: read_csv throws UnicodeDecodeError with unicode aliases BUG: read_csv throws UnicodeDecodeError with unicode aliases Jul 6, 2016
def test_read_csv_utf_aliases(self):
# see gh issue 13549
path = 'test.csv'
expected = DataFrame({'A': [0, 1], 'B': [2, 3],
Copy link
Member

@gfyoung gfyoung Jul 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We like to have tests that are as compact as possible. Do we really need to have this many rows for this test? Can we get away with just one? This becomes pertinent for my next point:

  2. To make these tests as unit-like as possible, we would prefer NOT to use to_csv (if possible) and follow the StringIO(data) paradigm. I believe that is possible here because you can encode strings as utf-8 or utf-16.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we could do one row as

expected = pd.DataFrame({'mb_num': [4.8], 'multibyte': ['test']})

I used BytesIO because I don't think StringIO can support different encodings (I tried and wasn't able to get StringIO to work).

-use BytesIO instead of reading & writing file
-shorten expected DataFrame
@nateGeorge
Copy link
Contributor Author

Ok, I think it's good now. I was having trouble with git --rebase so I did a git merge instead.

@jorisvandenbossche
Copy link
Member

@nateGeorge you picked some unneeded changes in v0.19.0.
Normally, rebasing like:

git fetch upstream
git rebase upstream/master
git push -f origin fix/read_csv-utf-aliases

should work fine. What didn't work?

@nateGeorge
Copy link
Contributor Author

Hmm, I was following this guide, which said to do:

git fetch upstream
git merge-base fix/read_csv-utf-aliases upstream/master

which gives a hash for the common base node of my branch and the master, then

git rebase -i ${HASH}

using that hash. This gave me the error:

error: could not apply ${some commit}

Then I looked at the file it couldn't apply some commit to (I think v0.19.0.txt), and it had dozens of merge conflicts. It seemed like I would have to go look up each commit to see which part of the merge to use. I fixed one file, and then did git rebase --continue, and the same error came up again on another file, and I think on the same file again. The main culprit was the whatsnew/v0.19.0.txt file.

So I guess next time I should just do

git fetch upstream
git rebase upstream/master
git push -f origin fix/read_csv-utf-aliases

and forget the whole hash thing.

@jorisvandenbossche
Copy link
Member

You will still have to clean up the commits now (many unrelated here, and the diff is also not fully correct)

@nateGeorge
Copy link
Contributor Author

Yes. I think I'll get to it tomorrow, it looks like a giant mess now 💥

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 19, 2016

@nateGeorge It seems it still didn't work well. I would do the following to fix it:

git checkout master
git checkout -b temp   # create a temporary branch to fix things
git merge --squash fix/read_csv-utf-aliases  ## squashes all commits of that branch that are ahead of master down into one
git checkout master
## first check that the temp branch is OK
git branch -m temp fix/read_csv-utf-aliases   ## rename to orignal branch

@jorisvandenbossche
Copy link
Member

But possibly you will have to reset the last few commits, because now there is no diff anymore.

@nateGeorge
Copy link
Contributor Author

What I just did was delete everything locally and replace with the current upstream/master from a zip. I couldn't figure out another way to do it for over an hour. Thanks for the tip, maybe I'll use it in the future. I'm working on replacing what I had in there now.

@jorisvandenbossche jorisvandenbossche added this to the 0.19.0 milestone Aug 19, 2016
@nateGeorge
Copy link
Contributor Author

I think it should be ready now. I tested the test again (still haven't figured out an easy way to do that) and it passed.

@jorisvandenbossche
Copy link
Member

@nateGeorge There are still a lot of commits here. If you use the approach I outlined above (#13571 (comment)), I think that should work.
I can also clean it up when merging if you want.

@nateGeorge
Copy link
Contributor Author

Ok, I'll try what you advised and see what happens.

@nateGeorge nateGeorge closed this Aug 19, 2016
@nateGeorge nateGeorge deleted the fix/read_csv-utf-aliases branch August 19, 2016 23:07
@nateGeorge
Copy link
Contributor Author

Hmm well it wasn't letting me overwrite the branch so I deleted it and re-created it with git branch -m temp fix/read_csv-utf-aliases. Do I have to open a new PR now?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 21, 2016

Normally not, as long as the branch you want to push has eventually the same name (after renaming), you can just push and the PR will be updated.
However, if you close the PR and afterwards push (or the branch on github is changed in some way), github does not allow to reopen. So see if you can reopen and then push, otherwise open a new PR.

@nateGeorge
Copy link
Contributor Author

The 'reopen and comment' button is greyed out and says 'The fix/read_csv-utf-aliases branch was force-pushed or recreated.' Start a new PR I guess?

@jorisvandenbossche
Copy link
Member

yep

@nateGeorge
Copy link
Contributor Author

Alright, reopened as #14060

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv Unicode Unicode strings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Codec utf-16 aliases do not work in read_csv with c engine