encoding not respected on read_msgpack #10581

Closed
ruidc opened this issue Jul 15, 2015 · 1 comment
Labels
Bug Unicode Unicode strings
Milestone

Comments

@ruidc
Contributor

ruidc commented Jul 15, 2015

As discussed on https://groups.google.com/forum/#!topic/pydata/ngROaML_hLI,
the encoding does not seem to be respected when reading a msgpack. Below, I expect to get back what
I put in as utf8:

In [17]: s
Out[17]: u'\u2019'

In [18]: s = pd.Series({'a' : u"\u2019" })

In [19]: s.values[0]
Out[19]: u'\u2019'

In [20]: pd.read_msgpack(s.to_msgpack(encoding='utf8')).values[0]
Out[20]: u'\xe2\x80\x99'

In stepping through, part of the problem seems to be that the call to unpack at https://github.com/pydata/pandas/blob/master/pandas/io/packers.py#L134 passes no encoding argument, so the encoding defaults to latin1 at https://github.com/pydata/pandas/blob/master/pandas/io/packers.py#L558
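The output in Out[20] is consistent with UTF-8 bytes being decoded as latin1. A minimal sketch of that mismatch in plain Python, without pandas or msgpack (the variable names are only illustrative):

```python
# The character from the report: RIGHT SINGLE QUOTATION MARK.
original = u"\u2019"

# Encoding as UTF-8 produces three bytes.
packed = original.encode("utf8")

# Decoding those bytes as latin1 maps each byte to its own code point,
# which reproduces the value shown in Out[20].
wrong = packed.decode("latin1")

# Decoding with the encoding that was used for writing restores the input.
right = packed.decode("utf8")

print(repr(wrong))
print(right == original)
```

latin1 assigns a character to every byte value, so the wrong decode never raises and the corruption goes unnoticed until the values are inspected.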

Changing L134 to:

l = list(unpack(fh, **kwargs))

and passing the encoding like:

pandas.read_msgpack(m, encoding='utf8') 

makes it work for me. However, I don't have an environment set up to submit this as a pull request via GH, and we're still using 0.14.1 due to compatibility issues.
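The proposed change amounts to forwarding the caller's keyword arguments to the lower-level unpacker. A rough sketch of that pattern follows; `unpack`, `read_dropping_kwargs`, and `read_forwarding_kwargs` are hypothetical stand-ins written for illustration, not the actual functions in pandas/io/packers.py:

```python
def unpack(data, encoding="latin1"):
    """Hypothetical low-level unpacker with a latin1 default."""
    return data.decode(encoding)


def read_dropping_kwargs(data, **kwargs):
    # Mirrors the reported behaviour: encoding is accepted by the
    # reader but never handed to unpack, so the default applies.
    return unpack(data)


def read_forwarding_kwargs(data, **kwargs):
    # Mirrors the proposed fix: keyword arguments are passed through.
    return unpack(data, **kwargs)


payload = u"\u2019".encode("utf8")

broken = read_dropping_kwargs(payload, encoding="utf8")
fixed = read_forwarding_kwargs(payload, encoding="utf8")

print(repr(broken))
print(fixed == u"\u2019")
```

With forwarding in place, the reader still has to be given the same encoding that was used when writing, as in the `encoding='utf8'` call above.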

@jreback added the Bug, Unicode, and Msgpack labels Jul 15, 2015
@jreback added this to the 0.17.0 milestone Jul 15, 2015
@ruidc
Contributor Author

ruidc commented Jul 15, 2015

On Py2.7, even after making this change, the following call surprisingly raises a UnicodeDecodeError in msgpack.cpp:

pandas.read_msgpack(pandas.DataFrame([[401L, u'a']], index=[0], columns=['k', 'v']).to_msgpack(encoding='utf8'), encoding='utf8')

But I'm having trouble stepping through, as my PyCharm environment crashes on me when inspecting.
Curiously, some small changes, like changing the 401L to 40L, don't raise the error.
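One possible explanation for the 401 versus 40 difference, not confirmed anywhere in this thread, is that the integer column is serialized as raw little-endian 64-bit bytes and the unpacker then tries to decode those bytes as UTF-8 text. The sketch below only checks that assumption at the byte level; `decodes_as_utf8` is a hypothetical helper, and nothing here calls pandas or msgpack:

```python
import struct


def decodes_as_utf8(raw):
    """Return True if the raw bytes form valid UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf8")
        return True
    except UnicodeDecodeError:
        return False


# Assumed representation: one little-endian signed 64-bit integer.
raw_401 = struct.pack("<q", 401)  # first byte is 0x91
raw_40 = struct.pack("<q", 40)    # first byte is 0x28, plain ASCII

ok_401 = decodes_as_utf8(raw_401)
ok_40 = decodes_as_utf8(raw_40)

print(ok_401, ok_40)
```

Under that assumption, the byte 0x91 cannot start a UTF-8 sequence, so the decode fails for 401, while the bytes for 40 are all ASCII and decode without error.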
