encoding not respected on read_msgpack #10581

ruidc · 2015-07-15T11:36:53Z

As discussed on https://groups.google.com/forum/#!topic/pydata/ngROaML_hLI,
the encoding does not seem to be respected when reading a msgpack. Below, I expect to get back what
I put in as utf8:

In [17]: s
Out[17]: u'\u2019'

In [18]: s = pd.Series({'a' : u"\u2019" })

In [19]: s.values[0]
Out[19]: u'\u2019'

In [20]: pd.read_msgpack(s.to_msgpack(encoding='utf8')).values[0]
Out[20]: u'\xe2\x80\x99'

Stepping through, part of the problem seems to be that the call to unpack at https://github.com/pydata/pandas/blob/master/pandas/io/packers.py#L134 passes no encoding argument, so it defaults to latin1 at https://github.com/pydata/pandas/blob/master/pandas/io/packers.py#L558
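The output in Out[20] is consistent with UTF-8 bytes being decoded as latin1. A minimal sketch using only the standard library (no pandas or msgpack calls; the variable names are illustrative) reproduces the same string:

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce the mis-decoded value with plain codecs.
# Only str/bytes methods are used here; nothing below is pandas API.

original = u"\u2019"                      # RIGHT SINGLE QUOTATION MARK

# Writing with encoding='utf8' yields three bytes for this character.
utf8_bytes = original.encode("utf-8")

# Reading those bytes back as latin1 maps each byte to one code point.
misdecoded = utf8_bytes.decode("latin-1")

# Reading them back with the encoding used for writing round-trips cleanly.
redecoded = utf8_bytes.decode("utf-8")

print(repr(misdecoded))
print(redecoded == original)
```

Here `misdecoded` equals u'\xe2\x80\x99', matching Out[20], while `redecoded` equals the original string.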

Changing L134 to:

l = list(unpack(fh, **kwargs))

and passing the encoding like:

pandas.read_msgpack(m, encoding='utf8')

makes it work for me. However, I don't have an environment set up to submit this as a pull request via GH, and we're still using 0.14.1 due to compatibility issues.


ruidc · 2015-07-15T16:00:59Z

On Py2.7, even after making this change, the following surprisingly raises a UnicodeDecodeError in msgpack.cpp:

pandas.read_msgpack(pandas.DataFrame([[401L, u'a']], index=[0], columns=['k', 'v']).to_msgpack(encoding='utf8'), encoding='utf8')

I'm having trouble stepping through, as my PyCharm environment crashes when I inspect it.
Curiously, some small changes, like changing the 401L to 40L, don't raise the error.
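One possible reading of the 401L versus 40L difference, which is an assumption and not confirmed anywhere in this thread: if the integer column's raw buffer is decoded as UTF-8 on read, then whether the error appears depends on the byte values. The sketch below uses only the standard library and assumes a little-endian 64-bit integer layout; it does not call pandas or msgpack.

```python
import struct


def decodes_as_utf8(raw):
    """Return True if the byte string is valid UTF-8, False otherwise."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Assumed layout: each integer stored as a little-endian signed 64-bit value.
bytes_401 = struct.pack("<q", 401)   # first byte is 0x91, not a valid UTF-8 start byte
bytes_40 = struct.pack("<q", 40)     # first byte is 0x28, the rest are NUL: all ASCII

print(decodes_as_utf8(bytes_401))
print(decodes_as_utf8(bytes_40))
```

Under that assumption, 401 produces bytes that cannot be decoded as UTF-8, while 40 produces plain ASCII bytes that decode without error.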

jreback added the Bug, Unicode, and Msgpack labels Jul 15, 2015

jreback added this to the 0.17.0 milestone Jul 15, 2015

kawochen mentioned this issue Jul 28, 2015

BUG: GH10581 where read_msgpack does not respect encoding #10686

Merged

jreback closed this as completed in #10686 Aug 18, 2015
