pd.read_json ignoring encoding='utf-8-sig' #20598

xtofs · 2018-04-03T21:24:07Z

I work with a byte stream (from Azure Data Lake, https://pypi.python.org/pypi/azure-datalake-store/0.0.19, which only supports byte streams) that starts with a UTF-8 byte order mark, and I want to read it into a DataFrame.

pandas.read_json fails.
For comparison, pd.read_csv(file, encoding='utf-8-sig') works fine with a similar file.

import pandas as pd

def skip_utf_8_bom(file):
    # Consume a leading UTF-8 BOM if present; otherwise rewind to the start position.
    bom = file.read(3)
    if bom != b'\xef\xbb\xbf':
        file.seek(-len(bom), 1)  # undo the read

path = 'sample-utf-8-sig.txt'

# works: the BOM is skipped manually before parsing
with open(path, 'rb') as file:
    skip_utf_8_bom(file)
    df = pd.read_json(file, lines=True, encoding='utf-8-sig')
df

# fails: the BOM is still at the start of the stream
with open(path, 'rb') as file:
    df = pd.read_json(file, lines=True, encoding='utf-8-sig')
df

Problem description

pd.read_json does not seem to honor the encoding='utf-8-sig' parameter when it is given a byte stream.
The expected behavior is that it can read byte streams that start with a UTF-8 byte order mark.

sample-utf-8-sig.txt
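For anyone without the attachment, a comparable file can be generated with the sketch below. The row values are copied from the output shown later in this thread; the temporary directory and the exact JSON-lines layout are assumptions, since the original file's bytes are not reproduced here.

```python
import codecs
import json
import os
import tempfile

# Rows taken from the DataFrame output quoted in this thread.
rows = [
    {"Id": 273780, "MItemId": "M1001906-001"},
    {"Id": 273781, "MItemId": "M1002085-001"},
    {"Id": 273782, "MItemId": "M1002086-001"},
]

# Assumption: write into a temporary directory rather than the working directory.
path = os.path.join(tempfile.mkdtemp(), "sample-utf-8-sig.txt")

# The 'utf-8-sig' codec prepends the UTF-8 BOM on the first write.
with open(path, "w", encoding="utf-8-sig") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Check that the file really starts with the BOM bytes.
with open(path, "rb") as f:
    head = f.read(3)
starts_with_bom = head == codecs.BOM_UTF8
```

Opening `path` in 'rb' mode then gives a byte stream of the kind described above.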

TomAugspurger · 2018-04-04T12:50:15Z

Easiest to wrap your bytes-mode file in a TextIOWrapper.

In [14]: file = open('sample-utf-8-sig.txt', 'rb')

In [15]: file2 = io.TextIOWrapper(file, 'utf-8-sig')

In [16]: df = pd.read_json(file2, lines=True)

In [17]: df
Out[17]:
       Id       MItemId
0  273780  M1001906-001
1  273781  M1002085-001
2  273782  M1002086-001

Want to take a look at our read_json stuff to see if we're failing to do that in pandas?

jreback · 2018-04-04T13:42:26Z

duplicate of #13774

xtofs · 2018-04-05T02:24:58Z

Thanks Tom, Jeff for looking into that. The use of io.TextIOWrapper makes sense.

It still feels like read_json is ignoring the encoding parameter. Wouldn't this be something that pd.read_json should do by itself, something like: if 'b' in file.mode: file = io.TextIOWrapper(file, encoding)
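A minimal sketch of that idea, using only the standard library. The helper name `as_text_stream` is hypothetical and not part of pandas; it checks the stream type instead of `file.mode`, because in-memory streams such as `io.BytesIO` have no `mode` attribute.

```python
import io
import json


def as_text_stream(file, encoding="utf-8"):
    """Return a text-mode view of `file` (hypothetical helper, not pandas API)."""
    # Buffered binary streams (open(..., 'rb'), io.BytesIO) get wrapped;
    # anything else, such as an already-text stream, is returned unchanged.
    if isinstance(file, io.BufferedIOBase):
        return io.TextIOWrapper(file, encoding=encoding)
    return file


# A byte stream that starts with a UTF-8 BOM, followed by one JSON line.
raw = io.BytesIO(b'\xef\xbb\xbf{"Id": 273780, "MItemId": "M1001906-001"}\n')
text = as_text_stream(raw, encoding="utf-8-sig")

# The 'utf-8-sig' codec drops the BOM, so the first line parses as JSON.
first_record = json.loads(text.readline())
```

The wrapped stream can then be handed to pd.read_json(text, lines=True), as in Tom's example.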

TomAugspurger added the IO JSON (read_json, to_json, json_normalize) and Effort Medium labels Apr 4, 2018

TomAugspurger added this to the Next Major Release milestone Apr 4, 2018

jreback closed this as completed Apr 4, 2018

jreback added the Duplicate Report (duplicate issue or pull request) label Apr 4, 2018
