pd.read_json ignoring encoding='utf-8-sig' #20598

Closed
xtofs opened this issue Apr 3, 2018 · 3 comments
Labels
Duplicate Report Duplicate issue or pull request IO JSON read_json, to_json, json_normalize

Comments

@xtofs

xtofs commented Apr 3, 2018

I work with a byte stream (from Azure DataLake https://pypi.python.org/pypi/azure-datalake-store/0.0.19 which only supports byte stream) that has a UTF-8 byte order mark, and want to read it into a data frame.

pandas.read_json fails.
For comparison, pd.read_csv(file, encoding='utf-8-sig') works fine with a similar file.

import pandas as pd

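def skip_utf_8_bom(file):
    # Consume a leading UTF-8 BOM if present; otherwise rewind to the start position.
    bom = file.read(3)
    if bom != b'\xef\xbb\xbf':
        file.seek(-len(bom), 1)  # undo read: step back by the bytes just consumed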

path = 'sample-utf-8-sig.txt'

#works
with open(path, 'rb') as file:   
    skip_utf_8_bom(file)
    df = pd.read_json(file, lines=True, encoding='utf-8-sig')    
df

#fails
with open(path, 'rb') as file: 
    df = pd.read_json(file, lines=True, encoding='utf-8-sig')    
df

Problem description

pd.read_json does not seem to honor the encoding='utf-8-sig' parameter when it is given a byte stream.
The expected behavior is that it can read byte streams that start with a UTF-8 byte order mark.
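
To make the difference between the two codecs concrete, here is a minimal standard-library sketch. It is not taken from the report: the sample bytes are a hypothetical stand-in for the attached file, and it uses the json module rather than pandas, so it only illustrates how a leading BOM behaves under 'utf-8' versus 'utf-8-sig'.

```python
import json

# Hypothetical sample: one JSON record preceded by the UTF-8 byte order mark.
raw = b'\xef\xbb\xbf{"Id": 273780, "MItemId": "M1001906-001"}\n'

as_utf8 = raw.decode('utf-8')      # the BOM survives as the character U+FEFF
as_sig = raw.decode('utf-8-sig')   # the BOM is stripped during decoding

# The json module refuses text that still starts with the BOM character.
try:
    json.loads(as_utf8)
    plain_utf8_ok = True
except json.JSONDecodeError:
    plain_utf8_ok = False

# With the BOM stripped, the same record parses normally.
record = json.loads(as_sig)
```

If the encoding argument is not applied when the bytes are decoded, the parser sees the first form, which would be consistent with the failure described here.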

sample-utf-8-sig.txt

@TomAugspurger
Contributor

Easiest to wrap your bytes-mode file in a TextIOWrapper.

In [14]: file = open('sample-utf-8-sig.txt', 'rb')

In [15]: file2 = io.TextIOWrapper(file, 'utf-8-sig')

In [16]: df = pd.read_json(file2, lines=True)

In [17]: df
Out[17]:
       Id       MItemId
0  273780  M1001906-001
1  273781  M1002085-001
2  273782  M1002086-001

Want to take a look at our read_json stuff to see if we're failing to do that in pandas?
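
For readers without the sample file, the wrapping step above can be sketched with an in-memory stream. This is an illustration under assumptions: the io.BytesIO contents are hypothetical data modeled on the output shown, and the lines are parsed with the json module so the snippet runs without pandas. The resulting text stream is the kind of object that was passed to pd.read_json(file2, lines=True) in the session above.

```python
import io
import json

# Hypothetical in-memory stand-in for the bytes-mode file.
raw = io.BytesIO(
    b'\xef\xbb\xbf'
    b'{"Id": 273780, "MItemId": "M1001906-001"}\n'
    b'{"Id": 273781, "MItemId": "M1002085-001"}\n'
)

# TextIOWrapper decodes on the fly; 'utf-8-sig' drops the leading BOM.
text = io.TextIOWrapper(raw, encoding='utf-8-sig')

# Parse one JSON document per non-empty line (line-delimited JSON).
records = [json.loads(line) for line in text if line.strip()]
```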

@TomAugspurger TomAugspurger added IO JSON read_json, to_json, json_normalize Effort Medium labels Apr 4, 2018
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Apr 4, 2018
@jreback
Contributor

jreback commented Apr 4, 2018

duplicate of #13774

@jreback jreback closed this as completed Apr 4, 2018
@jreback jreback added the Duplicate Report Duplicate issue or pull request label Apr 4, 2018
@xtofs
Author

xtofs commented Apr 5, 2018

Thanks Tom, Jeff for looking into that. The use of io.TextIOWrapper makes sense.

It still feels like read_json is ignoring the encoding parameter. Wouldn't this be something that pd.read_json should do by itself, something like if 'b' in file.mode: file = io.TextIOWrapper(file, encoding)?
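
The suggestion in the last sentence could look roughly like the sketch below. The helper name as_text_stream is hypothetical and is not part of pandas. The sketch also swaps the 'b' in file.mode check for an isinstance test, an assumption made because in-memory streams such as io.BytesIO do not expose a mode attribute.

```python
import io

def as_text_stream(file, encoding='utf-8'):
    """Hypothetical helper: wrap binary streams, pass text streams through."""
    if isinstance(file, io.TextIOBase):
        return file  # already yields str, nothing to decode
    return io.TextIOWrapper(file, encoding=encoding)

# A binary stream with a BOM is decoded and the BOM is stripped.
binary = io.BytesIO(b'\xef\xbb\xbf{"Id": 273780}\n')
first_line = as_text_stream(binary, 'utf-8-sig').readline()

# A text stream is returned unchanged.
already_text = io.StringIO('{"Id": 273781}\n')
same_object = as_text_stream(already_text) is already_text
```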
