json encoding for python 2 #15715

mbochk · 2017-03-17T10:50:27Z

Code Sample, a copy-pastable example if possible

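import json
import pandas as pd

# path: location of a cp1251-encoded JSON file

# not working on Python 2: encoding= is mostly ignored
pd.read_json(path, encoding='cp1251')

# works on Python 2: json.load accepts the encoding argument
with open(path, 'r') as f:
    js = json.load(f, encoding='cp1251')
pd.DataFrame(js)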
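The working variant above relies on json.load accepting an encoding argument, which as far as I know is a Python 2 feature that later Python 3 releases ignore or reject. A minimal sketch of a version-independent workaround is to decode the file explicitly and only then parse it. The helper name load_json_with_encoding is hypothetical and not part of pandas or the json module.

```python
import io
import json


def load_json_with_encoding(path, encoding):
    """Hypothetical helper: decode the file explicitly, then parse it as JSON."""
    # io.open decodes bytes to text on both Python 2 and Python 3,
    # so the JSON parser never sees raw cp1251 bytes.
    with io.open(path, "r", encoding=encoding) as f:
        return json.load(f)
```

The returned object can then be handed to pd.DataFrame(...), in the same way as js in the sample above.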

Problem description

The docstring does not say explicitly that the encoding option is used on Python 3 only.

Currently pd.read_json mostly ignores the encoding= option on Python 2.
The function pd.common._get_handle warns when encoding is used together with compression, but otherwise silently continues without actually using the encoding.

It looks like the subtasks are split in a way that makes it awkward to pass the encoding up to the json.loads call.

Expected Output

One might expect pandas to use the encoding and make life easier (as pandas usually does ;) ).
Or at least to warn properly that the option is ignored.
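To illustrate the second option, here is a rough sketch of what such a warning could look like. The function check_encoding_support and its py_major parameter are hypothetical, introduced only for illustration; this is not how pandas implements or plans to implement it.

```python
import sys
import warnings


def check_encoding_support(encoding, py_major=None):
    """Hypothetical helper: warn when `encoding` would be silently ignored.

    Returns True if the encoding would be honoured, False otherwise.
    """
    if py_major is None:
        py_major = sys.version_info[0]
    if encoding is not None and py_major < 3:
        # Tell the caller explicitly instead of dropping the option silently.
        warnings.warn(
            "encoding=%r is ignored on Python 2" % (encoding,),
            UserWarning,
            stacklevel=2,
        )
        return False
    return True
```

A reader function could call this helper before opening the file, so that the user at least sees that the option has no effect.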

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0.post20161110
Cython: 0.24.1
numpy: 1.12.0
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: 0.7.6.None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None


jreback · 2017-03-17T13:03:23Z

this is currently an open issue, xref #13774

we have tests, but it's not implemented on the writer side; the reader side should work.

can you provide a reproducible example showing that this does not work? we can add that as a test.

mbochk · 2017-03-17T13:33:52Z

# Python 2 script
import json
import pandas as pd

# path = "path_to/example.txt"

try:
    # not working
    df1 = pd.read_json(path, encoding='cp1251')
except Exception:
    print "pd read failed"
else:
    print "pd read complete"
try:
    with open(path, 'r') as f:
        js = json.load(f, encoding='cp1251')
    df2 = pd.DataFrame(js)
    assert df2.shape == (1, 19)
except Exception:
    print "json read failed"
else:
    print "json read complete"

example.txt

I do get "pd read failed" and "json read complete" with the attached 'example.txt'.
I had to rename the extension, but it should be valid JSON in 'cp1251'
(Notepad++ says 'windows-1251', which is a synonym and gives the same results).

jreback · 2017-03-17T14:07:21Z

yep I agree. something is not getting decoded properly (works on py3, but not on 2). Want to have a look?

In [1]: pd.read_json('/Users/jreback/Downloads/example.txt', encoding='cp1251')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-4b2729700154> in <module>()
----> 1 pd.read_json('/Users/jreback/Downloads/example.txt', encoding='cp1251')

/Users/jreback/miniconda3/envs/py2.7/pandas/pandas/io/json/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines)
    347         obj = FrameParser(json, orient, dtype, convert_axes, convert_dates,
    348                           keep_default_dates, numpy, precise_float,
--> 349                           date_unit).parse()
    350 
    351     if typ == 'series' or obj is None:

/Users/jreback/miniconda3/envs/py2.7/pandas/pandas/io/json/json.pyc in parse(self)
    415 
    416         else:
--> 417             self._parse_no_numpy()
    418 
    419         if self.obj is None:

/Users/jreback/miniconda3/envs/py2.7/pandas/pandas/io/json/json.pyc in _parse_no_numpy(self)
    632         if orient == "columns":
    633             self.obj = DataFrame(
--> 634                 loads(json, precise_float=self.precise_float), dtype=None)
    635         elif orient == "split":
    636             decoded = dict((str(k), v)

ValueError: Invalid octet in UTF-8 sequence when decoding 'string'

On Python 3.5:

In [1]: pd.read_json('/Users/jreback/Downloads/example.txt', encoding='cp1251')
Out[1]: 
                                    ADRES AdmArea        DDOC  DMT        DREG  KAD_KV  KAD_RN  KAD_ZU       NDOC     NREG      SOOR  STRT                                      TDOC     UNOM  VLD  \
0  Бесединское шоссе, дом 17, строение 10      []  17.07.2015   17  22.07.2015       0       0       0  01-41-321  5015930  Строение    10  Распоряжение префектуры АО города Москвы  3811559  Дом   

                                         VYVAD                                            geoData  global_id  system_object_id  
0  адрес утвержден распорядительным документом  {'center': [[37.7690069572664, 55.623022198294...  163879706           3811559
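The ValueError in the Python 2 traceback suggests that the parser was handed the raw cp1251 bytes and tried to read them as UTF-8; that reading is an assumption based on the error message, not something confirmed in this thread. The sketch below only illustrates the underlying byte-level problem with a made-up one-field record: Cyrillic text encoded as cp1251 is not valid UTF-8, while transcoding it first makes it parseable.

```python
import json

# "\u0421\u0442\u0440\u043e\u0435\u043d\u0438\u0435" is the Cyrillic word from the SOOR column.
text = u'{"SOOR": "\u0421\u0442\u0440\u043e\u0435\u043d\u0438\u0435"}'
raw = text.encode("cp1251")  # bytes as they would sit in a cp1251 file

# Reading the cp1251 bytes as UTF-8 fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Transcoding to UTF-8 first gives bytes that a UTF-8-only parser can handle.
transcoded = raw.decode("cp1251").encode("utf-8")
roundtrip = json.loads(transcoded.decode("utf-8"))
```

Here utf8_ok ends up False, and roundtrip holds the original record with the Cyrillic value intact.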

jreback · 2018-04-04T13:41:47Z

duplicate of #13774

jreback added the IO JSON and Unicode labels Mar 17, 2017

jreback added the Compat and Difficulty Intermediate labels Mar 17, 2017

jreback added this to the Next Major Release milestone Mar 17, 2017

jreback closed this as completed Apr 4, 2018
