DataFrame.to_json() produces malformed JSON when DataFrame contains tuples as column #20500

k-yaegashi · 2018-03-27T12:30:18Z

Code Sample, a copy-pastable example if possible

# Your code here
# coding: utf-8
import json, traceback, pandas
test = pandas.DataFrame(data = {
    'key': ['a', 'a', 'b', 'b', 'a'],
    'value': [1, 2, 3, 4, 5]
})
stat = test.groupby('key').agg(['sum', 'mean', 'count'])
stat_json = stat.to_json()
print(stat)
print(stat_json)
try:
    json.loads(stat_json)
except ValueError:
    traceback.print_exc()
stat_reset = stat.reset_index()
stat_reset_json = stat_reset.to_json()
print(stat_reset)
print(stat_reset_json)
try:
    json.loads(stat_reset_json)
except ValueError:
    traceback.print_exc()

Result by Python 2.7

    value                
      sum      mean count
key                      
a       8  2.666667     3
b       7  3.500000     2
{"["value","sum"]":{"a":8,"b":7},"["value","mean"]":{"a":2.6666666667,"b":3.5},"["value","count"]":{"a":3,"b":2}}
Traceback (most recent call last):
  File "pycheck.py", line 12, in <module>
    json.loads(stat_json)
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting : delimiter: line 1 column 5 (char 4)
  key value                
        sum      mean count
0   a     8  2.666667     3
1   b     7  3.500000     2
{"["key",""]":{"0":"a","1":"b"},"["value","sum"]":{"0":8,"1":7},"["value","mean"]":{"0":2.6666666667,"1":3.5},"["value","count"]":{"0":3,"1":2}}
Traceback (most recent call last):
  File "pycheck.py", line 20, in <module>
    json.loads(stat_reset_json)
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting : delimiter: line 1 column 5 (char 4)

Result by Python 3

    value                
      sum      mean count
key                      
a       8  2.666667     3
b       7  3.500000     2
{"["value","sum"]":{"a":8,"b":7},"["value","mean"]":{"a":2.6666666667,"b":3.5},"["value","count"]":{"a":3,"b":2}}
Traceback (most recent call last):
  File "pycheck.py", line 12, in <module>
    json.loads(stat_json)
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ':' delimiter: line 1 column 5 (char 4)
  key value                
        sum      mean count
0   a     8  2.666667     3
1   b     7  3.500000     2
{"["key",""]":{"0":"a","1":"b"},"["value","sum"]":{"0":8,"1":7},"["value","mean"]":{"0":2.6666666667,"1":3.5},"["value","count"]":{"0":3,"1":2}}
Traceback (most recent call last):
  File "pycheck.py", line 20, in <module>
    json.loads(stat_reset_json)
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ':' delimiter: line 1 column 5 (char 4)

Problem description

DataFrame.to_json() returns malformed JSON when its column contains tuple object, such as ('value', 'sum'), ('value', 'mean') etc. in this case.

Expected Output

I think, at least in this case, correct output of double quote in strings should be escaped like this:

{"[\"value\",\"sum\"]":{"a":8,"b":7},"[\"value\",\"mean\"]":{"a":2.6666666667,"b":3.5},"[\"value\",\"count\"]":{"a":3,"b":2}}

{"[\"key\",\"\"]":{"0":"a","1":"b"},"[\"value\",\"sum\"]":{"0":8,"1":7},"[\"value\",\"mean\"]":{"0":2.6666666667,"1":3.5},"[\"value\",\"count\"]":{"0":3,"1":2}}

Output of `pd.show_versions()`

Python2.7

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: None
numpy: 1.14.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: 1.6.3
patsy: None
dateutil: 2.7.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Python3

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: None
numpy: 1.14.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

I have checked past issues briefly, but I'm sorry if same issue was already filed.

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2018-03-27T15:04:59Z

I'm not sure about the expected behavior here. How do you represent multiple-part keys in JSON?

k-yaegashi · 2018-03-28T01:37:48Z

I don't know what is the best solution that can satisfy everyone.

According to the JSON specification, including value other than string as object's key is impossible.

In order to avoid this problem in my work, I usually eliminate all tuples and manage to get string expression in some way, before writing JSON:

import json, pandas
def eliminate_column_tuple(df, glue):
    columns = df.columns
    new_columns = []
    for column in columns:
        if type(column) is tuple:
            joined = glue.join(column)
            if joined.endswith(glue):
                joined = joined[:-len(glue)]
            new_columns.append(joined)
        else:
            new_columns.append(column)
    df.columns = new_columns
    return df
test = pandas.DataFrame(data = {
    'key': ['a', 'a', 'b', 'b', 'a'],
    'value': [1, 2, 3, 4, 5]
})
stat = test.groupby('key').agg(['sum', 'mean', 'count'])
simple_stat = eliminate_column_tuple(stat, '|')
stat_json = simple_stat.to_json()
print(stat_json)
json.loads(stat_json)

However, such kind of solution strongly depends on the purpose.

If this problem won't be fixed, at least it is recommended to write into the API document that to_json() is useless for DataFrame with tuple columns.

TomAugspurger · 2018-03-28T01:56:28Z

I'd prefer to raise an exception (like Python does) rather than document that we produce invalid JSON.

…

On Tue, Mar 27, 2018 at 8:37 PM, k-yaegashi ***@***.***> wrote: I don't know what is the best solution that can satisfy everyone. According to the JSON specification <http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf>, including value other than string as object's key is impossible. In order to avoid this problem in my work, I usually eliminate all tuples and manage to get string expression in some way, before writing JSON: import json, pandasdef eliminate_column_tuple(df, glue): columns = df.columns new_columns = [] for column in columns: if type(column) is tuple: joined = glue.join(column) if joined.endswith(glue): joined = joined[:-len(glue)] new_columns.append(joined) else: new_columns.append(column) df.columns = new_columns return df test = pandas.DataFrame(data = { 'key': ['a', 'a', 'b', 'b', 'a'], 'value': [1, 2, 3, 4, 5] }) stat = test.groupby('key').agg(['sum', 'mean', 'count']) simple_stat = eliminate_column_tuple(stat, '|') stat_json = simple_stat.to_json()print(stat_json) json.loads(stat_json) However, such kind of solution strongly depends on the purpose. If this problem won't be fixed, at least it is recommended to write into the API document that to_json() is useless for DataFrame with tuple columns. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#20500 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIpLHwxdZaj8caZxw0CVbnVHqwkDcks5tiulxgaJpZM4S81Ul> .

jreback · 2018-03-30T19:55:12Z

yeah I agree, can put a check here that if an index is object dtype then

In [6]: infer_dtype = pd.api.types.infer_dtype

In [8]: infer_dtype(pd.Index([1,2,3]))
Out[8]: 'integer'

In [9]: infer_dtype(pd.Index(['foo', 'bar', 'baz']))
Out[9]: 'string'

In [10]: infer_dtype(pd.Index(['foo', 'bar', 1]))
Out[10]: 'mixed-integer'

In [11]: infer_dtype(pd.Index(['foo', 'bar', (1, 2)]))
Out[11]: 'mixed'

prob allow [10] but not [11]

TomAugspurger added IO Data IO issues that don't fit into a more specific label IO JSON read_json, to_json, json_normalize Difficulty Intermediate labels Mar 27, 2018

jreback added the Error Reporting Incorrect or improved errors from pandas label Mar 30, 2018

jreback added this to the Next Major Release milestone Mar 30, 2018

WillAyd mentioned this issue Aug 8, 2019

Remove Encoding of values in char** For Labels #27618

Merged

1 task

WillAyd closed this as completed in #27618 Aug 23, 2019

ohad83 mentioned this issue Nov 26, 2019

Series.to_json - invalid handling of quotes #29806

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataFrame.to_json() produces malformed JSON when DataFrame contains tuples as column #20500

DataFrame.to_json() produces malformed JSON when DataFrame contains tuples as column #20500

k-yaegashi commented Mar 27, 2018

TomAugspurger commented Mar 27, 2018

k-yaegashi commented Mar 28, 2018

TomAugspurger commented Mar 28, 2018 via email

jreback commented Mar 30, 2018

DataFrame.to_json() produces malformed JSON when DataFrame contains tuples as column #20500

DataFrame.to_json() produces malformed JSON when DataFrame contains tuples as column #20500

Comments

k-yaegashi commented Mar 27, 2018

Code Sample, a copy-pastable example if possible

Result by Python 2.7

Result by Python 3

Problem description

Expected Output

Output of pd.show_versions()

Python2.7

Python3

TomAugspurger commented Mar 27, 2018

k-yaegashi commented Mar 28, 2018

TomAugspurger commented Mar 28, 2018 via email

jreback commented Mar 30, 2018

Output of `pd.show_versions()`