Series.to_json produces invalid JSON with orient="index" if index is multilevel #31028

Closed
dargueta opened this issue Jan 15, 2020 · 9 comments
Labels
good first issue IO JSON read_json, to_json, json_normalize Needs Info Clarification about behavior needed to assess issue Needs Tests Unit test(s) needed to prevent regressions

@dargueta
Contributor

Code Sample, a copy-pastable example if possible

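import pandas as pd

# Stacking the frame yields a Series with a two-level (row, column) MultiIndex.
dataframe = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']})
series = dataframe.stack()
print(series.to_json(orient="index"))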

The keys in the output have unescaped quotes in them, resulting in invalid JSON. Output here is indented for clarity:

{
    "[0,"x"]": 1,
    "[0,"y"]": "a",
    "[1,"x"]": 2,
    "[1,"y"]": "b",
    "[2,"x"]": 3,
    "[2,"y"]": "c"
}

Problem description

to_json() should always either 1) produce valid JSON, or 2) throw an exception if it cannot.
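A minimal sketch of how the validity claim can be checked with only the standard library. The two strings are shortened copies of the reported and expected outputs from this issue; `is_valid_json` is a hypothetical helper name, not part of pandas.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Shape reported for pandas 0.25.3: quotes inside the keys are not escaped.
reported = r'{"[0,"x"]":1,"[0,"y"]":"a"}'
# Same data with the inner quotes escaped, as requested in this issue.
expected = r'{"[0,\"x\"]":1,"[0,\"y\"]":"a"}'

print(is_valid_json(reported))
print(is_valid_json(expected))
```

The first string fails because the parser sees the key end at the first inner quote and then finds `x` where it expects a colon.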

Expected Output

The multilevel index in the output should be properly escaped to produce valid JSON, like so:

{
    "[0,\"x\"]": 1,
    "[0,\"y\"]": "a",
    "[1,\"x\"]": 2,
    "[1,\"y\"]": "b",
    "[2,\"x\"]": 3,
    "[2,\"y\"]": "c"
}
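For illustration, the escaped form above is what a generic JSON encoder produces when each index tuple is first serialized as a compact JSON array and then used as an object key. This is a sketch using the standard `json` module, not the pandas implementation; the `stacked` dict is a hypothetical stand-in for the stacked Series.

```python
import json

# Hypothetical stand-in for the stacked Series: (row, column) -> value.
stacked = {(0, "x"): 1, (0, "y"): "a", (1, "x"): 2, (1, "y"): "b"}

# Serialize each tuple key as a compact JSON array; json.dumps then escapes
# the inner quotes when that string is written out as an object key.
flat = {
    json.dumps(list(key), separators=(",", ":")): value
    for key, value in stacked.items()
}
encoded = json.dumps(flat, indent=4)
print(encoded)

# Round trip: each key is itself valid JSON, so any JSON parser can split it.
decoded = json.loads(encoded)
restored = {tuple(json.loads(key)): value for key, value in decoded.items()}
```

Because every key is a JSON array, non-Python consumers can recover the index levels without any Python-specific parsing.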

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.0
pip : 19.3.1
setuptools : 45.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : 0.4.0
scipy : None
sqlalchemy : 1.3.12
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@MarcoGorelli
Member

MarcoGorelli commented Jan 15, 2020

Thanks for the report :)

I could not reproduce this on master, I got the output

{
    "(0, 'x')": 1,
    "(0, 'y')": "a",
    "(1, 'x')": 2,
    "(1, 'y')": "b",
    "(2, 'x')": 3,
    "(2, 'y')": "c",
}

so it looks like this has been fixed already and will behave correctly in the next release.

@MarcoGorelli MarcoGorelli added the IO JSON read_json, to_json, json_normalize label Jan 15, 2020
@WillAyd
Member

WillAyd commented Jan 15, 2020

This was most likely fixed by #27618 so be sure to try that out in 1.0

With that said, I don't think there is an explicit test for a MultiIndex, so we would welcome a PR to add one.
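A rough sketch of what such a regression test might look like, assuming pandas 1.0 or later. The test name and data are hypothetical, and this is not necessarily the test that was eventually merged. It only asserts that the output parses and preserves the values, without pinning the exact key format.

```python
import json

import pandas as pd


def test_series_to_json_multiindex_is_valid_json():
    # Build the two-level index directly instead of going through stack().
    index = pd.MultiIndex.from_tuples(
        [(0, "x"), (0, "y"), (1, "x"), (1, "y")]
    )
    series = pd.Series([1, "a", 2, "b"], index=index)

    result = series.to_json(orient="index")

    # json.loads raises json.JSONDecodeError if the output is malformed.
    parsed = json.loads(result)
    assert len(parsed) == 4
    assert list(parsed.values()) == [1, "a", 2, "b"]
```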

@WillAyd WillAyd added Needs Tests Unit test(s) needed to prevent regressions good first issue labels Jan 15, 2020
@WillAyd WillAyd added this to the Contributions Welcome milestone Jan 15, 2020
@dargueta
Contributor Author

dargueta commented Jan 15, 2020

> I could not reproduce this on master

Ah -- I was expecting the keys to remain valid JSON so my deserialization code still blew up. I assumed that was because the JSON was still malformed.

@dargueta
Contributor Author

dargueta commented Jan 15, 2020

My one gripe with the fix is that the keys themselves are no longer valid JSON, as they were in previous releases (with the above exception). This makes it difficult to reconstruct the data into a nested format, e.g.

{
    0: {
        'x': 1,
        'y': 'a'
    },
    1: {
        'x': 2,
        'y': 'b',
    },
    # ...
}

This is particularly problematic when communicating with non-Python code. Parsing the keys requires ast.literal_eval(), so the transformation has to happen on the Python side, and the transformed result is what gets transmitted instead.
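A sketch of the Python-side transformation described here, using only the standard library. The `payload` string is a shortened copy of the output shape reported for master, and `nest` is a hypothetical helper name.

```python
import ast
import json

# Shortened copy of the master output: keys are Python tuple reprs.
payload = '{"(0, \'x\')": 1, "(0, \'y\')": "a", "(1, \'x\')": 2, "(1, \'y\')": "b"}'


def nest(flat):
    """Turn {"(row, 'col')": value} into {row: {col: value}}."""
    nested = {}
    for key, value in flat.items():
        # The key is a Python literal, not JSON, hence ast.literal_eval.
        row, column = ast.literal_eval(key)
        nested.setdefault(row, {})[column] = value
    return nested


nested = nest(json.loads(payload))
# Re-serialize the nested form for consumers that cannot parse tuple reprs.
print(json.dumps(nested))
```

Note that `json.dumps` converts the integer row labels to string keys, since JSON object keys are always strings.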

@WillAyd
Member

WillAyd commented Jan 15, 2020

@dargueta I'm not clear on which version of pandas you are referring to, what output you are getting, and what you expect. If you can clarify, that would be helpful.

@MarcoGorelli
Member

> as in previous releases

Could you please let us know which release gives you your expected output?

@MarcoGorelli MarcoGorelli added the Needs Info Clarification about behavior needed to assess issue label Jan 20, 2020
@jeandersonbc
Contributor

From what I understood, having the following output should not be allowed

{
  "[0,"x"]":1,
  "[0,"y"]":"a",
  "[1,"x"]":2,
  "[1,"y"]":"b",
  "[2,"x"]":3,
  "[2,"y"]":"c"
}

I tried the example on 0.25.3 and indeed this is the output I got.
I didn't try running with the master version, but as I understood from the example shown by @MarcoGorelli, the issue got fixed.

@dargueta
Contributor Author

dargueta commented Jan 21, 2020

> Could you please let us know which release gives you your expected output?

0.25.3 does, as long as the keys aren't strings. That is kind of a stupid restriction, but for CSVs with no column headers it's what I end up having to work with.

(By the way I also tested this on 0.24.2 and got the same result. I'm using Python 3.7 for both tests.)
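To illustrate the non-string case: when every index level is an integer, each key is itself a JSON array and can be split with any JSON parser. The `payload` below is an assumed shape, extrapolated from the keys shown earlier in this issue rather than captured from a real 0.25.3 run.

```python
import json

# Assumed 0.25.3-style output for integer row and column labels
# (extrapolated from this issue, not captured from a real run).
payload = '{"[0,0]": 1, "[0,1]": "a", "[1,0]": 2, "[1,1]": "b"}'

nested = {}
for key, value in json.loads(payload).items():
    # Each key parses as a two-element JSON array: [row, column].
    row, column = json.loads(key)
    nested.setdefault(row, {})[column] = value

print(nested)
```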

@WillAyd
Member

WillAyd commented Jan 25, 2020

Closed by #31307

@WillAyd WillAyd closed this as completed Jan 25, 2020

4 participants