
to_json of df with multi-index header, followed by read_json results in incorrectly formatted column headers #4889


Closed
jnorwood opened this issue Sep 19, 2013 · 21 comments
Labels
Enhancement IO JSON read_json, to_json, json_normalize

Comments

@jnorwood

cc @Komnomnomnom

I'm using a recent Anaconda build on Windows, which includes pandas v0.12.0. I'm attempting to build a dataframe with a multi-index header, write it out to a JSON file, read it back, and get the same dataframe. Instead, the column headers come back stacked vertically.

import numpy as np
import pandas as pd
pd.set_option('html', False)

index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
                              ['one', 'two', 'three']],
                      labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
                              [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
                      names=['foo', 'bar'])
df = pd.DataFrame(np.random.randn(10, 3), index=index,
                  columns=['A', 'B', 'C'])
dft = df.transpose()
dft
dft.to_json(r'D:\xxmt.json', orient='split')  # raw string: '\x' in a plain string is an escape
dftj = pd.read_json(r'D:\xxmt.json', orient='split')
dftj

@jnorwood
Author

I should also point out that the json output omits the names info that would be required to restore the df completely (names=['foo', 'bar'])

@jnorwood jnorwood reopened this Sep 19, 2013
@jnorwood
Author

ok, if this is going to be an enhancement request, let me make another suggestion...
The stored json data should include all the info necessary to restore the same dataframe ... so it shouldn't be necessary for the reader to specify the orient parameter when reading.

@jreback
Contributor

jreback commented Sep 19, 2013

@jnorwood so you are using these for 'storage'? JSON is not traditionally a reflexive serialization format. IOW, it takes a schema to ensure deserialization of a serialization will work. (as types can be subtly different, ordering is not guaranteed, etc.). CSV is like this too, SQL is partially like this. HDF5 can be reflexive as it stores the meta data.

@jnorwood
Author

The current JSON output does (almost) fully describe a dataframe's structure, including support for multi-index headers. The output with orient='split' already implements an implied schema based on the dataframe structure; it is just missing "names", and I think it could also add "orient". Then all that is left is for the reader to read it back and restore the info correctly. In general, if pandas is going to support multi-index headers fully, it needs to handle them correctly in whatever format you export to and import from. If you just state what the schema is, users can generate the proper format and still take advantage of the multi-index, which is a nice feature of the dataframe.

Here is a sample json output with orient='split'

{"columns":[["foo","one"],["foo","two"],["foo","three"],["bar","one"],["bar","two"],["baz","two"],["baz","three"],["qux","one"],["qux","two"],["qux","three"]],"index":["A","B","C"],"data":[[1.4363830373,-0.3037434653,-1.0501832047,0.3830191489,-0.6174135405,-1.2378511344,-0.6473196664,1.1436025073,0.2124401475,0.5732940926],[0.6103747676,-0.8948600162,0.5987307019,-0.3098951553,-1.659220121,-1.8047032839,1.006122262,-0.9470006317,-0.2870211491,-1.9018432636],[-0.6036721203,0.7321329545,-0.6730598856,0.0215133965,-0.422750222,-0.6237148299,-0.2726430101,-0.7192759505,-0.3399415151,1.6568583209]]}
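Since the payload above already carries per-column tuples, the columns can be rebuilt by hand after reading. A minimal sketch (the shortened payload and the re-supplied names=['foo', 'bar'] are illustrative, since the payload itself omits the names):

```python
import io
import json

import pandas as pd

payload = '{"columns":[["foo","one"],["foo","two"]],"index":["A"],"data":[[1.0,2.0]]}'
df = pd.read_json(io.StringIO(payload), orient='split')

# Depending on the pandas version, read_json may misread the nested lists,
# so rebuild the MultiIndex explicitly and re-attach the lost level names.
cols = json.loads(payload)['columns']
df.columns = pd.MultiIndex.from_tuples([tuple(c) for c in cols],
                                       names=['foo', 'bar'])
```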

@Komnomnomnom
Contributor

@jnorwood FYI to_json doesn't really support MultiIndex or higher dim containers like Panel and NDPanel right now. I'm hoping to get a PR in for 0.13 to improve the JSON docs.

Also I don't think any metadata (like the orient) should be included for aiding deserialisation. I'm thinking of the case where you are deserialising outside of pandas, it just causes confusion.

@jnorwood
Author

I think there is more confusion without including the orient option. If someone has generated a json document for you to read in pandas, currently you'd have to guess what orient option was originally used during generation so you can enter the correct reader options. If the writer includes the orient option in the json attributes, then the reader can simply retrieve it as it does the other attributes. If some reader doesn't know what the attribute means it can just ignore it, right?

It isn't clear to me what other reader you are trying to support. I'd guess the most common use case would be pandas generating the json and then pandas reading it back. My particular use case is I'd like to generate a json format file or string, and have pandas read and display it as a multi-index table in some cases, or as a single index table in other cases. Of course we could invent another layer of communication to tell pandas what type of table orientation the json file contains, but it seems more natural to just put the info in the json file.

Really, I think you should try to consistently support all the features that pandas supports with all the readers and writers, and do it in a straightforward way that is documented. Someone has apparently made an attempt to support multi-index columns in the json writer, and it seems like it is a good start. It would be a good feature to support a simple json format that other apps could generate and so enable easy description of a multi-index table. It is a lot simpler solution than using hdf5 generation. I can see the potential importance of both options.

@jtratner
Contributor

One point implied by @jnorwood stands out: what's the use case for to_json? I'd assume it's less space-efficient than other formats.

@Komnomnomnom
Contributor

Well my main use case is (well, was more than is) interoperability. Specifically with a web frontend, deserialisation in JS, for plotting etc.

@jreback
Contributor

jreback commented Feb 14, 2014

@Komnomnomnom progress on this?

@Komnomnomnom
Contributor

@jreback as of right now my priorities for pandas JSON are MultiIndex support and Panel (+ higher dim) support. No work done in this direction yet, but I think 0.14 is a good target to aim for.

So this issue to track MultiIndex support and #5428 for tracking Panel support?

(As stated above I'm not a fan of the idea of adding extra metadata onto the JSON-encoded pandas object. I'd favour a 'clean' representation of the data so it's easily portable. I'd welcome others' opinions though.)

@jreback
Contributor

jreback commented Feb 17, 2014

@Komnomnomnom sounds good on the priorities

I also agree with keeping clean json

I think this issue concerns some sort of auto-sniffing of the orientation;
not sure how hard that would be (and it would probably be ambiguous in some cases).

Maybe accept orient='infer' at some point?

@Komnomnomnom
Contributor

I like the idea of orient='infer' on decoding, although yes, it would be ambiguous for the index and columns formats.

In [24]: df
Out[24]: 
              Open    High     Low   Close   Volume  Adj Close
Date                                                          
2010-01-04  626.95  629.51  624.24  626.75  1956200     626.75
2010-01-05  627.18  627.84  621.54  623.99  3004700     623.99
2010-01-06  625.86  625.86  606.36  608.26  3978700     608.26
2010-01-07  609.40  610.00  592.65  594.10  6414300     594.10
2010-01-08  592.00  603.25  589.11  602.02  4724300     602.02
2010-01-11  604.46  604.46  594.04  601.11  7212900     601.11
2010-01-12  597.65  598.16  588.00  590.48  4853300     590.48
2010-01-13  576.49  588.38  573.90  587.09  6496600     587.09
2010-01-14  583.90  594.20  582.81  589.85  4240100     589.85
2010-01-15  593.34  593.56  578.04  580.00  5434500     580.00

[10 rows x 6 columns]

In [19]: df.to_json(orient='columns')
Out[19]: '{"Open":{"1262563200000":626.95,"1262649600000":627.18,"1262736000000":625.86,"1262822400000":609.4,"1262908800000":592.0,"1263168000000":604.46,"1263254400000":597.65,"1263340800000":576.49,"1263427200000":583.9,"1263513600000":593.34},"High":{"1262563200000":629.51,"1262649600000":627.84,"1262736000000":625.86,"1262822400000":610.0,"1262908800000":603.25,"1263168000000":604.46,"1263254400000":598.16,"1263340800000":588.38,"1263427200000":594.2,"1263513600000":593.56},"Low":{"1262563200000":624.24,"1262649600000":621.54,"1262736000000":606.36,"1262822400000":592.65,"1262908800000":589.11,"1263168000000":594.04,"1263254400000":588.0,"1263340800000":573.9,"1263427200000":582.81,"1263513600000":578.04},"Close":{"1262563200000":626.75,"1262649600000":623.99,"1262736000000":608.26,"1262822400000":594.1,"1262908800000":602.02,"1263168000000":601.11,"1263254400000":590.48,"1263340800000":587.09,"1263427200000":589.85,"1263513600000":580.0},"Volume":{"1262563200000":1956200.0,"1262649600000":3004700.0,"1262736000000":3978700.0,"1262822400000":6414300.0,"1262908800000":4724300.0,"1263168000000":7212900.0,"1263254400000":4853300.0,"1263340800000":6496600.0,"1263427200000":4240100.0,"1263513600000":5434500.0},"Adj Close":{"1262563200000":626.75,"1262649600000":623.99,"1262736000000":608.26,"1262822400000":594.1,"1262908800000":602.02,"1263168000000":601.11,"1263254400000":590.48,"1263340800000":587.09,"1263427200000":589.85,"1263513600000":580.0}}'

In [22]: df.to_json(orient='index')
Out[22]: '{"1262563200000":{"Open":626.95,"High":629.51,"Low":624.24,"Close":626.75,"Volume":1956200.0,"Adj Close":626.75},"1262649600000":{"Open":627.18,"High":627.84,"Low":621.54,"Close":623.99,"Volume":3004700.0,"Adj Close":623.99},"1262736000000":{"Open":625.86,"High":625.86,"Low":606.36,"Close":608.26,"Volume":3978700.0,"Adj Close":608.26},"1262822400000":{"Open":609.4,"High":610.0,"Low":592.65,"Close":594.1,"Volume":6414300.0,"Adj Close":594.1},"1262908800000":{"Open":592.0,"High":603.25,"Low":589.11,"Close":602.02,"Volume":4724300.0,"Adj Close":602.02},"1263168000000":{"Open":604.46,"High":604.46,"Low":594.04,"Close":601.11,"Volume":7212900.0,"Adj Close":601.11},"1263254400000":{"Open":597.65,"High":598.16,"Low":588.0,"Close":590.48,"Volume":4853300.0,"Adj Close":590.48},"1263340800000":{"Open":576.49,"High":588.38,"Low":573.9,"Close":587.09,"Volume":6496600.0,"Adj Close":587.09},"1263427200000":{"Open":583.9,"High":594.2,"Low":582.81,"Close":589.85,"Volume":4240100.0,"Adj Close":589.85},"1263513600000":{"Open":593.34,"High":593.56,"Low":578.04,"Close":580.0,"Volume":5434500.0,"Adj Close":580.0}}'

In [20]: df.to_json(orient='split')
Out[20]: '{"columns":["Open","High","Low","Close","Volume","Adj Close"],"index":[1262563200000,1262649600000,1262736000000,1262822400000,1262908800000,1263168000000,1263254400000,1263340800000,1263427200000,1263513600000],"data":[[626.95,629.51,624.24,626.75,1956200.0,626.75],[627.18,627.84,621.54,623.99,3004700.0,623.99],[625.86,625.86,606.36,608.26,3978700.0,608.26],[609.4,610.0,592.65,594.1,6414300.0,594.1],[592.0,603.25,589.11,602.02,4724300.0,602.02],[604.46,604.46,594.04,601.11,7212900.0,601.11],[597.65,598.16,588.0,590.48,4853300.0,590.48],[576.49,588.38,573.9,587.09,6496600.0,587.09],[583.9,594.2,582.81,589.85,4240100.0,589.85],[593.34,593.56,578.04,580.0,5434500.0,580.0]]}'

In [21]: df.to_json(orient='records')
Out[21]: '[{"Open":626.95,"High":629.51,"Low":624.24,"Close":626.75,"Volume":1956200.0,"Adj Close":626.75},{"Open":627.18,"High":627.84,"Low":621.54,"Close":623.99,"Volume":3004700.0,"Adj Close":623.99},{"Open":625.86,"High":625.86,"Low":606.36,"Close":608.26,"Volume":3978700.0,"Adj Close":608.26},{"Open":609.4,"High":610.0,"Low":592.65,"Close":594.1,"Volume":6414300.0,"Adj Close":594.1},{"Open":592.0,"High":603.25,"Low":589.11,"Close":602.02,"Volume":4724300.0,"Adj Close":602.02},{"Open":604.46,"High":604.46,"Low":594.04,"Close":601.11,"Volume":7212900.0,"Adj Close":601.11},{"Open":597.65,"High":598.16,"Low":588.0,"Close":590.48,"Volume":4853300.0,"Adj Close":590.48},{"Open":576.49,"High":588.38,"Low":573.9,"Close":587.09,"Volume":6496600.0,"Adj Close":587.09},{"Open":583.9,"High":594.2,"Low":582.81,"Close":589.85,"Volume":4240100.0,"Adj Close":589.85},{"Open":593.34,"High":593.56,"Low":578.04,"Close":580.0,"Volume":5434500.0,"Adj Close":580.0}]'


In [23]: df.to_json(orient='values')
Out[23]: '[[626.95,629.51,624.24,626.75,1956200.0,626.75],[627.18,627.84,621.54,623.99,3004700.0,623.99],[625.86,625.86,606.36,608.26,3978700.0,608.26],[609.4,610.0,592.65,594.1,6414300.0,594.1],[592.0,603.25,589.11,602.02,4724300.0,602.02],[604.46,604.46,594.04,601.11,7212900.0,601.11],[597.65,598.16,588.0,590.48,4853300.0,590.48],[576.49,588.38,573.9,587.09,6496600.0,587.09],[583.9,594.2,582.81,589.85,4240100.0,589.85],[593.34,593.56,578.04,580.0,5434500.0,580.0]]'

Getting index and columns confused would mean a transpose of the data, so either raise if one of those orients were inferred, or do some fancy guessing (date data is usually on the index axis, the index is usually larger than the columns, etc.). Unfortunately index is the default orientation.
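The structural part of that sniffing could look roughly like this (guess_orient is not a pandas API, just a hypothetical sketch; as noted above, the index and columns payloads are indistinguishable):

```python
import json

def guess_orient(payload):
    """Best-effort guess at the orient used by to_json (hypothetical sketch)."""
    obj = json.loads(payload)
    if isinstance(obj, list):
        # 'records' is a list of row objects, 'values' a list of row arrays
        return 'records' if obj and isinstance(obj[0], dict) else 'values'
    if {'columns', 'index', 'data'} <= set(obj):
        return 'split'
    if {'schema', 'data'} <= set(obj):
        return 'table'
    # a dict of dicts could be either orientation; no way to tell them apart
    return 'index or columns (ambiguous)'
```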

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 9, 2014
@jreback
Contributor

jreback commented Mar 31, 2014

@watsonix

watsonix commented Apr 10, 2017

Last update to this was 2 years ago; I'm assuming it's still open? I have a multi-index situation when I do:
query_df = pd.read_sql_query(sql_text, db_conn, index_col=['user_id', 'mobile_time'])
and then output via:
records_json = query_df.to_json(orient='split')

but then
pd.read_json(records_json, orient='split')
doesn't work

@jreback
Contributor

jreback commented Apr 11, 2017

If you want a more round-trippable JSON format, this is in the upcoming 0.20.0: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#whatsnew-0200-enhancements-table-schema

@datapythonista datapythonista removed this from the Contributions Welcome milestone Jul 8, 2018
@datapythonista datapythonista added this to the Someday milestone Jul 8, 2018
@people-can-fly

Or you can just write JSON with orient='table':

df.to_json(path_or_buf='test.json', orient='table')

and read the multi-index JSON back:

pd.read_json('test.json', orient='table')
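For a row MultiIndex (like the groupby and index_col cases above), the table orient does round-trip level names and dtypes. A quick sketch with made-up column names, writing to a string via StringIO instead of a file:

```python
import io

import pandas as pd

# toy frame with a two-level row MultiIndex (column names are illustrative)
df = (pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 4], 'v': [10, 20, 30]})
        .set_index(['a', 'b']))

serialized = df.to_json(orient='table')          # embeds a Table Schema
restored = pd.read_json(io.StringIO(serialized), orient='table')

# the MultiIndex, its level names, and the dtypes survive the round trip
assert isinstance(restored.index, pd.MultiIndex)
assert restored.equals(df)
```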

@eddy-geek

Here is a fancy hack if you have a good reason not to use orient='table' -- in my case it is 3x bigger than orient='split'.

Notes:

  • index level names are lost by orient='split', even with this
  • it will probably break as soon as you use non-basic types (think: dates).

import ast

import pandas as pd

def to_json_split_roundtrip(df):
    # flatten the MultiIndex into repr strings like "('1', 2)" before writing
    if isinstance(df.index, pd.MultiIndex):
        values = [repr(v) for v in df.index.ravel()]
        df = df.set_index(pd.Index(values))
    return df.to_json(orient="split")

def from_json_split_roundtrip(j):
    df = pd.read_json(j, orient="split")
    # tuple-repr strings start with '(': safely parse them back into tuples
    if df.index[0].startswith('('):
        df.index = pd.MultiIndex.from_tuples(ast.literal_eval(t) for t in df.index)
    return df

Example use

df = pd.DataFrame([['1', 2, 3, 4], ['1', 4, 5, 6], ['7', 4, 9, 10]],
                  columns=["a", "b", "c", "d"]).groupby(["a", "b"]).sum()

display(df)
print(df.index)

as_json = to_json_split_roundtrip(df)
# send it around
r = from_json_split_roundtrip(as_json)
display(r)
print(r.index)

[screenshot of the round-tripped DataFrame and its restored MultiIndex]

So, a built-in solution (and/or a more efficient orient='table') is still sorely needed.

LucasG0 added a commit to LucasG0/pandas that referenced this issue May 29, 2020
@Helgeb

Helgeb commented Jun 17, 2020

There seems to be a mix-up in how read_json handles the multi-index columns written by to_json. As a workaround you can manually rearrange the multi-index column order in the JSON data:

>>> df = pd.DataFrame({'a':[1,2],'b':[3,4]})
>>> df.columns = pd.MultiIndex.from_tuples([('A','B',),('C','D',)])
>>> print(df)
   A  C
   B  D
0  1  3
1  2  4
>>> json_string = df.to_json(orient='split')
>>> print(json_string)
{"columns":[["A","B"],["C","D"]],"index":[0,1],"data":[[1,3],[2,4]]}
>>> wrong_df = pd.read_json(json_string,orient='split')
>>> print(wrong_df)
   A  B
   C  D
0  1  3
1  2  4
>>> import json
>>> correct_columns = pd.MultiIndex.from_tuples([tuple(c) for c in json.loads(json_string)['columns']])
>>> print(correct_columns)
MultiIndex(levels=[['A', 'C'], ['B', 'D']],
           labels=[[0, 1], [0, 1]])
>>> json_dict = json.loads(json_string)
>>> json_dict['columns'] = list(range(len(json_dict['columns'])))

>>> correct_df = pd.read_json(json.dumps(json_dict),orient='split')
>>> correct_df.columns = correct_columns
>>> print(correct_df)
   A  C
   B  D
0  1  3
1  2  4

@LarsHill

The above solution from @Helgeb correctly decodes multi-index dataframes but seems to fail in handling normal single-index dataframes correctly.

I propose a different workaround to correctly decode multi-index dataframes from serialized JSON strings. Namely, ast.literal_eval allows safely evaluating tuple strings into actual tuples, which leads to correctly decoded dataframes, since pd.DataFrame parses dict objects with tuple keys correctly (as multi-index columns).

import ast
import json

import numpy as np
import pandas as pd


def create_dataframe_from_json(df: str) -> pd.DataFrame:
    """ Decode dataframe from serialized json string.
    Safely evaluates multi-index keys (tuple strings) as tuples before converting dict to dataframe.
    """

    df = json.loads(df)

    parsed_df = {}
    for k, v in df.items():
        try:
            parsed_key = ast.literal_eval(k)
            if not isinstance(parsed_key, tuple):
                parsed_key = k
        except (ValueError, SyntaxError):
            parsed_key = k
        parsed_df[parsed_key] = v

    return pd.DataFrame(parsed_df)


def main():

    inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
    df1 = pd.DataFrame(inp)

    df2 = pd.DataFrame(np.random.random((4, 4)))
    df2.columns = pd.MultiIndex.from_product([[1, 2], ['A', 'B']])

    df3 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    df3.columns = pd.MultiIndex.from_tuples([('A', 1,), ('C', 'D',)])

    for df in [df1, df2, df3]:
        # original dataframe
        print(df)

        # serialize to json
        json_str = df.to_json()

        # wrong decoding for multi-index dataframes
        wrong_df = pd.read_json(json_str)
        print(wrong_df)

        # correct decoding from json (also for multi-index dataframes)
        correct_df = create_dataframe_from_json(json_str)
        print(correct_df)


if __name__ == '__main__':
    main()

@jocelyne8

jocelyne8 commented Sep 23, 2021

I'm not sure what the status is on this issue (I'm stuck on an older version of pandas here), but in case someone else runs into the same issue, I ended up using a variant of @Helgeb's solution, which converts both index and columns, and supports both MultiIndex and Index.
This was very minimally tested and does very minimal input checking. Notably:

  • if you have both tuples and non-tuples in your index list, you'll end up with a single-level index containing tuples.
  • if you have tuples of multiple lengths in your index list, it will probably die? (untested)
  • if either index list is empty, it will probably die? (untested)
  • if you didn't serialize with orient="split", it will also probably die? (also untested)

import json

import pandas as pd

def deserialize_multiindex_dataframe(dataframe_json: str) -> pd.DataFrame:
    """Deserialize the dataframe json into a dataframe object.
    The dataframe json must be generated with DataFrame.to_json(orient="split")
    This function is to address the fact that `pd.read_json()` isn't behaving correctly (yet)
    https://github.com/pandas-dev/pandas/issues/4889
    """
    def convert_index(json_obj):
        to_tuples = [tuple(i) if isinstance(i, list) else i for i in json_obj]
        if all(isinstance(i, list) for i in json_obj):
            return pd.MultiIndex.from_tuples(to_tuples)
        else:
            return pd.Index(to_tuples)
    json_dict = json.loads(dataframe_json)
    columns = convert_index(json_dict['columns'])
    index = convert_index(json_dict['index'])
    dataframe = pd.DataFrame(json_dict["data"], index, columns)
    return dataframe

@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
@mroeschke
Member

I think the orient="table" schema is the way forward here, as it is not really possible to represent a MultiIndex with the other orients, so I think we can close. A new issue can be opened if something else is found.
