to_json of df with multi-index header, followed by read_json results in incorrectly formatted column headers #4889
Comments
I should also point out that the json output omits the names info that would be required to restore the df completely (names=['foo', 'bar']) |
ok, if this is going to be an enhancement request, let me make another suggestion... |
@jnorwood so you are using these for 'storage'? JSON is not traditionally a reflexive serialization format. IOW, it takes a schema to ensure deserialization of a serialization will work. (as types can be subtly different, ordering is not guaranteed, etc.). CSV is like this too, SQL is partially like this. HDF5 can be reflexive as it stores the meta data. |
The current JSON output does (almost) have a description of a dataframe's structure, including support for multi-index headers. The current output with orient='split' already implements an implied schema based on the dataframe structure. It is missing "names", and I think it could also add "orient". Then all that is left is for the reader to read it back and restore the info correctly.

In general, if pandas is going to support multi-index headers fully, then I think it needs to provide a way to handle them correctly in whatever format you export to and import from. If you just state what the schema is, then users can generate the proper format and still take advantage of the multi-index, which is a nice feature of the dataframe.

Here is a sample json output with orient='split':

```json
{"columns":[["foo","one"],["foo","two"],["foo","three"],["bar","one"],["bar","two"],["baz","two"],["baz","three"],["qux","one"],["qux","two"],["qux","three"]],"index":["A","B","C"],"data":[[1.4363830373,-0.3037434653,-1.0501832047,0.3830191489,-0.6174135405,-1.2378511344,-0.6473196664,1.1436025073,0.2124401475,0.5732940926],[0.6103747676,-0.8948600162,0.5987307019,-0.3098951553,-1.659220121,-1.8047032839,1.006122262,-0.9470006317,-0.2870211491,-1.9018432636],[-0.6036721203,0.7321329545,-0.6730598856,0.0215133965,-0.422750222,-0.6237148299,-0.2726430101,-0.7192759505,-0.3399415151,1.6568583209]]}
```
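The missing-"names" claim is easy to verify. The sketch below (exact behavior may vary slightly across pandas versions) builds a frame with named MultiIndex columns and shows that the orient='split' payload carries only "columns", "index", and "data":

```python
import pandas as pd

# A frame with named MultiIndex columns, similar to the one above
columns = pd.MultiIndex.from_tuples(
    [("foo", "one"), ("foo", "two"), ("bar", "one")],
    names=["foo", "bar"],
)
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=["A", "B"], columns=columns)

s = df.to_json(orient="split")
# The split payload contains columns/index/data only; the level
# names ['foo', 'bar'] appear nowhere in the output.
print('"columns"' in s, "names" in s)
```

So a reader has no way to restore `df.columns.names` from this output alone, which is the gap being discussed.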
@jnorwood FYI, I also don't think any metadata (like the orient) should be included to aid deserialisation. I'm thinking of the case where you are deserialising outside of pandas; it just causes confusion. |
I think there is more confusion without including the orient option. If someone has generated a json document for you to read in pandas, currently you'd have to guess what orient option was originally used during generation so you can enter the correct reader options. If the writer includes the orient option in the json attributes, then the reader can simply retrieve it as it does the other attributes. If some reader doesn't know what the attribute means it can just ignore it, right?

It isn't clear to me what other reader you are trying to support. I'd guess the most common use case would be pandas generating the json and then pandas reading it back. My particular use case is I'd like to generate a json format file or string, and have pandas read and display it as a multi-index table in some cases, or as a single-index table in other cases. Of course we could invent another layer of communication to tell pandas what type of table orientation the json file contains, but it seems more natural to just put the info in the json file.

Really, I think you should try to consistently support all the features that pandas supports with all the readers and writers, and do it in a straightforward, documented way. Someone has apparently made an attempt to support multi-index columns in the json writer, and it seems like a good start. It would be a good feature to support a simple json format that other apps could generate and so enable easy description of a multi-index table. It is a lot simpler than using hdf5 generation. I can see the potential importance of both options. |
One point implied by @jnorwood stands out: what's the use-case for to_json? I'd assume it's less space-efficient than other formats. |
Well my main use case is (well, was more than is) interoperability. Specifically with a web frontend, deserialisation in JS, for plotting etc. |
@Komnomnomnom progress on this? |
@jreback as of right now my priorities for pandasjson are MultiIndex support and Panel (+ higher dim) support. No work done in this direction yet but I think 0.14 is a good target to aim for. So this issue to track MultiIndex support and #5428 for tracking Panel support? (As stated above I'm not a fan of the idea of adding extra metadata onto the JSON encoded Pandas object. I'd favour a 'clean' representation of the data so it's easily portable. I'd welcome other's opinions though.) |
@Komnomnomnom sounds good on the priorities. I also agree with keeping clean json. I think this issue concerns some sort of auto-sniffing of the orientation; maybe accept orient='infer' at some point? |
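For what it's worth, a rough shape-based sniffer is possible even without embedding metadata in the JSON. The sketch below is a hypothetical helper (`sniff_orient` is not a pandas API), and note that 'columns' and 'index' output cannot be distinguished from shape alone:

```python
import json


def sniff_orient(s: str) -> str:
    """Heuristically guess the orient used to produce a pandas JSON string."""
    obj = json.loads(s)
    if isinstance(obj, list):
        # 'records' is a list of row dicts; 'values' is a list of rows
        if obj and isinstance(obj[0], dict):
            return "records"
        return "values"
    if isinstance(obj, dict):
        # 'split' has its three well-known top-level keys
        if {"columns", "index", "data"} <= obj.keys():
            return "split"
        # A dict of dicts: ambiguous between 'columns' and 'index'
        return "columns"
    raise ValueError("unrecognized JSON structure")
```

This is exactly the ambiguity that makes an 'infer' option (or an embedded "orient" attribute) attractive: shape alone cannot tell a column-oriented dict from an index-oriented one.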
I like the idea of inferring the orientation. For reference, here is what each orient currently produces:

In [24]: df
Out[24]:
Open High Low Close Volume Adj Close
Date
2010-01-04 626.95 629.51 624.24 626.75 1956200 626.75
2010-01-05 627.18 627.84 621.54 623.99 3004700 623.99
2010-01-06 625.86 625.86 606.36 608.26 3978700 608.26
2010-01-07 609.40 610.00 592.65 594.10 6414300 594.10
2010-01-08 592.00 603.25 589.11 602.02 4724300 602.02
2010-01-11 604.46 604.46 594.04 601.11 7212900 601.11
2010-01-12 597.65 598.16 588.00 590.48 4853300 590.48
2010-01-13 576.49 588.38 573.90 587.09 6496600 587.09
2010-01-14 583.90 594.20 582.81 589.85 4240100 589.85
2010-01-15 593.34 593.56 578.04 580.00 5434500 580.00
[10 rows x 6 columns]
In [19]: df.to_json(orient='columns')
Out[19]: '{"Open":{"1262563200000":626.95,"1262649600000":627.18,"1262736000000":625.86,"1262822400000":609.4,"1262908800000":592.0,"1263168000000":604.46,"1263254400000":597.65,"1263340800000":576.49,"1263427200000":583.9,"1263513600000":593.34},"High":{"1262563200000":629.51,"1262649600000":627.84,"1262736000000":625.86,"1262822400000":610.0,"1262908800000":603.25,"1263168000000":604.46,"1263254400000":598.16,"1263340800000":588.38,"1263427200000":594.2,"1263513600000":593.56},"Low":{"1262563200000":624.24,"1262649600000":621.54,"1262736000000":606.36,"1262822400000":592.65,"1262908800000":589.11,"1263168000000":594.04,"1263254400000":588.0,"1263340800000":573.9,"1263427200000":582.81,"1263513600000":578.04},"Close":{"1262563200000":626.75,"1262649600000":623.99,"1262736000000":608.26,"1262822400000":594.1,"1262908800000":602.02,"1263168000000":601.11,"1263254400000":590.48,"1263340800000":587.09,"1263427200000":589.85,"1263513600000":580.0},"Volume":{"1262563200000":1956200.0,"1262649600000":3004700.0,"1262736000000":3978700.0,"1262822400000":6414300.0,"1262908800000":4724300.0,"1263168000000":7212900.0,"1263254400000":4853300.0,"1263340800000":6496600.0,"1263427200000":4240100.0,"1263513600000":5434500.0},"Adj Close":{"1262563200000":626.75,"1262649600000":623.99,"1262736000000":608.26,"1262822400000":594.1,"1262908800000":602.02,"1263168000000":601.11,"1263254400000":590.48,"1263340800000":587.09,"1263427200000":589.85,"1263513600000":580.0}}'
In [22]: df.to_json(orient='index')
Out[22]: '{"1262563200000":{"Open":626.95,"High":629.51,"Low":624.24,"Close":626.75,"Volume":1956200.0,"Adj Close":626.75},"1262649600000":{"Open":627.18,"High":627.84,"Low":621.54,"Close":623.99,"Volume":3004700.0,"Adj Close":623.99},"1262736000000":{"Open":625.86,"High":625.86,"Low":606.36,"Close":608.26,"Volume":3978700.0,"Adj Close":608.26},"1262822400000":{"Open":609.4,"High":610.0,"Low":592.65,"Close":594.1,"Volume":6414300.0,"Adj Close":594.1},"1262908800000":{"Open":592.0,"High":603.25,"Low":589.11,"Close":602.02,"Volume":4724300.0,"Adj Close":602.02},"1263168000000":{"Open":604.46,"High":604.46,"Low":594.04,"Close":601.11,"Volume":7212900.0,"Adj Close":601.11},"1263254400000":{"Open":597.65,"High":598.16,"Low":588.0,"Close":590.48,"Volume":4853300.0,"Adj Close":590.48},"1263340800000":{"Open":576.49,"High":588.38,"Low":573.9,"Close":587.09,"Volume":6496600.0,"Adj Close":587.09},"1263427200000":{"Open":583.9,"High":594.2,"Low":582.81,"Close":589.85,"Volume":4240100.0,"Adj Close":589.85},"1263513600000":{"Open":593.34,"High":593.56,"Low":578.04,"Close":580.0,"Volume":5434500.0,"Adj Close":580.0}}'
In [20]: df.to_json(orient='split')
Out[20]: '{"columns":["Open","High","Low","Close","Volume","Adj Close"],"index":[1262563200000,1262649600000,1262736000000,1262822400000,1262908800000,1263168000000,1263254400000,1263340800000,1263427200000,1263513600000],"data":[[626.95,629.51,624.24,626.75,1956200.0,626.75],[627.18,627.84,621.54,623.99,3004700.0,623.99],[625.86,625.86,606.36,608.26,3978700.0,608.26],[609.4,610.0,592.65,594.1,6414300.0,594.1],[592.0,603.25,589.11,602.02,4724300.0,602.02],[604.46,604.46,594.04,601.11,7212900.0,601.11],[597.65,598.16,588.0,590.48,4853300.0,590.48],[576.49,588.38,573.9,587.09,6496600.0,587.09],[583.9,594.2,582.81,589.85,4240100.0,589.85],[593.34,593.56,578.04,580.0,5434500.0,580.0]]}'
In [21]: df.to_json(orient='records')
Out[21]: '[{"Open":626.95,"High":629.51,"Low":624.24,"Close":626.75,"Volume":1956200.0,"Adj Close":626.75},{"Open":627.18,"High":627.84,"Low":621.54,"Close":623.99,"Volume":3004700.0,"Adj Close":623.99},{"Open":625.86,"High":625.86,"Low":606.36,"Close":608.26,"Volume":3978700.0,"Adj Close":608.26},{"Open":609.4,"High":610.0,"Low":592.65,"Close":594.1,"Volume":6414300.0,"Adj Close":594.1},{"Open":592.0,"High":603.25,"Low":589.11,"Close":602.02,"Volume":4724300.0,"Adj Close":602.02},{"Open":604.46,"High":604.46,"Low":594.04,"Close":601.11,"Volume":7212900.0,"Adj Close":601.11},{"Open":597.65,"High":598.16,"Low":588.0,"Close":590.48,"Volume":4853300.0,"Adj Close":590.48},{"Open":576.49,"High":588.38,"Low":573.9,"Close":587.09,"Volume":6496600.0,"Adj Close":587.09},{"Open":583.9,"High":594.2,"Low":582.81,"Close":589.85,"Volume":4240100.0,"Adj Close":589.85},{"Open":593.34,"High":593.56,"Low":578.04,"Close":580.0,"Volume":5434500.0,"Adj Close":580.0}]'
In [23]: df.to_json(orient='values')
Out[23]: '[[626.95,629.51,624.24,626.75,1956200.0,626.75],[627.18,627.84,621.54,623.99,3004700.0,623.99],[625.86,625.86,606.36,608.26,3978700.0,608.26],[609.4,610.0,592.65,594.1,6414300.0,594.1],[592.0,603.25,589.11,602.02,4724300.0,602.02],[604.46,604.46,594.04,601.11,7212900.0,601.11],[597.65,598.16,588.0,590.48,4853300.0,590.48],[576.49,588.38,573.9,587.09,6496600.0,587.09],[583.9,594.2,582.81,589.85,4240100.0,589.85],[593.34,593.56,578.04,580.0,5434500.0,580.0]]'
Last update to this was 2 years ago. I'm assuming it's still open? I have a multi-index situation: the write works, but reading it back does not restore the headers.
If you want a more round-trippable JSON format, this is in upcoming 0.20.0.: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#whatsnew-0200-enhancements-table-schema |
Or you can just write the json with orient='table' and read the multi-index json back with the same orient.
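A minimal sketch of that round trip, assuming pandas with Table Schema support (writing landed in 0.20, reading in 0.23; the StringIO wrapper is for newer pandas versions that deprecate passing literal JSON strings to read_json):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]},
    index=pd.MultiIndex.from_tuples(
        [("foo", "one"), ("foo", "two"), ("bar", "one")],
        names=["first", "second"],
    ),
)

s = df.to_json(orient="table")  # embeds a JSON Table Schema with field names
restored = pd.read_json(StringIO(s), orient="table")

# The MultiIndex, including its level names, survives the trip
print(restored.index.names)
```

Note the schema requires the index levels to be named; unlike orient='split', the "table" payload records those names, which is why the round trip works.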
Here is a fancy hack if you have a good reason not to use orient='table'. It stringifies the MultiIndex row labels before writing and rebuilds them after reading:

```python
import ast

import pandas as pd


def to_json_split_roundtrip(df):
    # Stringify the MultiIndex tuples so they survive orient='split'
    if isinstance(df.index, pd.MultiIndex):
        values = [repr(v) for v in df.index.ravel()]
        df = df.set_index(pd.Index(values))
    return df.to_json(orient="split")


def from_json_split_roundtrip(j):
    df = pd.read_json(j, orient="split")
    # Labels that look like tuples signal a stringified MultiIndex
    if df.index[0].startswith('('):
        df.index = pd.MultiIndex.from_tuples(ast.literal_eval(t) for t in df.index)
    return df
```

Example use:

```python
df = (pd.DataFrame([['1', 2, 3, 4], ['1', 4, 5, 6], ['7', 4, 9, 10]],
                   columns=["a", "b", "c", "d"])
      .groupby(["a", "b"]).sum())
display(df)
print(df.index)
as_json = to_json_split_roundtrip(df)
# send it around
r = from_json_split_roundtrip(as_json)
display(r)
print(r.index)
```

So, a built-in solution (and/or a more efficient implementation) would still be welcome. |
There seems to be a mixup in handling the MultiIndex columns between read_json and to_json. As a solution you can manually rearrange the MultiIndex column order in the json data.
The above solution from @Helgeb correctly decodes multi-index dataframes but seems to fail on normal single-index dataframes. I propose a different workaround to correctly decode multi-index dataframes from serialized json strings:

```python
import ast
import json

import numpy as np
import pandas as pd


def create_dataframe_from_json(df: str) -> pd.DataFrame:
    """Decode dataframe from serialized json string.

    Safely evaluates multi-index keys (tuple strings) as tuples
    before converting the dict to a dataframe.
    """
    df = json.loads(df)
    parsed_df = {}
    for k, v in df.items():
        try:
            parsed_key = ast.literal_eval(k)
            if not isinstance(parsed_key, tuple):
                parsed_key = k
        except (ValueError, SyntaxError):
            parsed_key = k
        parsed_df[parsed_key] = v
    return pd.DataFrame(parsed_df)


def main():
    inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
    df1 = pd.DataFrame(inp)
    df2 = pd.DataFrame(np.random.random((4, 4)))
    df2.columns = pd.MultiIndex.from_product([[1, 2], ['A', 'B']])
    df3 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    df3.columns = pd.MultiIndex.from_tuples([('A', 1), ('C', 'D')])
    for df in [df1, df2, df3]:
        # original dataframe
        print(df)
        # serialize to json
        json_str = df.to_json()
        # wrong decoding for multi-index dataframes
        wrong_df = pd.read_json(json_str)
        print(wrong_df)
        # correct decoding from json (also for multi-index dataframes)
        correct_df = create_dataframe_from_json(json_str)
        print(correct_df)


if __name__ == '__main__':
    main()
```
I'm not sure what the status is on this issue (I'm stuck on some older version of pandas here), but in case someone else runs into the same issue, I ended up using a variant of @Helgeb's solution, which converts both index and columns, and supports both MultiIndex and Index.

```python
import json

import pandas as pd


def deserialize_multiindex_dataframe(dataframe_json: str) -> pd.DataFrame:
    """Deserialize the dataframe json into a dataframe object.

    The dataframe json must be generated with DataFrame.to_json(orient="split").
    This function is to address the fact that `pd.read_json()` isn't behaving
    correctly (yet): https://github.com/pandas-dev/pandas/issues/4889
    """
    def convert_index(json_obj):
        to_tuples = [tuple(i) if isinstance(i, list) else i for i in json_obj]
        if all(isinstance(i, list) for i in json_obj):
            return pd.MultiIndex.from_tuples(to_tuples)
        else:
            return pd.Index(to_tuples)

    json_dict = json.loads(dataframe_json)
    columns = convert_index(json_dict['columns'])
    index = convert_index(json_dict['index'])
    return pd.DataFrame(json_dict["data"], index, columns)
```
cc @Komnomnomnom
I'm using a recent anaconda build on Windows, which includes pandas v0.12.0. I'm attempting to use a multi-index header, write it out to a json file, import it, and get back the same formatted dataframe. The column headers come out stacked vertically instead.

```python
import numpy as np
import pandas as pd

pd.set_option('html', False)

index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
                              ['one', 'two', 'three']],
                      labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
                              [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
                      names=['foo', 'bar'])
df = pd.DataFrame(np.random.randn(10, 3), index=index,
                  columns=['A', 'B', 'C'])
dft = df.transpose()
dft
# raw strings here: '\x' in a plain string literal is an invalid escape
dft.to_json(r'D:\xxmt.json', orient='split')
dftj = pd.read_json(r'D:\xxmt.json', orient='split')
dftj
```