ENH: `json_normalize` flatten lists as well #42311

cosama · 2021-06-29T18:19:31Z

Problem

Right now json_normalize will leave lists encountered within dictionaries intact:

import pandas as pd
df = pd.json_normalize([{"a": [1, 1]}, {"a": [1, 2]}])
print(df)

output:

        a
0  [1, 1]
1  [1, 2]

Each entry is a list object in this case, which I don't find very useful. For example, to do anything with the first element of each row I would first have to convert the column into yet another DataFrame, with something like:

df2 = pd.DataFrame({f"a.{k}": [i[k] for i in df['a']] for k in range(len(df['a'][0]))})
print(df2)

output:

   a.0  a.1
0    1    1
1    1    2

Solution

I think it would be really useful to have a flag that enables flattening lists directly, something like json_normalize(data, flatten_list=True). The list index would then be used as a string in the record name, e.g. "a.0.b", "a.1.b", etc.
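In the meantime, the naming scheme described above can be approximated by pre-processing the records before they reach json_normalize. This is only a sketch: `lists_to_dicts` is a hypothetical helper, not part of pandas, and it makes no attempt to treat empty lists specially.

```python
import pandas as pd


def lists_to_dicts(obj):
    """Recursively replace each list with a dict keyed by the stringified index.

    Hypothetical helper for illustration only; not a pandas API.
    """
    if isinstance(obj, dict):
        return {key: lists_to_dicts(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return {str(i): lists_to_dicts(value) for i, value in enumerate(obj)}
    return obj


records = [
    {"a": [{"b": 1}, {"b": 2}]},
    {"a": [{"b": 3}, {"b": 4}]},
]

# json_normalize already flattens nested dicts using "." as the separator,
# so the list positions end up inside the column names.
flat = pd.json_normalize([lists_to_dicts(record) for record in records])
print(flat)
```

Because the lists are converted before pandas sees them, this costs an extra pass over the data, which is the overhead the proposed flag is meant to avoid.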

API breaking implications

I don't think this would break any API.

Alternatives

There are a few packages that already have some of this ability, but they require additional dependencies and intermediate products, which slows down conversion:


simonjayhawkins · 2022-06-02T16:55:25Z

Thanks @cosama for the suggestion. We have a high bar for expanding an already extensive API.

> I am not really sure how this is of any use really. If I for example like to do anything with the first element of each row I would have to convert this first into yet another DataFrame with something like:

pandas already has some methods that make working with nested data easier. For instance, DataFrame.explode() (https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.explode.html) could be used to reshape the data:

result = df.explode("a")
result["i"] = result.groupby(level=0).cumcount()
result = result.reset_index().set_index(["index", "i"]).unstack()

       a   
i      0  1
index      
0      1  1
1      1  2

which would allow easier indexing using the MultiIndex (or combine with another step to rename and get a single-level Index if you really need the columns labelled 'a.0', 'a.1', ...).
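The extra renaming step could look roughly like the following. It repeats the explode() pipeline from above and then joins the two column levels with a "."; the separator and the choice to drop the index name are assumptions made to match the 'a.0', 'a.1' labels from the original post.

```python
import pandas as pd

df = pd.json_normalize([{"a": [1, 1]}, {"a": [1, 2]}])

# Same reshaping as above: one row per list item, then pivot the
# per-row position "i" into a second column level.
result = df.explode("a")
result["i"] = result.groupby(level=0).cumcount()
result = result.reset_index().set_index(["index", "i"]).unstack()

# Collapse the ("a", 0), ("a", 1) MultiIndex into "a.0", "a.1".
result.columns = [f"{name}.{position}" for name, position in result.columns]
result.index.name = None
print(result)
```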

There is also the str accessor which, despite its name, can also be used to access list items and does not fail on out-of-range indexing.

df = pd.json_normalize([{"a": [1, 1]}, {"a": [1]}])
df.a.str[1]

0    1.0
1    NaN
Name: a, dtype: float64
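Building on the accessor, the 'a.0', 'a.1' layout from the original post could be sketched for lists of unequal length as follows. This assumes every entry in the column is a list; the names `width` and `wide` are only illustrative.

```python
import pandas as pd

df = pd.json_normalize([{"a": [1, 1]}, {"a": [1]}])

# The longest list in the column decides how many output columns are needed.
width = int(df["a"].str.len().max())

# .str[k] gives NaN where a list has no element k instead of raising.
wide = pd.DataFrame({f"a.{k}": df["a"].str[k] for k in range(width)})
print(wide)
```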

There are probably more elegant ways of doing this, but given that the required reshaping or item access can be done using other pandas methods, or the user could pre-process their JSON files instead, it may not be worth adding additional complexity to pd.json_normalize.

For instance, what would be the expected output of the proposed enhancement if the data contained only empty lists (xref #47182)? There are probably other edge cases that I've failed to consider.
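As one point of comparison for the empty-list case, DataFrame.explode() already defines a behaviour: an empty list becomes a single NaN row. The sketch below only illustrates that existing behaviour; it says nothing about what a flattening option in json_normalize should return.

```python
import pandas as pd

# A column that contains only empty lists.
df = pd.DataFrame({"a": [[], []]})

# explode() keeps one row per empty list and fills it with NaN.
exploded = df.explode("a")
print(exploded)
```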

cosama added the Enhancement and Needs Triage labels Jun 29, 2021

mroeschke added the IO JSON and Nested Data labels and removed the Needs Triage label Aug 21, 2021

marickmanrho mentioned this issue May 7, 2023

BUG: json_normalize does not parse nested lists consistently #53126 (Open)
