Skip to content

ENH: json_normalize flatten lists as well #42311

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
cosama opened this issue Jun 29, 2021 · 1 comment
Open

ENH: json_normalize flatten lists as well #42311

cosama opened this issue Jun 29, 2021 · 1 comment
Labels
Enhancement IO JSON read_json, to_json, json_normalize Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.).

Comments

@cosama
Copy link

cosama commented Jun 29, 2021

Problem

Right now json_normalize will leave lists encountered within dictionaries intact:

import pandas as pd
df = pd.json_normalize([{"a": [1, 1]}, {"a": [1, 2]}])
print(df)

output:

        a
0  [1, 1]
1  [1, 2]

Each entry is a list object in this case. I am not really sure how this is of any use really. If I for example like to do anything with the first element of each row I would have to convert this first into yet another DataFrame with something like:

df2 = pd.DataFrame({f"a.{k}": [i[k] for i in df['a']] for k in range(len(df['a'][0]))})
print(df2)

output

   a.0  a.1
0    1    1
1    1    2

Solution

It would be really useful I think, if there is a flag or something that would enable to directly flatten lists as well. Something like json_normalize(data, flatten_list=True). The list index is then used as a string in the record name, e.g. "a.0.b", "a.1.b" etc.

API breaking implications

Don't think this would break any API.

Alternatives

There are a few packages that already have some of this ability, but require additional dependencies and intermediate products, so are slowing down conversion:

@cosama cosama added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 29, 2021
@mroeschke mroeschke added IO JSON read_json, to_json, json_normalize Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 21, 2021
@simonjayhawkins
Copy link
Member

thanks @cosama for the suggestion. We have a high bar for expanding an already extensive api.

I am not really sure how this is of any use really. If I for example like to do anything with the first element of each row I would have to convert this first into yet another DataFrame with something like:

pandas already has some methods making working with nested data easier, for instance DataFrame.explode() https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.explode.html could be used to reshape the data.

result = df.explode("a")
result["i"] = result.groupby(level=0).cumcount()
result = result.reset_index().set_index(["index", "i"]).unstack()
       a   
i      0  1
index      
0      1  1
1      1  2

which would allow easier indexing using the MultiIndex, (or combine with another step to rename and get a single level Index if really need the columns labelled 'a.0', 'a.1', ...

There is also the str accessor, despite it's name can also be used to access list items and not fail for out of range indexing.

df = pd.json_normalize([{"a": [1, 1]}, {"a": [1]}])
df.a.str[1]
0    1.0
1    NaN
Name: a, dtype: float64

There are probably more elegant ways of doing this, but given the required reshaping or item access can be done using other pandas methods, or the user could pre-process their json files instead, it maybe not worth adding additional complexity to pd.json_normalize.

for instance, what would be the expected output of the proposed enhancement if the data contained only empty lists (xref #47182) and there are probably other edge cases that i've failed to consider.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO JSON read_json, to_json, json_normalize Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.).
Projects
None yet
Development

No branches or pull requests

3 participants