json_normalize has inconsistent behaviors while flattening nested array elements #21537

Closed
vuminhle opened this issue Jun 19, 2018 · 6 comments · Fixed by #29225
Labels: good first issue, Needs Tests

@vuminhle
Contributor

Code Sample, a copy-pastable example if possible

from pandas.io.json import json_normalize

df1 = json_normalize([{'A': {'B': 1}}])
df2 = json_normalize({'dummy': [{'A': {'B': 1}}]}, 'dummy')
print(df1)
print(df2)

Problem description

The above code produces:

   A.B
0    1
          A
0  {'B': 1}

json_normalize appears to flatten recursively only when the records are the top-level array, as in the first call.
In the second call, where the records are reached through a record path, it flattens only the first level. I think the second call should behave like the first and produce the same output.

Expected Output

   A.B
0    1
   A.B
0    1
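
The expected behavior can be illustrated without pandas. The following is a minimal sketch, not pandas' implementation: flatten_record and normalize are hypothetical helpers that apply the same recursive flattening whether the records are the top-level list or a list stored under a record path key.

```python
def flatten_record(record, sep=".", prefix=""):
    """Recursively flatten nested dicts into a single-level dict."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Descend into nested dicts, carrying the joined key as prefix.
            flat.update(flatten_record(value, sep=sep, prefix=name))
        else:
            flat[name] = value
    return flat


def normalize(data, record_path=None, sep="."):
    """Flatten every record the same way, with or without a record path."""
    records = data if record_path is None else data[record_path]
    return [flatten_record(r, sep=sep) for r in records]


rows1 = normalize([{'A': {'B': 1}}])
rows2 = normalize({'dummy': [{'A': {'B': 1}}]}, 'dummy')
print(rows1)  # [{'A.B': 1}]
print(rows2)  # [{'A.B': 1}]
```

Both calls yield a single 'A.B' key, which is the consistency this issue asks for.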

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.1
pytest: 3.6.1
pip: 10.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung added the IO Data, IO JSON, and API Design labels on Jun 19, 2018
@gfyoung
Member

gfyoung commented Jun 19, 2018

That seems like a reasonable argument to me.

cc @jreback @WillAyd

@bpben

bpben commented Mar 6, 2019

@gfyoung I'm interested in contributing here. It seems like it's related to this TODO in json_normalize:

# TODO: handle record value which are lists, at least error
# reasonably
data = nested_to_record(data, sep=sep)

Generally, I tend to have issues with lists of records within a key, which I believe is related to this behavior. For example:

import pandas as pd

data = {'a': 1, 'b': 2, 'c': {'d': [{'a': 1}, {'b': 2}], 'e': [{'a': 1}, {'b': 2}]}}
pd.io.json.json_normalize(data, record_path=['c'])

Problem:
From the above code I get

0 0
0 d
1 e

Expected:

0 d.a d.b e.a e.b
0 1 2 1 2

This issue seems like the closest match, but I can open a separate one. I'd also be a new contributor, so I'm not sure whether I'm throwing myself in at the deep end here. Any thoughts or suggestions?

@WillAyd
Member

WillAyd commented Mar 6, 2019

I think #23861 already solves this; the PR just needs to be updated.

@bpben

bpben commented Mar 6, 2019

Interesting, it does seem to solve @vuminhle's issue, where the value is a list. However, if the value is a dictionary, this part:

recs = [nested_to_record(r, sep=sep,
                         max_level=max_level,
                         ignore_keys=ignore_keys)
        if isinstance(r, dict) else r for r in recs]

seems to be the reason it returns just the keys. Iterating over a dictionary yields its keys, so the output contains only the keys rather than the dictionary's items.
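
The key-iteration effect is plain Python behavior and can be reproduced in isolation. In this sketch, recs_list and recs_dict are illustrative names standing in for the value found at the record path; the snippet is not taken from pandas.

```python
# When the record path points at a list of dicts, iteration yields the dicts,
# so each element would pass an isinstance(r, dict) check and get flattened.
recs_list = [{'a': 1}, {'b': 2}]
from_list = [r for r in recs_list]
print(from_list)  # [{'a': 1}, {'b': 2}]

# When it points at a dict, iteration yields only the keys (strings), so the
# isinstance(r, dict) check never matches and the keys pass through as-is.
recs_dict = {'d': [{'a': 1}, {'b': 2}], 'e': [{'a': 1}, {'b': 2}]}
from_dict = [r for r in recs_dict]
print(from_dict)  # ['d', 'e']
```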

Any thoughts on this? Should I add this as a comment to the PR?

@gfyoung
Member

gfyoung commented Mar 6, 2019

@bpben : Feel free to comment on the PR. However, it seems like that PR will go stale, so at some point (probably within a week), feel free to take the changes from that PR to create your own.

@mroeschke
Member

This looks fixed on master. Could use a test to close:

In [24]: from pandas.io.json import json_normalize
    ...:
    ...: df1 = json_normalize([{'A': {'B': 1}}])
    ...: df2 = json_normalize({'dummy': [{'A': {'B': 1}}]}, 'dummy')
    ...: print(df1)
    ...: print(df2)
   A.B
0    1
   A.B
0    1

In [25]: pd.__version__
Out[25]: '0.26.0.dev0+627.gef77b5700'

@mroeschke added the good first issue and Needs Tests labels and removed the API Design, IO Data, and IO JSON labels on Oct 22, 2019
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 25, 2019
@gfyoung gfyoung added this to the 1.0 milestone Oct 29, 2019