json_normalize has inconsistent behaviors while flattening nested array elements #21537

vuminhle · 2018-06-19T08:07:09Z

Code Sample, a copy-pastable example if possible

from pandas.io.json import json_normalize

df1 = json_normalize([{'A': {'B': 1}}])
df2 = json_normalize({'dummy': [{'A': {'B': 1}}]}, 'dummy')
print(df1)
print(df2)

Problem description

The above code produces:

	A.B
0	1

	A
0	{'B':1}

It looks like json_normalize recursively flattens only the elements of a top-level array (the first call).
In the second call, where the array is reached through a record path, it flattens only to the first level. I think the second call should behave like the first and produce the same output.

Expected Output

	A.B
0	1

	A.B
0	1
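
For illustration, the recursive flattening that the first call performs can be sketched in plain Python. This is a minimal sketch and not pandas' implementation; `flatten_record` is a hypothetical helper. Applying it to each record found under `'dummy'` gives the `A.B` key expected from the second call.

```python
def flatten_record(record, sep=".", prefix=""):
    """Recursively flatten nested dicts into a single-level dict.

    Hypothetical helper for illustration; not part of pandas.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Descend into nested dicts, carrying the dotted prefix along.
            flat.update(flatten_record(value, sep=sep, prefix=name))
        else:
            flat[name] = value
    return flat


top_level = [{"A": {"B": 1}}]
wrapped = {"dummy": [{"A": {"B": 1}}]}

# Flatten the top-level records, then the records found under 'dummy'.
rows_top = [flatten_record(r) for r in top_level]
rows_nested = [flatten_record(r) for r in wrapped["dummy"]]

print(rows_top)     # [{'A.B': 1}]
print(rows_nested)  # [{'A.B': 1}]
```

Under this sketch, both inputs produce the same rows, which is the consistency the report asks for.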

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.1
pytest: 3.6.1
pip: 10.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: 6.3.1
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


gfyoung · 2018-06-19T08:26:08Z

That seems like a reasonable argument to me.

cc @jreback @WillAyd

bpben · 2019-03-06T17:39:46Z

@gfyoung I'm interested in contributing here. It seems like it's related to this TODO in json_normalize:

pandas/pandas/io/json/normalize.py

Lines 204 to 206 in edb71fd

    
           # TODO: handle record value which are lists, at least error 
        
           #       reasonably 
        
           data = nested_to_record(data, sep=sep)

Generally, I tend to have issues with lists of records within a key, which I believe is related to this behavior. For example:

import pandas as pd
data = {'a': 1, 'b': 2, 'c': {'d': [{'a': 1}, {'b': 2}], 'e': [{'a': 1}, {'b': 2}]}}
pd.io.json.json_normalize(data, record_path=['c'])

Problem:
From the above code I get

0	0
0	d
1	e

Expected:

0	d.a	d.b	e.a	e.b
0	1	2	1	2

My problem seems most closely aligned with this issue, but I can open another one. I'd also be a new contributor, so I'm not sure whether I'm throwing myself too far into the deep end here. Any thoughts/suggestions?

WillAyd · 2019-03-06T17:58:24Z

I think #23861 already solves this; the PR just needs to be updated.

bpben · 2019-03-06T18:44:32Z

Interesting, it does seem to solve @vuminhle's issue, where the value is a list. However, if the value is a dictionary, this part:

pandas/pandas/io/json/normalize.py

Lines 287 to 290 in fca2a27

    
           recs = [nested_to_record(r, sep=sep, 
        
                                    max_level=max_level, 
        
                                    ignore_keys=ignore_keys) 
        
                   if isinstance(r, dict) else r for r in recs]

seems to be the reason it returns just the keys. Iterating over a dictionary yields its keys, so the output contains only the keys rather than the dictionary's items.
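
The key-only result can be reproduced without pandas. A minimal sketch, assuming the record path resolves to a dict rather than a list of dicts; the variable names are illustrative and the list wrapping at the end is just one possible way to treat the dict as a single record, not what the PR does:

```python
# Value found at record_path=['c'] in the earlier example: a dict, not a list.
recs = {"d": [{"a": 1}, {"b": 2}], "e": [{"a": 1}, {"b": 2}]}

# Iterating over a dict yields its keys. The keys are strings, not dicts,
# so an isinstance(r, dict) check would pass them through unchanged.
as_iterated = [r for r in recs]
print(as_iterated)  # ['d', 'e']

# Wrapping the dict in a list makes it a single record instead.
as_single_record = [r for r in [recs]]
print(as_single_record == [recs])  # True
```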

Any thoughts on this? Should I add this as a comment to the PR?

gfyoung · 2019-03-06T23:24:32Z

@bpben : Feel free to comment on the PR. However, it seems like that PR will go stale, so at some point (probably within a week), feel free to take the changes from that PR to create your own.

mroeschke · 2019-10-22T04:32:10Z

This looks fixed on master. Could use a test to close:

In [24]: from pandas.io.json import json_normalize
    ...:
    ...: df1 = json_normalize([{'A': {'B': 1}}])
    ...: df2 = json_normalize({'dummy': [{'A': {'B': 1}}]}, 'dummy')
    ...: print(df1)
    ...: print(df2)
   A.B
0    1
   A.B
0    1

In [25]: pd.__version__
Out[25]: '0.26.0.dev0+627.gef77b5700'
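
A regression test for this could look roughly like the following. This is a sketch, not the test that was eventually added; the function name is hypothetical, and it assumes a pandas version that exposes `pd.json_normalize` at the top level (1.0 or later).

```python
import pandas as pd
import pandas.testing as tm


def check_record_path_flattens_nested_dicts():
    # Hypothetical test name; assumes pandas >= 1.0 for pd.json_normalize.
    record = {"A": {"B": 1}}

    top_level = pd.json_normalize([record])
    via_record_path = pd.json_normalize({"dummy": [record]}, "dummy")

    # Both calls should flatten the nested dict into a single 'A.B' column.
    assert list(via_record_path.columns) == ["A.B"]
    tm.assert_frame_equal(top_level, via_record_path)


check_record_path_flattens_nested_dicts()
```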


gfyoung added the IO Data, IO JSON, and API Design labels Jun 19, 2018

bpben mentioned this issue Mar 7, 2019

Enhanced json normalize #23861

Closed

3 tasks

connormcmk mentioned this issue Jul 3, 2019

json_normalize does not handle nested meta paths when also using a nested record_path #27220

Closed

mroeschke added the good first issue and Needs Tests labels and removed the API Design, IO Data, and IO JSON labels Oct 22, 2019

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 25, 2019

TST: Add test for nested JSON normalization

1a1ca4d

Closes pandas-dev#21537

gfyoung mentioned this issue Oct 25, 2019

TST: Add test for nested JSON normalization #29225

Merged

WillAyd closed this as completed in #29225 Oct 25, 2019

gfyoung added this to the 1.0 milestone Oct 29, 2019
