BUG: json_normalize does not handle combining nested record_path and nested meta as expected #40514

maegul · 2021-03-19T07:31:17Z

[X ] I have checked that this issue has not already been reported.
[ X] I have confirmed this bug exists on the latest version of pandas.
[X ] (optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

# dummy data
test_data = {'a': 1, 
             'b': 2, 
             'c': {'cc': [333, 444]},   # records
             'd': {'dd': 'meta'}  # meta
            }

# this works fine
basic_meta = pd.json_normalize(test_data, ['c', 'cc'], meta=['a'])

# this too
basic_meta = pd.json_normalize(test_data, ['c', 'cc'], meta=['a', ['d']])

# this does not
# exception: KeyError: "Try running with errors='ignore' as key 'dd' is not always present"
nested_meta = pd.json_normalize(test_data, ['c', 'cc'], meta=['a', ['d', 'dd']])

Problem description

See related #27220

I don't think this behaviour is expected or documented.

I think the expected behaviour would be to produce a 'd.dd' column containing 'meta' strings. Mostly because this would be the result if the record_path were not nested.

Having a shot at debugging the code in pandas/io/json/_normalize.py, I suspect I know where the problem is.

In the function _recursive_extract(), at the top, we have:

def _recursive_extract(data, path, seen_meta, level=0):
    if isinstance(data, dict):
        data = [data]
    if len(path) > 1:
        for obj in data:
            for val, key in zip(_meta, meta_keys):
                if level + 1 == len(val):
                    seen_meta[key] = _pull_field(obj, val[-1])

           # problematic line     V--- object is recursed into, which cuts off remaining metadata that also needs to be recursed into
            _recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1)

In the line at the bottom of the extract above, the data object is recursed into, because the record_path (the path variable here) is nested.

But as a result the next iteration only has access to the next level of the record_path and nothing else including the whole of the meta data that also needs to be recursed into. The remaining code in _recursive_extract tries to perform just this recursion into the remaining metadata, but can't find any metadata and so fails.

This can be more or less proven by an absurd example that doesn't fail.

Here, the meta path will be partially (but not completely) duplicated within the record_path.

test_data = {'a': 1, 
             'b': 2, 
             'c': {'cc': [333, 444], 'dd': 'wrong_meta'},   # 'dd' is a duplication of the second level of the meta path
             'd': {'dd': 'meta'}  # original meta_path
            }

And now the code from above will not fail, but produce the wrong but different output again.

bad_meta = pd.json_normalize(test_data, ['c', 'cc'], meta=['a', ['d', 'dd']])

This produces the table below where the 'dd' key within the dict at 'c' has been taken up as the metadata because it does survive the recursion process.

	0	a	d.dd
0	333	1	wrong_meta
1	444	1	wrong_meta

Though I'm not on top of this code base, the solution could be adjusting _recursive_extract to take a separate metadata argument so that it can be separately recursed into, or, somehow separating the treatment of records and metadata at the appropriate node in the tree of the data.

Expected Output

To produce a 'd.dd' column containing 'meta' strings.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : f2c8480
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.186-146.268.amzn2.x86_64
Version : #1 SMP Tue Jul 14 18:16:52 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.3
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0.post20200712
Cython : 0.29.21
pytest : 6.2.1
hypothesis : 6.0.2
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 0.17.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0

The text was updated successfully, but these errors were encountered:

simonjayhawkins · 2022-06-03T10:45:20Z

Thanks @maegul for the report.

from #34465 (comment)

This stems by the way from the same problem as #40514 (one might call them duplicates):

will close this as a duplicate to help keep discussion in one place.

maegul added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 19, 2021

jbrockmendel added the IO JSON read_json, to_json, json_normalize label Jun 6, 2021

mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Aug 19, 2021

tsteffek mentioned this issue Mar 25, 2022

BUG: unexpected behavior of json_normalize meta arg #34465

Open

3 tasks

simonjayhawkins closed this as completed Jun 3, 2022

simonjayhawkins added the Duplicate Report Duplicate issue or pull request label Jun 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: json_normalize does not handle combining nested record_path and nested meta as expected #40514

BUG: json_normalize does not handle combining nested record_path and nested meta as expected #40514

maegul commented Mar 19, 2021 •

edited

Loading

INSTALLED VERSIONS

simonjayhawkins commented Jun 3, 2022

BUG: json_normalize does not handle combining nested record_path and nested meta as expected #40514

BUG: json_normalize does not handle combining nested record_path and nested meta as expected #40514

Comments

maegul commented Mar 19, 2021 • edited Loading

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

simonjayhawkins commented Jun 3, 2022

maegul commented Mar 19, 2021 •

edited

Loading

Output of `pd.show_versions()`