Skip to content

BUG: json_normalize does not handle combining nested record_path and nested meta as expected #40514

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
maegul opened this issue Mar 19, 2021 · 1 comment
Labels
Bug Duplicate Report Duplicate issue or pull request IO JSON read_json, to_json, json_normalize

Comments

@maegul
Copy link

maegul commented Mar 19, 2021

  • [X ] I have checked that this issue has not already been reported.

  • [ X] I have confirmed this bug exists on the latest version of pandas.

  • [X ] (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

# dummy data
test_data = {'a': 1, 
             'b': 2, 
             'c': {'cc': [333, 444]},   # records
             'd': {'dd': 'meta'}  # meta
            }

# this works fine
basic_meta = pd.json_normalize(test_data, ['c', 'cc'], meta=['a'])

# this too
basic_meta = pd.json_normalize(test_data, ['c', 'cc'], meta=['a', ['d']])

# this does not
# exception: KeyError: "Try running with errors='ignore' as key 'dd' is not always present"
nested_meta = pd.json_normalize(test_data, ['c', 'cc'], meta=['a', ['d', 'dd']])

Problem description

See related #27220

I don't think this behaviour is expected or documented.

I think the expected behaviour would be to produce a 'd.dd' column containing 'meta' strings. Mostly because this would be the result if the record_path were not nested.

Having a shot at debugging the code in pandas/io/json/_normalize.py, I suspect I know where the problem is.

In the function _recursive_extract(), at the top, we have:

def _recursive_extract(data, path, seen_meta, level=0):
    if isinstance(data, dict):
        data = [data]
    if len(path) > 1:
        for obj in data:
            for val, key in zip(_meta, meta_keys):
                if level + 1 == len(val):
                    seen_meta[key] = _pull_field(obj, val[-1])

           # problematic line     V--- object is recursed into, which cuts off remaining metadata that also needs to be recursed into
            _recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1)

In the line at the bottom of the extract above, the data object is recursed into, because the record_path (the path variable here) is nested.

But as a result the next iteration only has access to the next level of the record_path and nothing else including the whole of the meta data that also needs to be recursed into. The remaining code in _recursive_extract tries to perform just this recursion into the remaining metadata, but can't find any metadata and so fails.

This can be more or less proven by an absurd example that doesn't fail.

Here, the meta path will be partially (but not completely) duplicated within the record_path.

test_data = {'a': 1, 
             'b': 2, 
             'c': {'cc': [333, 444], 'dd': 'wrong_meta'},   # 'dd' is a duplication of the second level of the meta path
             'd': {'dd': 'meta'}  # original meta_path
            }

And now the code from above will not fail, but produce the wrong but different output again.

bad_meta = pd.json_normalize(test_data, ['c', 'cc'], meta=['a', ['d', 'dd']])

This produces the table below where the 'dd' key within the dict at 'c' has been taken up as the metadata because it does survive the recursion process.

0 a d.dd
0 333 1 wrong_meta
1 444 1 wrong_meta

Though I'm not on top of this code base, the solution could be adjusting _recursive_extract to take a separate metadata argument so that it can be separately recursed into, or, somehow separating the treatment of records and metadata at the appropriate node in the tree of the data.

Expected Output

To produce a 'd.dd' column containing 'meta' strings.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2c8480
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.186-146.268.amzn2.x86_64
Version : #1 SMP Tue Jul 14 18:16:52 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.3
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0.post20200712
Cython : 0.29.21
pytest : 6.2.1
hypothesis : 6.0.2
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 0.17.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0

@maegul maegul added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 19, 2021
@jbrockmendel jbrockmendel added the IO JSON read_json, to_json, json_normalize label Jun 6, 2021
@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Aug 19, 2021
@simonjayhawkins
Copy link
Member

Thanks @maegul for the report.

from #34465 (comment)

This stems by the way from the same problem as #40514 (one might call them duplicates):

will close this as a duplicate to help keep discussion in one place.

@simonjayhawkins simonjayhawkins added the Duplicate Report Duplicate issue or pull request label Jun 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Duplicate Report Duplicate issue or pull request IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

No branches or pull requests

4 participants