BUG: unexpected behavior of json_normalize meta arg #34465

evan-anderson-ts · 2020-05-29T21:23:38Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

test_input = [{
    "injection": {
        "time": "injection time"
    },
    "results": [
        {
            "time": "result time",
            "peaks": [{
                "name": "peak name A",
                "area": 1
            }, {
                "name": "peak name B",
                "area": 2
            }]
        },
    ]
}]

# meta asks for $.injection.time, while record_path descends into results -> peaks
df = pd.json_normalize(test_input, record_path=["results", "peaks"], meta=[["injection", "time"]])
print(df)

Output
          name  area injection.time
0  peak name A     1    result time
1  peak name B     2    result time

Problem description

I expected the string "injection time" to populate the injection.time column (i.e. the value at $.injection.time). Instead, the value of $.results.time populates this column. If I specify meta=["injection"], then {"time": "injection time"} correctly populates the column.

Expected Output

          name  area injection.time
0  peak name A     1 injection time
1  peak name B     2 injection time

Output of `pd.show_versions()`

/usr/local/lib/python3.6/dist-packages/psycopg2/init.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: http://initd.org/psycopg/docs/install.html#binary-install-from-pypi.
""")
/usr/local/lib/python3.6/dist-packages/pandas_datareader/compat/init.py:7: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
from pandas.util.testing import assert_frame_equal

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.104+
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.4
pytz : 2018.9
dateutil : 2.8.1
pip : 19.3.1
setuptools : 46.4.0
Cython : 0.29.18
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.5
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 5.5.0
pandas_datareader: 0.8.1
bs4 : 4.6.3
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.2.6
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 2.5.9
pandas_gbq : 0.11.0
pyarrow : 0.14.1
pytables : None
pytest : 3.6.4
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.4.4
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : None
numba : 0.48.0


WillAyd · 2020-05-29T21:48:56Z

Definitely looks buggy. The code for json_normalize is here if it's something you are interested in debugging:

pandas/pandas/io/json/_normalize.py

Line 112 in 1cad9e5

def _json_normalize(

evan-anderson-ts · 2020-07-01T22:35:53Z

@WillAyd I had an opportunity to look at the json_normalize code. It seems like it currently only looks for metadata within the record path. Since both $.injection.time and $.results.time have the "time" key, we unintentionally grab "results.time" instead of "injection.time".

To showcase this behavior, I'll show what happens if I change the input data to have the key "results.other_time" rather than "results.time":

test_input = [{
    "injection": {
        "time": "injection time"
    },
    "results": [
        {
            "other_time": "result time",
            "peaks": [{
                "name": "peak name A",
                "area": 1
            }, {
                "name": "peak name B",
                "area": 2
            }]
        },
    ]
}]

df = pd.json_normalize(test_input, record_path=['results', "peaks"], meta=[["injection", "time"]])
print(df)

Output:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/io/json/_normalize.py in _recursive_extract(data, path, seen_meta, level)
    327                         try:
--> 328                             meta_val = _pull_field(obj, val[level:])
    329                         except KeyError as e:

4 frames
/usr/local/lib/python3.6/dist-packages/pandas/io/json/_normalize.py in _pull_field(js, spec)
    240             for field in spec:
--> 241                 result = result[field]
    242         else:

KeyError: 'time'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-d93efff32225> in <module>()
     17 }]
     18 
---> 19 df = pd.json_normalize(test_input, record_path=['results', "peaks"], meta=[["injection", "time"]])
     20 print(df)

/usr/local/lib/python3.6/dist-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
    339                 records.extend(recs)
    340 
--> 341     _recursive_extract(data, record_path, {}, level=0)
    342 
    343     result = DataFrame(records)

/usr/local/lib/python3.6/dist-packages/pandas/io/json/_normalize.py in _recursive_extract(data, path, seen_meta, level)
    308                         seen_meta[key] = _pull_field(obj, val[-1])
    309 
--> 310                 _recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1)
    311         else:
    312             for obj in data:

/usr/local/lib/python3.6/dist-packages/pandas/io/json/_normalize.py in _recursive_extract(data, path, seen_meta, level)
    332                             else:
    333                                 raise KeyError(
--> 334                                     "Try running with "
    335                                     "errors='ignore' as key "
    336                                     f"{e} is not always present"

KeyError: "Try running with errors='ignore' as key 'time' is not always present"

The logic to implement the behavior I expected is probably pretty tricky, not a simple fix. For now, it might be enough to document this behavior and raise an error in the edge case I found (i.e. make sure that we don't accidentally grab the wrong metadata because a key is shared between different branches of the JSON).

olegbonar · 2020-11-26T10:28:07Z

+1, faced this unexpected behavior today.

tsteffek · 2022-03-25T13:37:37Z

Encountered this today. Interesting to see that this still exists in version 1.4.1.

By the way, this stems from the same problem as #40514 (one might call them duplicates): _recursive_extract for some reason assumes it's fine to ignore the first level of a meta path once it is already a level down in record_path.
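To make the slicing problem concrete, here is a minimal pure-Python sketch. It assumes the lookup behaves like the `_pull_field(obj, val[level:])` line shown in the traceback earlier in this thread; `pull_field` below is a hypothetical stand-in written for illustration, not the pandas function itself.

```python
# Hypothetical helper that mimics a key-by-key lookup; not pandas' own _pull_field.
def pull_field(obj, spec):
    result = obj
    for field in spec:
        result = result[field]
    return result


record = {
    "injection": {"time": "injection time"},
    "results": [
        {"time": "result time", "peaks": [{"name": "peak name A", "area": 1}]}
    ],
}

meta_path = ["injection", "time"]
level = 1  # depth reached after descending into "results"
nested_obj = record["results"][0]

# Dropping the first `level` keys leaves ["time"], which is then looked up
# on the nested result object rather than on the top-level record.
sliced_value = pull_field(nested_obj, meta_path[level:])

# Resolving the full path against the top-level record gives the expected value.
full_value = pull_field(record, meta_path)

print(sliced_value)  # result time
print(full_value)    # injection time
```

If the nested object has no "time" key at all, the sliced lookup raises KeyError instead, which would match the second example in this thread.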

simonjayhawkins · 2022-06-03T10:52:53Z

> Interesting to see that this still exists in version 1.4.1.

@tsteffek this is an open issue, so not yet fixed.

pandas is a community project. Anyone is welcome to submit a patch or contribute in other ways such as further investigation or suggesting fixes as you have here. Thanks.

leonokubo · 2023-06-13T17:33:41Z

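The problem: the recursive call in _recursive_extract passes level + 1, which apparently causes the leading keys of each meta path to be skipped once the function has descended into record_path.

Solution: resolve each meta path from the top-level record instead.

from functools import reduce
import operator

# _meta, seen_meta and data refer to the variables inside _json_normalize
for obj in _meta:
    # Walk the full meta path from the top level; note that data[0] only covers the first record
    seen_meta[".".join(obj)] = reduce(operator.getitem, obj, data[0])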
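As a possible workaround outside pandas, the same reduce(operator.getitem, ...) idea can be applied per record before building a DataFrame. The sketch below is only an illustration: `flatten_with_root_meta` and its parameters are hypothetical names, it assumes every key in record_path maps to a list of dicts, and unlike the snippet above it resolves meta for each record rather than only data[0].

```python
from functools import reduce
import operator


def flatten_with_root_meta(records, record_path, meta_paths, sep="."):
    """Hypothetical helper (not part of pandas): flatten nested records and
    resolve every meta path from the top-level record."""
    rows = []
    for record in records:
        # Resolve each meta path against the top-level record, never a nested one.
        meta = {
            sep.join(path): reduce(operator.getitem, path, record)
            for path in meta_paths
        }
        # Walk record_path one key at a time, fanning out over each nested list.
        items = [record]
        for key in record_path:
            items = [child for item in items for child in item[key]]
        # Attach the root-level meta values to every leaf record.
        rows.extend({**item, **meta} for item in items)
    return rows


test_input = [{
    "injection": {"time": "injection time"},
    "results": [
        {
            "time": "result time",
            "peaks": [
                {"name": "peak name A", "area": 1},
                {"name": "peak name B", "area": 2},
            ],
        },
    ],
}]

rows = flatten_with_root_meta(
    test_input,
    record_path=["results", "peaks"],
    meta_paths=[["injection", "time"]],
)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) to get the frame from the "Expected Output" section. Nested dicts inside the leaf records are not flattened here, so this does not cover everything json_normalize does.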

evan-anderson-ts added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 29, 2020

WillAyd added IO JSON read_json, to_json, json_normalize and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 29, 2020

WillAyd added this to the Contributions Welcome milestone May 29, 2020

simonjayhawkins mentioned this issue Jun 3, 2022

BUG: json_normalize does not handle combining nested record_path and nested meta as expected #40514

Closed

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
