
BUG: unexpected behavior of json_normalize meta arg #34465

Open
2 of 3 tasks
evan-anderson-ts opened this issue May 29, 2020 · 6 comments
Labels
Bug IO JSON read_json, to_json, json_normalize

Comments

@evan-anderson-ts

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd

test_input = [{
    "injection": {
        "time": "injection time"
    },
    "results": [
      {
        "time": "result time",
        "peaks": [{
           "name": "peak name A",
           "area": 1
        }, {
            "name": "peak name B",
            "area": 2
        }]
      },     
    ]
}]

df = pd.json_normalize(test_input, record_path=['results', "peaks"], meta=[["injection", "time"]])
print(df)

Output

          name  area injection.time
0  peak name A     1    result time
1  peak name B     2    result time

Problem description

I expected the string "injection time" (the value at $.injection.time) to populate the injection.time column. Instead, the column is populated with the value at $.results[*].time. If I specify meta=["injection"] instead, the column is correctly populated with {"time": "injection time"}.

Expected Output

          name  area injection.time
0  peak name A     1    injection time
1  peak name B     2    injection time
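
In the meantime, the expected rows can be built by walking the nested structure by hand. The following is only a minimal sketch for the sample input above, using the standard library alone; `flatten_peaks` is a hypothetical helper name, not a pandas function.

```python
test_input = [{
    "injection": {"time": "injection time"},
    "results": [{
        "time": "result time",
        "peaks": [
            {"name": "peak name A", "area": 1},
            {"name": "peak name B", "area": 2},
        ],
    }],
}]


def flatten_peaks(records):
    """Build one row per peak, reading injection.time from the top-level record."""
    rows = []
    for record in records:
        # Resolved from the root of the record, not from inside "results".
        injection_time = record["injection"]["time"]
        for result in record["results"]:
            for peak in result["peaks"]:
                rows.append({**peak, "injection.time": injection_time})
    return rows


rows = flatten_peaks(test_input)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` to get a frame with the columns shown in the expected output.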

Output of pd.show_versions()

/usr/local/lib/python3.6/dist-packages/psycopg2/init.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: http://initd.org/psycopg/docs/install.html#binary-install-from-pypi.
""")
/usr/local/lib/python3.6/dist-packages/pandas_datareader/compat/init.py:7: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
from pandas.util.testing import assert_frame_equal

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.104+
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.4
pytz : 2018.9
dateutil : 2.8.1
pip : 19.3.1
setuptools : 46.4.0
Cython : 0.29.18
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.5
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 5.5.0
pandas_datareader: 0.8.1
bs4 : 4.6.3
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.2.6
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 2.5.9
pandas_gbq : 0.11.0
pyarrow : 0.14.1
pytables : None
pytest : 3.6.4
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.4.4
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : None
numba : 0.48.0

@evan-anderson-ts evan-anderson-ts added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 29, 2020
@WillAyd
Member

WillAyd commented May 29, 2020

Definitely looks buggy. The code for json_normalize is here, if debugging it is something you are interested in:

def _json_normalize(

@WillAyd WillAyd added IO JSON read_json, to_json, json_normalize and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 29, 2020
@WillAyd WillAyd added this to the Contributions Welcome milestone May 29, 2020
@evan-anderson-ts
Author

evan-anderson-ts commented Jul 1, 2020

@WillAyd I had an opportunity to look at the json_normalize code. It seems like it currently only looks for metadata within the record path. Since both $.injection.time and $.results.time have the "time" key, we unintentionally grab "results.time" instead of "injection.time".
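
A simplified model of that lookup, for illustration only. It assumes the behavior suggested by the `_pull_field(obj, val[level:])` call that appears in the traceback further down; `pull_field` here is a stripped-down stand-in written for this sketch, not the pandas implementation.

```python
def pull_field(obj, spec):
    """Follow the keys in `spec` one by one, starting from `obj`."""
    result = obj
    for field in spec:
        result = result[field]
    return result


record = {
    "injection": {"time": "injection time"},
    "results": [{"time": "result time", "peaks": []}],
}

meta_path = ["injection", "time"]
level = 1                       # depth reached after descending into "results"
current = record["results"][0]  # the object being processed at that depth

# The leading key "injection" is sliced off, so only "time" is looked up,
# and it is looked up on the results entry rather than on the root record.
wrong = pull_field(current, meta_path[level:])
right = pull_field(record, meta_path)

print(wrong)  # result time
print(right)  # injection time
```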

To showcase this behavior, I'll show what happens if I change the input data to have the key "results.other_time" rather than "results.time":

test_input = [{
    "injection": {
        "time": "injection time"
    },
    "results": [
      {
        "other_time": "result time",
        "peaks": [{
           "name": "peak name A",
           "area": 1
        }, {
            "name": "peak name B",
            "area": 2
        }]
      },     
    ]
}]

df = pd.json_normalize(test_input, record_path=['results', "peaks"], meta=[["injection", "time"]])
print(df)

Output:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/io/json/_normalize.py in _recursive_extract(data, path, seen_meta, level)
    327                         try:
--> 328                             meta_val = _pull_field(obj, val[level:])
    329                         except KeyError as e:

4 frames
/usr/local/lib/python3.6/dist-packages/pandas/io/json/_normalize.py in _pull_field(js, spec)
    240             for field in spec:
--> 241                 result = result[field]
    242         else:

KeyError: 'time'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-d93efff32225> in <module>()
     17 }]
     18 
---> 19 df = pd.json_normalize(test_input, record_path=['results', "peaks"], meta=[["injection", "time"]])
     20 print(df)

/usr/local/lib/python3.6/dist-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
    339                 records.extend(recs)
    340 
--> 341     _recursive_extract(data, record_path, {}, level=0)
    342 
    343     result = DataFrame(records)

/usr/local/lib/python3.6/dist-packages/pandas/io/json/_normalize.py in _recursive_extract(data, path, seen_meta, level)
    308                         seen_meta[key] = _pull_field(obj, val[-1])
    309 
--> 310                 _recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1)
    311         else:
    312             for obj in data:

/usr/local/lib/python3.6/dist-packages/pandas/io/json/_normalize.py in _recursive_extract(data, path, seen_meta, level)
    332                             else:
    333                                 raise KeyError(
--> 334                                     "Try running with "
    335                                     "errors='ignore' as key "
    336                                     f"{e} is not always present"

KeyError: "Try running with errors='ignore' as key 'time' is not always present"

Implementing the behavior I expected is probably pretty tricky, not a simple fix. For now, it might be enough to document the current behavior and raise an error in the edge case I found (i.e. make sure that we don't accidentally grab the wrong metadata because different branches of the JSON share a key).
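
One possible shape for such a check, as a rough sketch only. It assumes my reading of the code is right, namely that the leading keys of a meta path are skipped once the recursion has descended into record_path, so a meta path is only safe when those skipped keys match record_path. `misresolved_meta` is a hypothetical name, and meta entries are assumed to be lists of keys.

```python
def misresolved_meta(record_path, meta):
    """Return meta paths whose skipped leading keys differ from record_path."""
    flagged = []
    for path in meta:
        # Number of leading keys assumed to be skipped before the lookup.
        skipped = min(len(path), len(record_path)) - 1
        if skipped > 0 and path[:skipped] != record_path[:skipped]:
            flagged.append(path)
    return flagged


record_path = ["results", "peaks"]
meta = [["injection", "time"], ["results", "time"], ["injection"]]

flagged = misresolved_meta(record_path, meta)
print(flagged)  # [['injection', 'time']]
```

A guard like this could raise instead of returning the list, so that a shared key in another branch is never picked up silently.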

@olegbonar

+1, faced this unexpected behavior today.

@tsteffek

Encountered this today. Interesting to see that this still exists in version 1.4.1.

By the way, this stems from the same problem as #40514 (one might call them duplicates): _recursive_extract for some reason assumes it's fine to ignore the first level of a meta path once it is already a level down in record_path.

@simonjayhawkins
Member

Interesting to see that this still exists in version 1.4.1.

@tsteffek this is an open issue, so not yet fixed.

pandas is a community project. Anyone is welcome to submit a patch or contribute in other ways such as further investigation or suggesting fixes as you have here. Thanks.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@leonokubo

The problem:
the recursive call in _recursive_extract (level + 1)

Solution:

from functools import reduce
import operator

# Resolve each meta path from the first top-level record (data[0]).
for obj in _meta:
    seen_meta[".".join(obj)] = reduce(operator.getitem, obj, data[0])
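
For anyone who wants to try the idea outside of pandas, here is a self-contained sketch of the same `reduce(operator.getitem, ...)` lookup. The names `record`, `meta` and `seen_meta` are stand-ins chosen for this example; inside pandas the surrounding variables may look different.

```python
from functools import reduce
import operator

record = {
    "injection": {"time": "injection time"},
    "results": [{"time": "result time", "peaks": []}],
}
meta = [["injection", "time"]]

seen_meta = {}
for path in meta:
    # reduce applies record[path[0]][path[1]]... starting from the root record.
    seen_meta[".".join(path)] = reduce(operator.getitem, path, record)

print(seen_meta)  # {'injection.time': 'injection time'}
```

Because the lookup always starts at the root record, a "time" key in another branch cannot shadow the requested one.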
