json_normalize does not handle nested meta paths when also using a nested record_path #27220

connormcmk · 2019-07-03T22:03:41Z

Code Sample

from pandas.io.json import json_normalize

data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {
              'governor': 'Rick Scott'
         },
         'counties': [{'name': 'Dade', 'population': 12345},
                     {'name': 'Broward', 'population': 40000},
                     {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {
              'governor': 'John Kasich'
         },
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]

print(json_normalize(data, ['counties', 'name'], ['state', 'shortname', ['info', 'governor']], errors='ignore'))

Returns

    0    state shortname  info.governor
0   D  Florida        FL            NaN
1   a  Florida        FL            NaN
2   d  Florida        FL            NaN
3   e  Florida        FL            NaN
4   B  Florida        FL            NaN
5   r  Florida        FL            NaN
6   o  Florida        FL            NaN
7   w  Florida        FL            NaN
8   a  Florida        FL            NaN
9   r  Florida        FL            NaN
10  d  Florida        FL            NaN
11  P  Florida        FL            NaN
12  a  Florida        FL            NaN
13  l  Florida        FL            NaN
14  m  Florida        FL            NaN
15     Florida        FL            NaN
16  B  Florida        FL            NaN
17  e  Florida        FL            NaN
18  a  Florida        FL            NaN

Problem description

Running json_normalize on a nested record_path with a nested meta argument gives an error that says it cannot find info.governer.

This is inconsistent because running it again, without the nested meta argument, successfully returns the data:

# using the same data from before
print(json_normalize(data, ['counties', 'name'], ['state', 'shortname', 'info'], errors='ignore'))

    0    state shortname                         info
0   D  Florida        FL   {'governor': 'Rick Scott'}
1   a  Florida        FL   {'governor': 'Rick Scott'}
2   d  Florida        FL   {'governor': 'Rick Scott'}
3   e  Florida        FL   {'governor': 'Rick Scott'}
4   B  Florida        FL   {'governor': 'Rick Scott'}
5   r  Florida        FL   {'governor': 'Rick Scott'}
6   o  Florida        FL   {'governor': 'Rick Scott'}
7   w  Florida        FL   {'governor': 'Rick Scott'}

Similarly, using a non-nested record path also works (in fact, this is the exact sample example that can be found in the json_normalize pandas documentation).

# using the same data from before
print(json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']]))

         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

The result is that it is much more difficult to access nested meta data.

May be related to: #21537

Expected Output

    0    state shortname  info.governor
0   D  Florida        FL            Rick Scott
1   a  Florida        FL            Rick Scott
2   d  Florida        FL            Rick Scott
3   e  Florida        FL            Rick Scott
4   B  Florida        FL            Rick Scott
5   r  Florida        FL            Rick Scott
6   o  Florida        FL            Rick Scott
7   w  Florida        FL            Rick Scott
8   a  Florida        FL            Rick Scott
9   r  Florida        FL            Rick Scott
10  d  Florida        FL            Rick Scott
11  P  Florida        FL            Rick Scott
12  a  Florida        FL            Rick Scott
13  l  Florida        FL            Rick Scott
14  m  Florida        FL            Rick Scott
15     Florida        FL            Rick Scott
16  B  Florida        FL            Rick Scott
17  e  Florida        FL            Rick Scott
18  a  Florida        FL            Rick Scott

Note that ['counties', 'name'] is an arbitrary list of strings to use as a record path, and that this example is contrived (who really needs a table comprised of each letter of a string?). However, many real scenarios can be constructed that require this sort of nested record_path extraction along with nested meta path extraction.

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.125-linuxkit
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.15.4
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: 4.7.1
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

The text was updated successfully, but these errors were encountered:

WillAyd · 2019-07-05T16:54:11Z

I think makes sense so as long as the nested meta path points to an object and not an array. Would welcome a PR

cc @bhavaniravi

another-green · 2019-07-27T20:31:30Z

I am working with JSON a lot recently. I will give a try.

LTe · 2020-04-17T17:40:38Z

Test proposed by @yanglinlee https://github.com/pandas-dev/pandas/pull/27667/files#diff-490b1195dddd7f1523a30f865801ed17R291-R313

def test_nested_meta_path_with_nested_record_path(self, state_data):
    # GH 27220
    result = json_normalize(
        data=state_data,
        record_path=["counties", "name"],
        meta=["state", "shortname", ["info", "governor"]],
        errors="ignore",
    )

    ex_data = [
        [
            i
            for word in ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"]
            for i in word
        ],
        ["Florida"] * 21 + ["Ohio"] * 14,
        ["FL"] * 21 + ["OH"] * 14,
        ["Rick Scott"] * 21 + ["John Kasich"] * 14,
    ]
    expected = DataFrame(ex_data).T
    expected.columns = [0, "state", "shortname", "info.governor"]

    tm.assert_frame_equal(result, expected)

After some changes to work with new panda version https://github.com/LTe/pandas/tree/json_improve

diff --git a/doc/source/whatsnew/v1.1.0.rst b/doc/source/whatsnew/v1.1.0.rst
index 2a641a37b..af62031fa 100644
--- a/doc/source/whatsnew/v1.1.0.rst
+++ b/doc/source/whatsnew/v1.1.0.rst
@@ -576,6 +576,7 @@ I/O
 - Bug in :meth:`read_excel` did not correctly handle multiple embedded spaces in OpenDocument text cells. (:issue:`32207`)
 - Bug in :meth:`read_json` was raising ``TypeError`` when reading a list of booleans into a Series. (:issue:`31464`)
 - Bug in :func:`pandas.io.json.json_normalize` where location specified by `record_path` doesn't point to an array. (:issue:`26284`)
+- Bug in :meth:`pandas.io.json.json_normalize` when nested meta paths with a nested record path. (:issue:`27220`)

 Plotting
 ^^^^^^^^
diff --git a/pandas/tests/io/json/test_normalize.py b/pandas/tests/io/json/test_normalize.py
index 4a32f3809..6624b42b1 100644
--- a/pandas/tests/io/json/test_normalize.py
+++ b/pandas/tests/io/json/test_normalize.py
@@ -284,6 +284,27 @@ class TestJSONNormalize:
         expected = DataFrame(ex_data, columns=result.columns)
         tm.assert_frame_equal(result, expected)

+    def test_nested_meta_path_with_nested_record_path(self, state_data):
+        # GH 27220
+        result = json_normalize(
+            data=state_data,
+            record_path=["counties"],
+            meta=["state", "shortname", ["info", "governor"]],
+            errors="ignore",
+        )
+        ex_data = {
+            "name": ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"],
+            "population": [12345, 40000, 60000, 1234, 1337],
+            "state": ["Florida"] * 3 + ["Ohio"] * 2,
+            "shortname": ["FL"] * 3 + ["OH"] * 2,
+            "info.governor": ["Rick Scott"] * 3 + ["John Kasich"] * 2,
+        }
+        expected = DataFrame(
+            ex_data,
+            columns=["name", "population", "state", "shortname", "info.governor"],
+        )
+        tm.assert_frame_equal(result, expected)
+
     def test_meta_name_conflict(self):
         data = [
             {

is working just fine with current master. CI run: https://dev.azure.com/piotrnielacny/piotrnielacny/_build/results?buildId=9&view=results

Even there is almost the same test right now here:

pandas/pandas/tests/io/json/test_normalize.py

Lines 251 to 285 in cd52502

    
           def test_shallow_nested(self): 
        
               data = [ 
        
                   { 
        
                       "state": "Florida", 
        
                       "shortname": "FL", 
        
                       "info": {"governor": "Rick Scott"}, 
        
                       "counties": [ 
        
                           {"name": "Dade", "population": 12345}, 
        
                           {"name": "Broward", "population": 40000}, 
        
                           {"name": "Palm Beach", "population": 60000}, 
        
                       ], 
        
                   }, 
        
                   { 
        
                       "state": "Ohio", 
        
                       "shortname": "OH", 
        
                       "info": {"governor": "John Kasich"}, 
        
                       "counties": [ 
        
                           {"name": "Summit", "population": 1234}, 
        
                           {"name": "Cuyahoga", "population": 1337}, 
        
                       ], 
        
                   }, 
        
               ] 
        
               result = json_normalize( 
        
                   data, "counties", ["state", "shortname", ["info", "governor"]] 
        
               ) 
        
               ex_data = { 
        
                   "name": ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"], 
        
                   "state": ["Florida"] * 3 + ["Ohio"] * 2, 
        
                   "shortname": ["FL", "FL", "FL", "OH", "OH"], 
        
                   "info.governor": ["Rick Scott"] * 3 + ["John Kasich"] * 2, 
        
                   "population": [12345, 40000, 60000, 1234, 1337], 
        
               } 
        
               expected = DataFrame(ex_data, columns=result.columns) 
        
               tm.assert_frame_equal(result, expected)

def test_shallow_nested(self):
    data = [
        {
            "state": "Florida",
            "shortname": "FL",
            "info": {"governor": "Rick Scott"},
            "counties": [
                {"name": "Dade", "population": 12345},
                {"name": "Broward", "population": 40000},
                {"name": "Palm Beach", "population": 60000},
            ],
        },
        {
            "state": "Ohio",
            "shortname": "OH",
            "info": {"governor": "John Kasich"},
            "counties": [
                {"name": "Summit", "population": 1234},
                {"name": "Cuyahoga", "population": 1337},
            ],
        },
    ]

    result = json_normalize(
        data, "counties", ["state", "shortname", ["info", "governor"]]
    )
    ex_data = {
        "name": ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"],
        "state": ["Florida"] * 3 + ["Ohio"] * 2,
        "shortname": ["FL", "FL", "FL", "OH", "OH"],
        "info.governor": ["Rick Scott"] * 3 + ["John Kasich"] * 2,
        "population": [12345, 40000, 60000, 1234, 1337],
    }
    expected = DataFrame(ex_data, columns=result.columns)
    tm.assert_frame_equal(result, expected)

I think this issue is fixed in master and we can close it.

LTe · 2020-04-20T19:36:48Z

cc @WillAyd @jreback

mroeschke · 2020-05-08T21:08:16Z

Happy to add that test to the test suite as a regression test

MarcoGorelli · 2020-05-09T13:14:47Z

I think this issue is fixed in master and we can close it.

I'm probably missing something (if so, no need to reply, sorry for the disruption) but isn't the original issue still not working? If I run

from pandas import json_normalize

data = [
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott"},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 40000},
            {"name": "Palm Beach", "population": 60000},
        ],
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich"},
        "counties": [
            {"name": "Summit", "population": 1234},
            {"name": "Cuyahoga", "population": 1337},
        ],
    },
]

print(
    json_normalize(
        data,
        ["counties", "name"],
        ["state", "shortname", ["info", "governor"]],
        errors="ignore",
    )
)

on master then I get

TypeError: {'name': 'Dade', 'population': 12345} has non list value Dade for path name. Must be list or null.

It seems to me that the proposed regression test is for the case

print(json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']]))

which, according to the OP, was already working

LTe · 2020-05-09T14:56:06Z

@MarcoGorelli TypeError is expected, because string is not treated as a list. But I think regression test is already there:

pandas/pandas/tests/io/json/test_normalize.py

Line 223 in cd52502

deep_nested, ["states", "cities"], meta=["country", ["states", "name"]]

seidnerj · 2024-09-16T18:25:07Z

Hey all,

Not sure if this is a regression or a different issue, but I'm seeing this issue with a slight variation. The following works fine:

data = [
        {
            "state": "Florida",
            "shortname": "FL",
            "info": {"governor": 
                    "Rick Scott"},
            "counties": {
                "nodes": [
                    {"name": "Dade", "population": 12345},
                    {"name": "Broward", "population": 40000},
                    {"name": "Palm Beach", "population": 60000},
                ],
                "__typename": "County"
            }
        }
    ]

pd.json_normalize(data, record_path=["counties", "nodes"], meta=["state", "shortname", ["info"]])
         name  population    state shortname                        info
0        Dade       12345  Florida        FL  {'governor': 'Rick Scott'}
1     Broward       40000  Florida        FL  {'governor': 'Rick Scott'}
2  Palm Beach       60000  Florida        FL  {'governor': 'Rick Scott'}

However, the following does not:

pd.json_normalize(data, record_path=["counties", "nodes"], meta=["state", "shortname", ["info", "governor"]])
Traceback (most recent call last):
  File "/Users/user/Applications/PyCharm.app/Contents/plugins/python/helpers-pro/pydevd_asyncio/pydevd_asyncio_utils.py", line 117, in _exec_async_code
    result = func()
             ^^^^^^
  File "<input>", line 1, in <module>
  File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 517, in json_normalize
    _recursive_extract(data, record_path, {}, level=0)
  File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 496, in _recursive_extract
    _recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1)
  File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 513, in _recursive_extract
    meta_val = _pull_field(obj, val[level:])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 408, in _pull_field
    raise KeyError(
KeyError: "Key 'governor' not found. To replace missing values of 'governor' with np.nan, pass in errors='ignore'"

Dropping the "nodes" key again gets this working:

data = [
        {
            "state": "Florida",
            "shortname": "FL",
            "info": {"governor": 
                    "Rick Scott"},
            "counties": [
                    {"name": "Dade", "population": 12345},
                    {"name": "Broward", "population": 40000},
                    {"name": "Palm Beach", "population": 60000}
                ]
        }
    ]

pd.json_normalize(data, record_path=["counties"], meta=["state", "shortname", ["info", "governor"]])
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott

These are the versions I'm using (dropped all "None" for brevity)

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit                : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python                : 3.11.8.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 24.0.0
Version               : Darwin Kernel Version 24.0.0: Mon Aug 12 20:51:54 PDT 2024; root:xnu-11215.1.10~2/RELEASE_ARM64_T6000
machine               : arm64
processor             : arm
byteorder             : little
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8
pandas                : 2.2.1
numpy                 : 1.26.4
pytz                  : 2024.2
dateutil              : 2.9.0.post0
setuptools            : 74.1.2
pip                   : 24.2
html5lib              : 1.1
jinja2                : 3.1.4
tzdata                : 2024.1

WillAyd added the IO JSON read_json, to_json, json_normalize label Jul 5, 2019

WillAyd added this to the Contributions Welcome milestone Jul 5, 2019

another-green mentioned this issue Jul 31, 2019

BUG: fix nested meta path bug (GH 27220) #27667

Closed

5 tasks

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed IO JSON read_json, to_json, json_normalize labels May 8, 2020

LTe mentioned this issue May 9, 2020

Add regression test for nested meta path in json_normalize #34090

Merged

5 tasks

jreback modified the milestones: Contributions Welcome, 1.1 May 9, 2020

jreback added the IO JSON read_json, to_json, json_normalize label May 9, 2020

jreback closed this as completed in #34090 May 9, 2020

maegul mentioned this issue Mar 19, 2021

BUG: json_normalize does not handle combining nested record_path and nested meta as expected #40514

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

json_normalize does not handle nested meta paths when also using a nested record_path #27220

json_normalize does not handle nested meta paths when also using a nested record_path #27220

connormcmk commented Jul 3, 2019

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

WillAyd commented Jul 5, 2019

another-green commented Jul 27, 2019

LTe commented Apr 17, 2020

LTe commented Apr 20, 2020

mroeschke commented May 8, 2020

MarcoGorelli commented May 9, 2020

LTe commented May 9, 2020 •

edited

Loading

seidnerj commented Sep 16, 2024 •

edited

Loading

json_normalize does not handle nested meta paths when also using a nested record_path #27220

json_normalize does not handle nested meta paths when also using a nested record_path #27220

Comments

connormcmk commented Jul 3, 2019

Code Sample

Problem description

Expected Output

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line] INSTALLED VERSIONS

WillAyd commented Jul 5, 2019

another-green commented Jul 27, 2019

LTe commented Apr 17, 2020

LTe commented Apr 20, 2020

mroeschke commented May 8, 2020

MarcoGorelli commented May 9, 2020

LTe commented May 9, 2020 • edited Loading

seidnerj commented Sep 16, 2024 • edited Loading

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

LTe commented May 9, 2020 •

edited

Loading

seidnerj commented Sep 16, 2024 •

edited

Loading