Skip to content

json_normalize does not handle nested meta paths when also using a nested record_path #27220

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
connormcmk opened this issue Jul 3, 2019 · 8 comments · Fixed by #34090
Closed
Labels
good first issue IO JSON read_json, to_json, json_normalize Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@connormcmk
Copy link

Code Sample

from pandas.io.json import json_normalize

data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {
              'governor': 'Rick Scott'
         },
         'counties': [{'name': 'Dade', 'population': 12345},
                     {'name': 'Broward', 'population': 40000},
                     {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {
              'governor': 'John Kasich'
         },
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]

print(json_normalize(data, ['counties', 'name'], ['state', 'shortname', ['info', 'governor']], errors='ignore'))

Returns

    0    state shortname  info.governor
0   D  Florida        FL            NaN
1   a  Florida        FL            NaN
2   d  Florida        FL            NaN
3   e  Florida        FL            NaN
4   B  Florida        FL            NaN
5   r  Florida        FL            NaN
6   o  Florida        FL            NaN
7   w  Florida        FL            NaN
8   a  Florida        FL            NaN
9   r  Florida        FL            NaN
10  d  Florida        FL            NaN
11  P  Florida        FL            NaN
12  a  Florida        FL            NaN
13  l  Florida        FL            NaN
14  m  Florida        FL            NaN
15     Florida        FL            NaN
16  B  Florida        FL            NaN
17  e  Florida        FL            NaN
18  a  Florida        FL            NaN

Problem description

Running json_normalize on a nested record_path with a nested meta argument gives an error that says it cannot find info.governer.

This is inconsistent because running it again, without the nested meta argument, successfully returns the data:

# using the same data from before
print(json_normalize(data, ['counties', 'name'], ['state', 'shortname', 'info'], errors='ignore'))
    0    state shortname                         info
0   D  Florida        FL   {'governor': 'Rick Scott'}
1   a  Florida        FL   {'governor': 'Rick Scott'}
2   d  Florida        FL   {'governor': 'Rick Scott'}
3   e  Florida        FL   {'governor': 'Rick Scott'}
4   B  Florida        FL   {'governor': 'Rick Scott'}
5   r  Florida        FL   {'governor': 'Rick Scott'}
6   o  Florida        FL   {'governor': 'Rick Scott'}
7   w  Florida        FL   {'governor': 'Rick Scott'}

Similarly, using a non-nested record path also works (in fact, this is the exact sample example that can be found in the json_normalize pandas documentation).

# using the same data from before
print(json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']]))
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott
3      Summit        1234     Ohio        OH   John Kasich
4    Cuyahoga        1337     Ohio        OH   John Kasich

The result is that it is much more difficult to access nested meta data.

May be related to: #21537

Expected Output

    0    state shortname  info.governor
0   D  Florida        FL            Rick Scott
1   a  Florida        FL            Rick Scott
2   d  Florida        FL            Rick Scott
3   e  Florida        FL            Rick Scott
4   B  Florida        FL            Rick Scott
5   r  Florida        FL            Rick Scott
6   o  Florida        FL            Rick Scott
7   w  Florida        FL            Rick Scott
8   a  Florida        FL            Rick Scott
9   r  Florida        FL            Rick Scott
10  d  Florida        FL            Rick Scott
11  P  Florida        FL            Rick Scott
12  a  Florida        FL            Rick Scott
13  l  Florida        FL            Rick Scott
14  m  Florida        FL            Rick Scott
15     Florida        FL            Rick Scott
16  B  Florida        FL            Rick Scott
17  e  Florida        FL            Rick Scott
18  a  Florida        FL            Rick Scott

Note that ['counties', 'name'] is an arbitrary list of strings to use as a record path, and that this example is contrived (who really needs a table comprised of each letter of a string?). However, many real scenarios can be constructed that require this sort of nested record_path extraction along with nested meta path extraction.

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]
INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.125-linuxkit
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.15.4
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: 4.7.1
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label Jul 5, 2019
@WillAyd
Copy link
Member

WillAyd commented Jul 5, 2019

I think makes sense so as long as the nested meta path points to an object and not an array. Would welcome a PR

cc @bhavaniravi

@WillAyd WillAyd added this to the Contributions Welcome milestone Jul 5, 2019
@another-green
Copy link
Contributor

I am working with JSON a lot recently. I will give a try.

@LTe
Copy link
Contributor

LTe commented Apr 17, 2020

Test proposed by @yanglinlee https://github.com/pandas-dev/pandas/pull/27667/files#diff-490b1195dddd7f1523a30f865801ed17R291-R313

def test_nested_meta_path_with_nested_record_path(self, state_data):
    # GH 27220
    result = json_normalize(
        data=state_data,
        record_path=["counties", "name"],
        meta=["state", "shortname", ["info", "governor"]],
        errors="ignore",
    )

    ex_data = [
        [
            i
            for word in ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"]
            for i in word
        ],
        ["Florida"] * 21 + ["Ohio"] * 14,
        ["FL"] * 21 + ["OH"] * 14,
        ["Rick Scott"] * 21 + ["John Kasich"] * 14,
    ]
    expected = DataFrame(ex_data).T
    expected.columns = [0, "state", "shortname", "info.governor"]

    tm.assert_frame_equal(result, expected)

After some changes to work with new panda version https://github.com/LTe/pandas/tree/json_improve

diff --git a/doc/source/whatsnew/v1.1.0.rst b/doc/source/whatsnew/v1.1.0.rst
index 2a641a37b..af62031fa 100644
--- a/doc/source/whatsnew/v1.1.0.rst
+++ b/doc/source/whatsnew/v1.1.0.rst
@@ -576,6 +576,7 @@ I/O
 - Bug in :meth:`read_excel` did not correctly handle multiple embedded spaces in OpenDocument text cells. (:issue:`32207`)
 - Bug in :meth:`read_json` was raising ``TypeError`` when reading a list of booleans into a Series. (:issue:`31464`)
 - Bug in :func:`pandas.io.json.json_normalize` where location specified by `record_path` doesn't point to an array. (:issue:`26284`)
+- Bug in :meth:`pandas.io.json.json_normalize` when nested meta paths with a nested record path. (:issue:`27220`)

 Plotting
 ^^^^^^^^
diff --git a/pandas/tests/io/json/test_normalize.py b/pandas/tests/io/json/test_normalize.py
index 4a32f3809..6624b42b1 100644
--- a/pandas/tests/io/json/test_normalize.py
+++ b/pandas/tests/io/json/test_normalize.py
@@ -284,6 +284,27 @@ class TestJSONNormalize:
         expected = DataFrame(ex_data, columns=result.columns)
         tm.assert_frame_equal(result, expected)

+    def test_nested_meta_path_with_nested_record_path(self, state_data):
+        # GH 27220
+        result = json_normalize(
+            data=state_data,
+            record_path=["counties"],
+            meta=["state", "shortname", ["info", "governor"]],
+            errors="ignore",
+        )
+        ex_data = {
+            "name": ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"],
+            "population": [12345, 40000, 60000, 1234, 1337],
+            "state": ["Florida"] * 3 + ["Ohio"] * 2,
+            "shortname": ["FL"] * 3 + ["OH"] * 2,
+            "info.governor": ["Rick Scott"] * 3 + ["John Kasich"] * 2,
+        }
+        expected = DataFrame(
+            ex_data,
+            columns=["name", "population", "state", "shortname", "info.governor"],
+        )
+        tm.assert_frame_equal(result, expected)
+
     def test_meta_name_conflict(self):
         data = [
             {

is working just fine with current master. CI run: https://dev.azure.com/piotrnielacny/piotrnielacny/_build/results?buildId=9&view=results

Even there is almost the same test right now here:

def test_shallow_nested(self):
data = [
{
"state": "Florida",
"shortname": "FL",
"info": {"governor": "Rick Scott"},
"counties": [
{"name": "Dade", "population": 12345},
{"name": "Broward", "population": 40000},
{"name": "Palm Beach", "population": 60000},
],
},
{
"state": "Ohio",
"shortname": "OH",
"info": {"governor": "John Kasich"},
"counties": [
{"name": "Summit", "population": 1234},
{"name": "Cuyahoga", "population": 1337},
],
},
]
result = json_normalize(
data, "counties", ["state", "shortname", ["info", "governor"]]
)
ex_data = {
"name": ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"],
"state": ["Florida"] * 3 + ["Ohio"] * 2,
"shortname": ["FL", "FL", "FL", "OH", "OH"],
"info.governor": ["Rick Scott"] * 3 + ["John Kasich"] * 2,
"population": [12345, 40000, 60000, 1234, 1337],
}
expected = DataFrame(ex_data, columns=result.columns)
tm.assert_frame_equal(result, expected)

def test_shallow_nested(self):
    data = [
        {
            "state": "Florida",
            "shortname": "FL",
            "info": {"governor": "Rick Scott"},
            "counties": [
                {"name": "Dade", "population": 12345},
                {"name": "Broward", "population": 40000},
                {"name": "Palm Beach", "population": 60000},
            ],
        },
        {
            "state": "Ohio",
            "shortname": "OH",
            "info": {"governor": "John Kasich"},
            "counties": [
                {"name": "Summit", "population": 1234},
                {"name": "Cuyahoga", "population": 1337},
            ],
        },
    ]

    result = json_normalize(
        data, "counties", ["state", "shortname", ["info", "governor"]]
    )
    ex_data = {
        "name": ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"],
        "state": ["Florida"] * 3 + ["Ohio"] * 2,
        "shortname": ["FL", "FL", "FL", "OH", "OH"],
        "info.governor": ["Rick Scott"] * 3 + ["John Kasich"] * 2,
        "population": [12345, 40000, 60000, 1234, 1337],
    }
    expected = DataFrame(ex_data, columns=result.columns)
    tm.assert_frame_equal(result, expected)

I think this issue is fixed in master and we can close it.

@LTe
Copy link
Contributor

LTe commented Apr 20, 2020

cc @WillAyd @jreback

@mroeschke
Copy link
Member

Happy to add that test to the test suite as a regression test

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed IO JSON read_json, to_json, json_normalize labels May 8, 2020
@MarcoGorelli
Copy link
Member

I think this issue is fixed in master and we can close it.

I'm probably missing something (if so, no need to reply, sorry for the disruption) but isn't the original issue still not working? If I run

from pandas import json_normalize

data = [
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott"},
        "counties": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 40000},
            {"name": "Palm Beach", "population": 60000},
        ],
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich"},
        "counties": [
            {"name": "Summit", "population": 1234},
            {"name": "Cuyahoga", "population": 1337},
        ],
    },
]

print(
    json_normalize(
        data,
        ["counties", "name"],
        ["state", "shortname", ["info", "governor"]],
        errors="ignore",
    )
)

on master then I get

TypeError: {'name': 'Dade', 'population': 12345} has non list value Dade for path name. Must be list or null.

It seems to me that the proposed regression test is for the case

print(json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']]))

which, according to the OP, was already working

@LTe
Copy link
Contributor

LTe commented May 9, 2020

@MarcoGorelli TypeError is expected, because string is not treated as a list. But I think regression test is already there:

deep_nested, ["states", "cities"], meta=["country", ["states", "name"]]

@jreback jreback modified the milestones: Contributions Welcome, 1.1 May 9, 2020
@jreback jreback added the IO JSON read_json, to_json, json_normalize label May 9, 2020
@seidnerj
Copy link

seidnerj commented Sep 16, 2024

Hey all,

Not sure if this is a regression or a different issue, but I'm seeing this issue with a slight variation. The following works fine:

data = [
        {
            "state": "Florida",
            "shortname": "FL",
            "info": {"governor": 
                    "Rick Scott"},
            "counties": {
                "nodes": [
                    {"name": "Dade", "population": 12345},
                    {"name": "Broward", "population": 40000},
                    {"name": "Palm Beach", "population": 60000},
                ],
                "__typename": "County"
            }
        }
    ]

pd.json_normalize(data, record_path=["counties", "nodes"], meta=["state", "shortname", ["info"]])
         name  population    state shortname                        info
0        Dade       12345  Florida        FL  {'governor': 'Rick Scott'}
1     Broward       40000  Florida        FL  {'governor': 'Rick Scott'}
2  Palm Beach       60000  Florida        FL  {'governor': 'Rick Scott'}

However, the following does not:

pd.json_normalize(data, record_path=["counties", "nodes"], meta=["state", "shortname", ["info", "governor"]])
Traceback (most recent call last):
  File "/Users/user/Applications/PyCharm.app/Contents/plugins/python/helpers-pro/pydevd_asyncio/pydevd_asyncio_utils.py", line 117, in _exec_async_code
    result = func()
             ^^^^^^
  File "<input>", line 1, in <module>
  File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 517, in json_normalize
    _recursive_extract(data, record_path, {}, level=0)
  File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 496, in _recursive_extract
    _recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1)
  File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 513, in _recursive_extract
    meta_val = _pull_field(obj, val[level:])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Documents/Code/.venv/lib/python3.11/site-packages/pandas/io/json/_normalize.py", line 408, in _pull_field
    raise KeyError(
KeyError: "Key 'governor' not found. To replace missing values of 'governor' with np.nan, pass in errors='ignore'"

Dropping the "nodes" key again gets this working:

data = [
        {
            "state": "Florida",
            "shortname": "FL",
            "info": {"governor": 
                    "Rick Scott"},
            "counties": [
                    {"name": "Dade", "population": 12345},
                    {"name": "Broward", "population": 40000},
                    {"name": "Palm Beach", "population": 60000}
                ]
        }
    ]

pd.json_normalize(data, record_path=["counties"], meta=["state", "shortname", ["info", "governor"]])
         name  population    state shortname info.governor
0        Dade       12345  Florida        FL    Rick Scott
1     Broward       40000  Florida        FL    Rick Scott
2  Palm Beach       60000  Florida        FL    Rick Scott

These are the versions I'm using (dropped all "None" for brevity)

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit                : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python                : 3.11.8.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 24.0.0
Version               : Darwin Kernel Version 24.0.0: Mon Aug 12 20:51:54 PDT 2024; root:xnu-11215.1.10~2/RELEASE_ARM64_T6000
machine               : arm64
processor             : arm
byteorder             : little
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8
pandas                : 2.2.1
numpy                 : 1.26.4
pytz                  : 2024.2
dateutil              : 2.9.0.post0
setuptools            : 74.1.2
pip                   : 24.2
html5lib              : 1.1
jinja2                : 3.1.4
tzdata                : 2024.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue IO JSON read_json, to_json, json_normalize Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
8 participants