-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
json_normalize does not handle nested meta paths when also using a nested record_path #27220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I think makes sense so as long as the nested meta path points to an object and not an array. Would welcome a PR cc @bhavaniravi |
I am working with JSON a lot recently. I will give a try. |
Test proposed by @yanglinlee https://github.com/pandas-dev/pandas/pull/27667/files#diff-490b1195dddd7f1523a30f865801ed17R291-R313 def test_nested_meta_path_with_nested_record_path(self, state_data):
# GH 27220
result = json_normalize(
data=state_data,
record_path=["counties", "name"],
meta=["state", "shortname", ["info", "governor"]],
errors="ignore",
)
ex_data = [
[
i
for word in ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"]
for i in word
],
["Florida"] * 21 + ["Ohio"] * 14,
["FL"] * 21 + ["OH"] * 14,
["Rick Scott"] * 21 + ["John Kasich"] * 14,
]
expected = DataFrame(ex_data).T
expected.columns = [0, "state", "shortname", "info.governor"]
tm.assert_frame_equal(result, expected) After some changes to work with new panda version https://github.com/LTe/pandas/tree/json_improve diff --git a/doc/source/whatsnew/v1.1.0.rst b/doc/source/whatsnew/v1.1.0.rst
index 2a641a37b..af62031fa 100644
--- a/doc/source/whatsnew/v1.1.0.rst
+++ b/doc/source/whatsnew/v1.1.0.rst
@@ -576,6 +576,7 @@ I/O
- Bug in :meth:`read_excel` did not correctly handle multiple embedded spaces in OpenDocument text cells. (:issue:`32207`)
- Bug in :meth:`read_json` was raising ``TypeError`` when reading a list of booleans into a Series. (:issue:`31464`)
- Bug in :func:`pandas.io.json.json_normalize` where location specified by `record_path` doesn't point to an array. (:issue:`26284`)
+- Bug in :meth:`pandas.io.json.json_normalize` when nested meta paths with a nested record path. (:issue:`27220`)
Plotting
^^^^^^^^
diff --git a/pandas/tests/io/json/test_normalize.py b/pandas/tests/io/json/test_normalize.py
index 4a32f3809..6624b42b1 100644
--- a/pandas/tests/io/json/test_normalize.py
+++ b/pandas/tests/io/json/test_normalize.py
@@ -284,6 +284,27 @@ class TestJSONNormalize:
expected = DataFrame(ex_data, columns=result.columns)
tm.assert_frame_equal(result, expected)
+ def test_nested_meta_path_with_nested_record_path(self, state_data):
+ # GH 27220
+ result = json_normalize(
+ data=state_data,
+ record_path=["counties"],
+ meta=["state", "shortname", ["info", "governor"]],
+ errors="ignore",
+ )
+ ex_data = {
+ "name": ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"],
+ "population": [12345, 40000, 60000, 1234, 1337],
+ "state": ["Florida"] * 3 + ["Ohio"] * 2,
+ "shortname": ["FL"] * 3 + ["OH"] * 2,
+ "info.governor": ["Rick Scott"] * 3 + ["John Kasich"] * 2,
+ }
+ expected = DataFrame(
+ ex_data,
+ columns=["name", "population", "state", "shortname", "info.governor"],
+ )
+ tm.assert_frame_equal(result, expected)
+
def test_meta_name_conflict(self):
data = [
{ is working just fine with current master. CI run: https://dev.azure.com/piotrnielacny/piotrnielacny/_build/results?buildId=9&view=results Even there is almost the same test right now here: pandas/pandas/tests/io/json/test_normalize.py Lines 251 to 285 in cd52502
def test_shallow_nested(self):
data = [
{
"state": "Florida",
"shortname": "FL",
"info": {"governor": "Rick Scott"},
"counties": [
{"name": "Dade", "population": 12345},
{"name": "Broward", "population": 40000},
{"name": "Palm Beach", "population": 60000},
],
},
{
"state": "Ohio",
"shortname": "OH",
"info": {"governor": "John Kasich"},
"counties": [
{"name": "Summit", "population": 1234},
{"name": "Cuyahoga", "population": 1337},
],
},
]
result = json_normalize(
data, "counties", ["state", "shortname", ["info", "governor"]]
)
ex_data = {
"name": ["Dade", "Broward", "Palm Beach", "Summit", "Cuyahoga"],
"state": ["Florida"] * 3 + ["Ohio"] * 2,
"shortname": ["FL", "FL", "FL", "OH", "OH"],
"info.governor": ["Rick Scott"] * 3 + ["John Kasich"] * 2,
"population": [12345, 40000, 60000, 1234, 1337],
}
expected = DataFrame(ex_data, columns=result.columns)
tm.assert_frame_equal(result, expected) I think this issue is fixed in master and we can close it. |
Happy to add that test to the test suite as a regression test |
I'm probably missing something (if so, no need to reply, sorry for the disruption) but isn't the original issue still not working? If I run from pandas import json_normalize
data = [
{
"state": "Florida",
"shortname": "FL",
"info": {"governor": "Rick Scott"},
"counties": [
{"name": "Dade", "population": 12345},
{"name": "Broward", "population": 40000},
{"name": "Palm Beach", "population": 60000},
],
},
{
"state": "Ohio",
"shortname": "OH",
"info": {"governor": "John Kasich"},
"counties": [
{"name": "Summit", "population": 1234},
{"name": "Cuyahoga", "population": 1337},
],
},
]
print(
json_normalize(
data,
["counties", "name"],
["state", "shortname", ["info", "governor"]],
errors="ignore",
)
) on master then I get TypeError: {'name': 'Dade', 'population': 12345} has non list value Dade for path name. Must be list or null. It seems to me that the proposed regression test is for the case print(json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])) which, according to the OP, was already working |
@MarcoGorelli pandas/pandas/tests/io/json/test_normalize.py Line 223 in cd52502
|
Hey all, Not sure if this is a regression or a different issue, but I'm seeing this issue with a slight variation. The following works fine:
However, the following does not:
Dropping the "nodes" key again gets this working:
These are the versions I'm using (dropped all "None" for brevity)
|
Code Sample
Returns
Problem description
Running json_normalize on a nested record_path with a nested meta argument gives an error that says it cannot find info.governer.
This is inconsistent because running it again, without the nested meta argument, successfully returns the data:
Similarly, using a non-nested record path also works (in fact, this is the exact sample example that can be found in the json_normalize pandas documentation).
The result is that it is much more difficult to access nested meta data.
May be related to: #21537
Expected Output
Note that
['counties', 'name']
is an arbitrary list of strings to use as a record path, and that this example is contrived (who really needs a table comprised of each letter of a string?). However, many real scenarios can be constructed that require this sort of nested record_path extraction along with nested meta path extraction.Output of
pd.show_versions()
[paste the output of
pd.show_versions()
here below this line]INSTALLED VERSIONS
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.125-linuxkit
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: None
pip: 19.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.15.4
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: 4.7.1
html5lib: None
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
The text was updated successfully, but these errors were encountered: