I don't think this behaviour is expected or documented.
I think the expected behaviour would be to produce a 'd.dd' column containing 'meta' strings, mostly because that would be the result if the record_path were not nested.
Having a shot at debugging the code in pandas/io/json/_normalize.py, I suspect I know where the problem is.
In the function _recursive_extract(), at the top, we have:
    def _recursive_extract(data, path, seen_meta, level=0):
        if isinstance(data, dict):
            data = [data]
        if len(path) > 1:
            for obj in data:
                for val, key in zip(_meta, meta_keys):
                    if level + 1 == len(val):
                        seen_meta[key] = _pull_field(obj, val[-1])

                # problematic line V--- object is recursed into, which cuts off remaining metadata that also needs to be recursed into
                _recursive_extract(obj[path[0]], path[1:], seen_meta, level=level + 1)
In the last line of the extract above, the function recurses into obj[path[0]], because the record_path (the path variable here) is nested.
As a result, the next call only sees the sub-object at the next level of the record_path. Everything else in the parent object is out of reach, including any metadata that still needs to be recursed into. The remaining code in _recursive_extract tries to do exactly that recursion into the remaining metadata, but it cannot find the metadata in the sub-object and so fails.
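To make the failure concrete, here is a minimal pure-Python model of that control flow. It is a sketch, not the pandas implementation: `pull_field` and `extract` are hypothetical names, the final-level branch is my reading of what "the remaining code" does, and the input is assumed to be a dict with the nested record list under 'c' -> 'cc', 'a' as top-level metadata and 'd' -> 'dd' as nested metadata.

```python
# Simplified model of the recursion quoted above (hypothetical helpers, not pandas code).

def pull_field(obj, spec):
    """Follow a single key or a list of keys into nested dicts."""
    keys = spec if isinstance(spec, list) else [spec]
    for key in keys:
        obj = obj[key]
    return obj


def extract(data, path, meta, seen_meta=None, level=0):
    """Build one row per record, mimicking the excerpt's control flow."""
    seen_meta = {} if seen_meta is None else seen_meta
    rows = []
    if isinstance(data, dict):
        data = [data]
    for obj in data:
        if len(path) > 1:
            # Only meta paths that end exactly at this level are captured here.
            for val in meta:
                if level + 1 == len(val):
                    seen_meta[".".join(val)] = pull_field(obj, val[-1])
            # Descend along the record path; siblings of path[0] are left behind.
            rows += extract(obj[path[0]], path[1:], meta, seen_meta, level + 1)
        else:
            for record in obj[path[0]]:
                row = {0: record}
                for val in meta:
                    key = ".".join(val)
                    if level + 1 > len(val):
                        row[key] = seen_meta[key]
                    else:
                        # Assumed behaviour: the rest of the meta path is looked up
                        # inside the current sub-object, not the root object.
                        row[key] = pull_field(obj, val[level:])
                rows.append(row)
    return rows


# Assumed input: nested record path ['c', 'cc'], metadata 'a' and ['d', 'dd'].
data = {"a": 1, "b": 2, "c": {"cc": [333, 444]}, "d": {"dd": "meta"}}

failed_key = None
try:
    extract(data, ["c", "cc"], [["a"], ["d", "dd"]])
except KeyError as err:
    failed_key = err.args[0]

print(failed_key)  # dd
```

In this model the lookup of 'dd' happens inside data['c'], which has no such key, so the call raises a KeyError even though data['d']['dd'] exists at the root.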
This can be more or less proven by an absurd example that doesn't fail.
Here, the meta path is partially (but not completely) duplicated within the record_path: its last key, 'dd', also appears inside the dict at 'c'.
    test_data = {'a': 1,
                 'b': 2,
                 'c': {'cc': [333, 444], 'dd': 'wrong_meta'},  # 'dd' is a duplication of the second level of the meta path
                 'd': {'dd': 'meta'}  # original meta_path
                 }
With this data, the code from above no longer fails, but it produces a different, still incorrect, output.
It produces the table below, where the 'dd' key within the dict at 'c' has been picked up as the metadata because it is the one that survives the recursion.
         0  a        d.dd
    0  333  1  wrong_meta
    1  444  1  wrong_meta
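The lookup behind that table can be followed by hand. The snippet below is a small sketch using plain dict access only; `visible`, `picked` and `intended` are illustrative names, and the assumption is that the unresolved tail of the meta path is looked up in whatever sub-object the recursion has reached.

```python
test_data = {
    "a": 1,
    "b": 2,
    "c": {"cc": [333, 444], "dd": "wrong_meta"},
    "d": {"dd": "meta"},
}
record_path = ["c", "cc"]
meta_path = ["d", "dd"]

# After one step down the record path, only this sub-dict is visible.
visible = test_data[record_path[0]]
level = 1

# The unresolved tail of the meta path is looked up in the visible sub-dict ...
picked = visible
for key in meta_path[level:]:
    picked = picked[key]

# ... whereas the intended value lives under the root object.
intended = test_data
for key in meta_path:
    intended = intended[key]

print(picked)    # wrong_meta
print(intended)  # meta
```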
Though I'm not on top of this code base, the solution could be to adjust _recursive_extract to take a separate metadata argument so that the metadata can be recursed into separately, or to otherwise separate the treatment of records and metadata at the appropriate node in the data tree.
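As a rough illustration of the second idea, the sketch below resolves metadata against each root object before walking the record path, so the two traversals never interfere. All names (`pull_field`, `collect_records`, `normalize`) are hypothetical, and this is not a patch for pandas: it deliberately ignores metadata attached to intermediate list levels, error handling, and prefixes.

```python
def pull_field(obj, keys):
    """Follow a list of keys into nested dicts."""
    for key in keys:
        obj = obj[key]
    return obj


def collect_records(node, path):
    """Yield every record reached by following `path`, fanning out over lists."""
    if isinstance(node, dict):
        node = [node]
    for obj in node:
        child = obj[path[0]]
        if len(path) == 1:
            yield from child
        else:
            yield from collect_records(child, path[1:])


def normalize(data, record_path, meta):
    """Resolve metadata from each root object, independently of the record path."""
    if isinstance(data, dict):
        data = [data]
    rows = []
    for root in data:
        # Metadata is read from the root, so descending into records cannot hide it.
        meta_values = {".".join(m): pull_field(root, m) for m in meta}
        for record in collect_records(root, record_path):
            rows.append({0: record, **meta_values})
    return rows


test_data = {
    "a": 1,
    "b": 2,
    "c": {"cc": [333, 444], "dd": "wrong_meta"},
    "d": {"dd": "meta"},
}
rows = normalize(test_data, ["c", "cc"], [["a"], ["d", "dd"]])
for row in rows:
    print(row)
```

With the absurd example as input, each row carries 'meta' under 'd.dd' rather than 'wrong_meta', which is the output asked for under Expected Output.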
Expected Output
To produce a 'd.dd' column containing 'meta' strings.
[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[x] (optional) I have confirmed this bug exists on the master branch of pandas.
See related #27220
Output of pd.show_versions()
INSTALLED VERSIONS
commit : f2c8480
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.186-146.268.amzn2.x86_64
Version : #1 SMP Tue Jul 14 18:16:52 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.3
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 49.2.0.post20200712
Cython : 0.29.21
pytest : 6.2.1
hypothesis : 6.0.2
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 0.17.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.48.0