
Enhanced json normalize #23861


Closed
Changes from all commits
Commits (38)
cb53be7
ENH add max_level and ignore_keys configuration to nested_to_records
bhavaniravi Nov 22, 2018
0972746
ENH extend max_level and ignore keys to
bhavaniravi Nov 22, 2018
5a5c708
fix pep8 issues
bhavaniravi Nov 22, 2018
be7ec0e
add whatsnew to doc string
bhavaniravi Nov 22, 2018
a79e126
add testcase with large max_level
bhavaniravi Nov 23, 2018
cd12a23
add explation for flatten if condition
bhavaniravi Nov 23, 2018
d3b3503
update doc_string and built documentation
bhavaniravi Nov 23, 2018
4ec60bc
fix json normalize records path issue
bhavaniravi Nov 27, 2018
e001264
Merge branch 'master' into enhanced_json_normalize
bhavaniravi Nov 27, 2018
5c88339
Merge branch 'master' of git://github.com/pandas-dev/pandas into json…
bhavaniravi Dec 30, 2018
55f7b1c
fix merge conflict
bhavaniravi Jan 3, 2019
1af2bfc
fix testcase error
bhavaniravi Jan 3, 2019
882a2ca
add nested flattening to json_normalize
bhavaniravi Jan 3, 2019
caba6db
fixed pep8 issues
bhavaniravi Jan 3, 2019
4e22c69
fix merge conflict
bhavaniravi Jan 3, 2019
c2eff85
fix issues with doc string
bhavaniravi Jan 4, 2019
247124f
modify test case to paramaetized
bhavaniravi Jan 4, 2019
ab15869
fix issues with pep8
bhavaniravi Jan 10, 2019
26bf967
fix pep8 build fail
bhavaniravi Jan 16, 2019
fca2a27
fix testcase failure, inconsistent column order
bhavaniravi Feb 5, 2019
7a58456
fix documentation issues
bhavaniravi Mar 19, 2019
f3d25e3
fix merge conflicts with upstream
bhavaniravi Mar 19, 2019
7a1297d
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Apr 20, 2019
177c750
fix testcase failure np.nan converted into str on line 328
bhavaniravi Apr 20, 2019
cb82bca
remove get_pip file
bhavaniravi Apr 20, 2019
2a7b966
rename test func test_max_level_with_record_prefix
bhavaniravi Apr 20, 2019
4635591
fix pep8 over-intended line
bhavaniravi Apr 21, 2019
22fd84e
fix docstring formatting issues
bhavaniravi Apr 21, 2019
2e407e3
convert to a fixture
bhavaniravi Apr 21, 2019
cf27cae
convert to inline data
bhavaniravi Apr 21, 2019
124fbd9
fix docstring formatting issues
bhavaniravi Apr 21, 2019
7b65999
fix docstring formatting issues
bhavaniravi Apr 21, 2019
03d3d23
add github issue id to test case
bhavaniravi Apr 22, 2019
8e61a04
fix pep8 flake issues
bhavaniravi Apr 22, 2019
b808d5a
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Apr 22, 2019
0eaea30
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Apr 23, 2019
837ba18
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Apr 26, 2019
217d4ae
Merge branch 'master' of git://github.com/pandas-dev/pandas into enha…
bhavaniravi Apr 30, 2019
122 changes: 86 additions & 36 deletions pandas/io/json/normalize.py
@@ -25,9 +25,11 @@ def _convert_to_line_delimits(s):
return convert_json_to_lines(s)


def nested_to_record(ds, prefix="", sep=".", level=0):
def nested_to_record(ds, prefix="", sep=".", level=0,
max_level=None, ignore_keys=None):
Contributor: can u type these parameters

"""
A simplified json_normalize.

Member: Can you revert the change to this line?
    A simplified json_normalize

Converts a nested dict into a flat dict ("record"), unlike json_normalize,
it does not attempt to extract a subset of the data.
@@ -36,13 +38,24 @@ def nested_to_record(ds, prefix="", sep=".", level=0):
----------
ds : dict or list of dicts
prefix: the prefix, optional, default: ""
sep : string, default '.'
sep : str, default '.'
Nested records will generate names separated by sep,
e.g., for sep='.', { 'foo' : { 'bar' : 0 } } -> foo.bar

.. versionadded:: 0.20.0

level: the number of levels in the jason string, optional, default: 0
level: int, optional, default: 0
The number of levels in the json string.

max_level: int, optional, default: None
The max depth to normalize.

.. versionadded:: 0.25.0

ignore_keys: list, optional, default None
keys to ignore

.. versionadded:: 0.25.0
Contributor: this is not lined up


Returns
-------
@@ -65,10 +78,9 @@ def nested_to_record(ds, prefix="", sep=".", level=0):
if isinstance(ds, dict):
ds = [ds]
singleton = True

ignore_keys = ignore_keys if ignore_keys else []
Contributor: is this getting mutated?

new_ds = []
for d in ds:

new_d = copy.deepcopy(d)
for k, v in d.items():
# each key gets renamed with prefix
@@ -79,16 +91,21 @@ def nested_to_record(ds, prefix="", sep=".", level=0):
else:
newkey = prefix + sep + k

# only dicts gets recurse-flattend
# flatten if type is dict and
# current dict level < maximum level provided and
# current dict key not in ignore keys list flatten it
# only at level>1 do we rename the rest of the keys
if not isinstance(v, dict):
if (not isinstance(v, dict) or
(max_level is not None and level >= max_level) or
(k in ignore_keys)):
if level != 0: # so we skip copying for top level, common case
v = new_d.pop(k)
new_d[newkey] = v
continue
else:
v = new_d.pop(k)
new_d.update(nested_to_record(v, newkey, sep, level + 1))
new_d.update(nested_to_record(v, newkey, sep, level + 1,
max_level, ignore_keys))
new_ds.append(new_d)

if singleton:
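
For readers skimming the diff, a minimal sketch of how the new flattening condition above is meant to behave (illustrative data; key order in the result may vary; assumes the module path used in this PR):

    from pandas.io.json.normalize import nested_to_record

    data = {'a': {'b': {'c': 1}}, 'd': {'e': 2}}

    # 'a' is flattened one level and then stops (max_level=1), so the inner
    # {'c': 1} dict survives under the compound key 'a.b'; 'd' is skipped
    # entirely because it is listed in ignore_keys.
    nested_to_record(data, max_level=1, ignore_keys=['d'])
    # -> {'d': {'e': 2}, 'a.b': {'c': 1}}
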
@@ -100,41 +117,57 @@ def json_normalize(data, record_path=None, meta=None,
meta_prefix=None,
record_prefix=None,
errors='raise',
sep='.'):
sep='.',
max_level=None,
ignore_keys=None):
"""
Normalize semi-structured JSON data into a flat table.

Parameters
----------
data : dict or list of dicts
Unserialized JSON objects
record_path : string or list of strings, default None
Unserialized JSON objects.
record_path : str or list of str, default None
Path in each object to list of records. If not passed, data will be
assumed to be an array of records
meta : list of paths (string or list of strings), default None
Fields to use as metadata for each record in resulting table
meta_prefix : string, default None
record_prefix : string, default None
assumed to be an array of records.
meta : list of paths (str or list of str), default None
Fields to use as metadata for each record in resulting table.
meta_prefix : str, default None
If True, prefix records with dotted (?) path, e.g. foo.bar.field if
meta is ['foo', 'bar'].
record_prefix : str, default None
If True, prefix records with dotted (?) path, e.g. foo.bar.field if
path to records is ['foo', 'bar']
path to records is ['foo', 'bar'].
errors : {'raise', 'ignore'}, default 'raise'

Configures error handling.
* 'ignore' : will ignore KeyError if keys listed in meta are not
always present
always present.
* 'raise' : will raise KeyError if keys listed in meta are not
always present
always present.

.. versionadded:: 0.20.0

sep : string, default '.'
Nested records will generate names separated by sep,
e.g., for sep='.', { 'foo' : { 'bar' : 0 } } -> foo.bar
sep : str, default '.'
Nested records will generate names separated by sep.
e.g., for sep='.', { 'foo' : { 'bar' : 0 } } -> foo.bar.

.. versionadded:: 0.20.0

max_level : int, default None
Max number of levels(depth of dict) to normalize.
if None, normalizes all levels.
Member: Minor capitalization issue here


.. versionadded:: 0.25.0

ignore_keys : list, keys to ignore, default None
List of keys that you do not want to normalize.

.. versionadded:: 0.25.0

Returns
-------
frame : DataFrame
Returns a JSON normalized Dataframe.

Examples
--------
@@ -149,6 +182,20 @@ def json_normalize(data, record_path=None, meta=None,
1 NaN NaN Regner NaN Mose NaN
2 2.0 Faye Raker NaN NaN NaN NaN

>>> from pandas.io.json import json_normalize
>>> data = [{'id': 1,
... 'name': {'first': 'Cole', 'last': 'Volk'},
... 'fitness': {'height': 130, 'weight': 60}},
... {'name': {'given': 'Mose', 'family': 'Reg'},
... 'fitness': {'height': 130, 'weight': 60}},
... {'id': 2, 'name': 'Faye Raker',
... 'fitness': {'height': 130, 'weight': 60}}]
>>> json_normalize(data, max_level=1, ignore_keys=['name'])
fitness.height fitness.weight id name
0 130 60 1.0 {'first': 'Cole', 'last': 'Volk'}
1 130 60 NaN {'given': 'Mose', 'family': 'Reg'}
2 130 60 2.0 Faye Raker

>>> data = [{'state': 'Florida',
... 'shortname': 'FL',
... 'info': {
@@ -167,12 +214,12 @@ def json_normalize(data, record_path=None, meta=None,
>>> result = json_normalize(data, 'counties', ['state', 'shortname',
... ['info', 'governor']])
>>> result
name population info.governor state shortname
0 Dade 12345 Rick Scott Florida FL
1 Broward 40000 Rick Scott Florida FL
2 Palm Beach 60000 Rick Scott Florida FL
3 Summit 1234 John Kasich Ohio OH
4 Cuyahoga 1337 John Kasich Ohio OH
name population state shortname info.governor
Contributor: why is there a period here?
Contributor Author: Because info.governor is nested.

0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich

>>> data = {'A': [1, 2]}
>>> json_normalize(data, 'A', record_prefix='Prefix.')
Expand All @@ -197,6 +244,8 @@ def _pull_field(js, spec):
if isinstance(data, dict):
data = [data]

ignore_keys = ignore_keys if ignore_keys else []

if record_path is None:
if any([isinstance(x, dict) for x in y.values()] for y in data):
# naive normalization, this is idempotent for flat records
@@ -206,7 +255,9 @@ def _pull_field(js, spec):
#
# TODO: handle record value which are lists, at least error
# reasonably
data = nested_to_record(data, sep=sep)
data = nested_to_record(data, sep=sep,
max_level=max_level,
ignore_keys=ignore_keys)
return DataFrame(data)
elif not isinstance(record_path, list):
record_path = [record_path]
@@ -241,10 +292,13 @@ def _recursive_extract(data, path, seen_meta, level=0):
else:
for obj in data:
recs = _pull_field(obj, path[0])
recs = [nested_to_record(r, sep=sep,
max_level=max_level,
ignore_keys=ignore_keys)
if isinstance(r, dict) else r for r in recs]

# For repeating the metadata later
lengths.append(len(recs))

for val, key in zip(meta, meta_keys):
if level + 1 > len(val):
meta_val = seen_meta[key]
@@ -260,7 +314,6 @@ def _recursive_extract(data, path, seen_meta, level=0):
"{err} is not always present"
.format(err=e))
meta_vals[key].append(meta_val)

records.extend(recs)

_recursive_extract(data, record_path, {}, level=0)
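
The lengths list collected above ("For repeating the metadata later") is what drives the metadata broadcast in the final hunk; a standalone sketch of that mechanism, not part of this diff:

    import numpy as np

    lengths = [3, 2]                      # records extracted per input object
    governors = ['Rick Scott', 'John Kasich']

    # Each metadata value is repeated once per record it belongs to, with
    # dtype=object so the values are not coerced to a common string dtype.
    np.array(governors, dtype=object).repeat(lengths)
    # array(['Rick Scott', 'Rick Scott', 'Rick Scott',
    #        'John Kasich', 'John Kasich'], dtype=object)
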
@@ -279,8 +332,5 @@ def _recursive_extract(data, path, seen_meta, level=0):
if k in result:
raise ValueError('Conflicting metadata name {name}, '
'need distinguishing prefix '.format(name=k))

# forcing dtype to object to avoid the metadata being casted to string
result[k] = np.array(v, dtype=object).repeat(lengths)

return result
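
To close out, a hypothetical end-to-end call exercising the new keywords alongside record_path and meta, assuming the API lands as written in this diff (data invented for illustration; column order may differ):

    from pandas.io.json import json_normalize

    data = [{'state': 'Ohio',
             'counties': [{'name': 'Summit',
                           'stats': {'population': 1234,
                                     'area': {'sq_km': 100}}}]}]

    # Each record under 'counties' is flattened only one level deep, so
    # 'stats' splits into 'stats.population' and 'stats.area', while the
    # inner {'sq_km': 100} dict is kept intact because max_level=1.
    json_normalize(data, record_path='counties', meta=['state'], max_level=1)
    #      name  stats.population      stats.area state
    # 0  Summit              1234  {'sq_km': 100}  Ohio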