PERF: json_normalize #15621


Closed

chris-b1 opened this issue Mar 8, 2017 · 6 comments · Fixed by #40035
Labels: IO JSON (read_json, to_json, json_normalize) · Performance (memory or execution speed)

chris-b1 commented Mar 8, 2017

I haven't looked much at the implementation, but I'm guessing simpler cases like this could be optimized.

In [63]: data = [
    ...:     {'name': 'Name',
    ...:      'value': 1.0,
    ...:      'value2': 2.0,
    ...:      'nested': {'a': 'aa', 'b': 'bb'}}] * 1000000

In [64]: %timeit pd.DataFrame(data)
1 loop, best of 3: 847 ms per loop

In [65]: %timeit pd.io.json.json_normalize(data)
1 loop, best of 3: 20 s per loop
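A standalone version of the comparison above, for anyone reproducing it outside IPython (`pd.json_normalize` is the modern spelling of `pd.io.json.json_normalize`; the record count is reduced so the script finishes quickly, and absolute timings will vary by machine and pandas version):

```python
import timeit

import pandas as pd

# Same record shape as the report above, fewer copies to keep runtime short.
data = [
    {"name": "Name",
     "value": 1.0,
     "value2": 2.0,
     "nested": {"a": "aa", "b": "bb"}}
] * 100_000

# pd.DataFrame leaves the 'nested' dicts as a single object column,
# while json_normalize also flattens them into 'nested.a' / 'nested.b'.
t_frame = timeit.timeit(lambda: pd.DataFrame(data), number=1)
t_norm = timeit.timeit(lambda: pd.json_normalize(data), number=1)

print(f"pd.DataFrame:      {t_frame:.2f}s")
print(f"pd.json_normalize: {t_norm:.2f}s")
```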

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: None
html5lib: 0.999999999
httplib2: 0.9.2
apiclient: 1.5.3
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: 0.2.1


jreback commented Mar 8, 2017

yeah this is all in python code :<

IIRC @wesm has a plan for this in pandas2, so maybe it would be possible to make use of some of that.


wesm commented Mar 12, 2017

Converting lists of dictionaries faster in json_normalize seems perfectly reasonable. I intend to use RapidJSON (https://github.com/miloyip/nativejson-benchmark) to create a faster native JSON->DataFrame reader, since we can circumvent Python objects altogether that way. This can happen well before pandas2 ships by using Arrow tables as an intermediary en route to pandas.

@eewallace

Not sure if this is still on anyone's radar, but I've been dealing with a performance issue at least partly caused by json_normalize. From some profiling, it seems like the biggest problem for my case is the use of deepcopy. For common relatively simple cases of just dictionaries/lists of string and numeric literals, deepcopy seems like a lot of unnecessary overhead. Even if it's needed for some use cases, calling it recursively (when it is doing its own recursive copy) is surely not optimal.
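A quick illustration of that overhead: when leaf values are immutable scalars (strings, numbers), a per-level shallow copy yields an equivalent, independent result without deepcopy's recursive machinery. The record below is hypothetical, just to show the equivalence:

```python
import copy

record = {"name": "Name",
          "value": 1.0,
          "nested": {"a": "aa", "b": "bb"}}

# What a deepcopy-based normalize effectively pays for on every record:
deep = copy.deepcopy(record)

# One shallow copy per level is enough here, because string and numeric
# literals are immutable and never need copying themselves.
shallow = {k: dict(v) if isinstance(v, dict) else v
           for k, v in record.items()}

assert deep == shallow == record
# The inner container is still independent of the original:
assert shallow["nested"] is not record["nested"]
```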

zouhairm commented Jan 17, 2020

Any updates on fixing this, or suggestions for workarounds (maybe some other library that flattens the dictionary)?

I found this library, https://pypi.org/project/flatten-dict/, which seems to make things a bit faster than pd.io.json.json_normalize:

import flatten_dict
import pandas as pd

def json_normalize(arr):
    # Join nested keys with '.' to match pandas' default separator.
    reducer = lambda k1, k2: k2 if k1 is None else k1 + '.' + k2
    flat_arr = [flatten_dict.flatten(i, reducer=reducer) for i in arr]
    return pd.DataFrame(flat_arr)

@smpurkis
Contributor

I'm happy to take a look and potentially make a pull request.

I wrote a pure Python implementation here.
Similar performance to @zouhairm's implementation, though slightly faster.

> I found this library https://pypi.org/project/flatten-dict/ that seems to make things a bit faster than pd.io.json.json_normalize
>
> def json_normalize(arr):
>     reducer = lambda k1, k2: k2 if k1 is None else k1+'.'+k2
>     flat_arr = [flatten_dict.flatten(i, reducer=reducer) for i in arr]
>     return pd.DataFrame(flat_arr)

brief benchmark:

pandas (v1.2.2) implementation time taken: 26.58 seconds
zouhairm implementation time taken: 4.04 seconds
My implementation time taken: 3.51 seconds
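The linked implementation isn't reproduced in the thread; as a rough sketch of the pure-Python approach being benchmarked (the function name, separator handling, and dicts-only recursion are assumptions, not the code that was eventually merged in #40035):

```python
import pandas as pd

def fast_json_normalize(records, sep="."):
    """Flatten a list of (possibly nested) dicts into a DataFrame.

    Hypothetical sketch of a pure-Python fast path: a single recursive
    pass per record, then one DataFrame construction at the end.
    """
    def flatten(d, parent_key=""):
        out = {}
        for k, v in d.items():
            key = f"{parent_key}{sep}{k}" if parent_key else k
            if isinstance(v, dict):
                out.update(flatten(v, key))
            else:
                out[key] = v
        return out

    return pd.DataFrame([flatten(r) for r in records])

data = [{"name": "Name", "value": 1.0,
         "nested": {"a": "aa", "b": "bb"}}] * 3
df = fast_json_normalize(data)
```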

@smpurkis
Contributor

take
