PERF: json_normalize #15621


Closed

chris-b1 opened this issue Mar 8, 2017 · 6 comments · Fixed by #40035
Labels: IO JSON (read_json, to_json, json_normalize) · Performance (memory or execution speed)

chris-b1 commented Mar 8, 2017

I haven't looked much at the implementation, but I'm guessing simpler cases like this could be optimized.

In [63]: data = [
    ...:     {'name': 'Name',
    ...:      'value': 1.0,
    ...:      'value2': 2.0,
    ...:      'nested': {'a': 'aa', 'b': 'bb'}}] * 1000000

In [64]: %timeit pd.DataFrame(data)
1 loop, best of 3: 847 ms per loop

In [65]: %timeit pd.io.json.json_normalize(data)
1 loop, best of 3: 20 s per loop
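A standalone version of the comparison above, for anyone reproducing it outside IPython (`pd.json_normalize` is the modern spelling of `pd.io.json.json_normalize`; the record count is reduced so the script finishes quickly, and absolute timings will vary by machine and pandas version):

```python
import timeit

import pandas as pd

# Same record shape as the report above, fewer copies to keep runtime short.
data = [
    {"name": "Name",
     "value": 1.0,
     "value2": 2.0,
     "nested": {"a": "aa", "b": "bb"}}
] * 100_000

# pd.DataFrame leaves the 'nested' dicts as a single object column,
# while json_normalize also flattens them into 'nested.a' / 'nested.b'.
t_frame = timeit.timeit(lambda: pd.DataFrame(data), number=1)
t_norm = timeit.timeit(lambda: pd.json_normalize(data), number=1)

print(f"pd.DataFrame:      {t_frame:.2f}s")
print(f"pd.json_normalize: {t_norm:.2f}s")
```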

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: None
html5lib: 0.999999999
httplib2: 0.9.2
apiclient: 1.5.3
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: 0.2.1


jreback commented Mar 8, 2017

yeah this is all in python code :<

IIRC @wesm has a plan for this in pandas2, so maybe it would be possible to make use of some of that.


wesm commented Mar 12, 2017

Converting lists of dictionaries faster in json_normalize seems perfectly reasonable. I intend to use RapidJSON (https://github.com/miloyip/nativejson-benchmark) to create a faster native JSON->DataFrame reader, since we can circumvent Python objects altogether that way. This can happen well before pandas2 ships by using Arrow tables as an intermediary en route to pandas.

@eewallace

Not sure if this is still on anyone's radar, but I've been dealing with a performance issue at least partly caused by json_normalize. From some profiling, it seems like the biggest problem for my case is the use of deepcopy. For common relatively simple cases of just dictionaries/lists of string and numeric literals, deepcopy seems like a lot of unnecessary overhead. Even if it's needed for some use cases, calling it recursively (when it is doing its own recursive copy) is surely not optimal.
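A quick illustration of that overhead: when leaf values are immutable scalars (strings, numbers), a per-level shallow copy yields an equivalent, independent result without deepcopy's recursive machinery. The record below is hypothetical, just to show the equivalence:

```python
import copy

record = {"name": "Name",
          "value": 1.0,
          "nested": {"a": "aa", "b": "bb"}}

# What a deepcopy-based normalize effectively pays for on every record:
deep = copy.deepcopy(record)

# One shallow copy per level is enough here, because string and numeric
# literals are immutable and never need copying themselves.
shallow = {k: dict(v) if isinstance(v, dict) else v
           for k, v in record.items()}

assert deep == shallow == record
# The inner container is still independent of the original:
assert shallow["nested"] is not record["nested"]
```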

zouhairm commented Jan 17, 2020

Any updates on fixing this, or suggestions for workarounds (maybe some other library that flattens the dictionary)?

I found this library, https://pypi.org/project/flatten-dict/, which seems to make things a bit faster than pd.io.json.json_normalize:

import flatten_dict
import pandas as pd

def json_normalize(arr):
    # Join nested keys with '.' to match pandas' default separator.
    reducer = lambda k1, k2: k2 if k1 is None else k1 + '.' + k2
    flat_arr = [flatten_dict.flatten(i, reducer=reducer) for i in arr]
    return pd.DataFrame(flat_arr)

@smpurkis
Contributor

I'm happy to take a look and potentially make a pull request.

I wrote a pure Python implementation here.
Similar performance to @zouhairm's implementation, though slightly faster.

> I found this library https://pypi.org/project/flatten-dict/ that seems to make things a bit faster than pd.io.json.json_normalize
>
> def json_normalize(arr):
>     reducer = lambda k1, k2: k2 if k1 is None else k1+'.'+k2
>     flat_arr = [flatten_dict.flatten(i, reducer=reducer) for i in arr]
>     return pd.DataFrame(flat_arr)

brief benchmark:

pandas (v1.2.2) implementation time taken: 26.58 seconds
zouhairm implementation time taken: 4.04 seconds
My implementation time taken: 3.51 seconds
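The linked implementation isn't reproduced in the thread; as a rough sketch of the pure-Python approach being benchmarked (the function name, separator handling, and dicts-only recursion are assumptions, not the code that was eventually merged in #40035):

```python
import pandas as pd

def fast_json_normalize(records, sep="."):
    """Flatten a list of (possibly nested) dicts into a DataFrame.

    Hypothetical sketch of a pure-Python fast path: a single recursive
    pass per record, then one DataFrame construction at the end.
    """
    def flatten(d, parent_key=""):
        out = {}
        for k, v in d.items():
            key = f"{parent_key}{sep}{k}" if parent_key else k
            if isinstance(v, dict):
                out.update(flatten(v, key))
            else:
                out[key] = v
        return out

    return pd.DataFrame([flatten(r) for r in records])

data = [{"name": "Name", "value": 1.0,
         "nested": {"a": "aa", "b": "bb"}}] * 3
df = fast_json_normalize(data)
```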

@smpurkis
Contributor

take
