PERF: 5x speedup for read_json() with orient='index' by avoiding transpose #26773

Merged
merged 1 commit into from
Jul 17, 2019

qwhelan
Contributor

@qwhelan qwhelan commented Jun 10, 2019

The .T operator can be quite slow on mixed-type DataFrames because it creates object dtype columns. In comparison, direct construction with DataFrame.from_dict() can generally be much more efficient.

Making that swap inside pd.read_json() yields a ~5-6x speedup for the orient='index' case:

       before           after         ratio
     [d47fc0cb]       [b0fd99ec]
     <read_json_speedup~1>       <read_json_speedup>
-      5.37±0.03s          907±5ms     0.17  io.json.ReadJSON.time_read_json('index', 'int')
-      5.27±0.01s          804±3ms     0.15  io.json.ReadJSON.time_read_json('index', 'datetime')
  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
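
For readers who want to see the dtype difference described above, here is a minimal sketch. It is not the code from this PR; the sample data and variable names are made up for illustration, and it only assumes the public `pd.DataFrame(...)`, `.T`, and `pd.DataFrame.from_dict(..., orient="index")` calls.

```python
import pandas as pd

# Hypothetical row-oriented data, shaped like JSON written with orient='index':
# outer keys are row labels, inner keys are column names.
data = {
    "row0": {"a": 1, "b": "x", "c": 1.5},
    "row1": {"a": 2, "b": "y", "c": 2.5},
}

# Transpose route: each outer key first becomes a column that mixes ints,
# strings and floats, so every column is object dtype and stays object after .T.
transposed = pd.DataFrame(data).T

# Direct route: build the row-oriented frame in one step, so each column
# is inferred from its own values.
direct = pd.DataFrame.from_dict(data, orient="index")

print(transposed.dtypes)
print(direct.dtypes)
```

Both frames hold the same values; the difference is that the transposed frame keeps object dtype for every column, which is the overhead the description above attributes to .T.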

@WillAyd
Member

WillAyd commented Jun 10, 2019

cc @TomAugspurger would this be related to #24387 at all?

@WillAyd WillAyd added the Performance Memory or execution speed performance label Jun 10, 2019
@WillAyd
Member

WillAyd commented Jun 10, 2019

Ignore my previous comment; I was too focused on the constructor and not the transposition. This makes sense to me.

@codecov

codecov bot commented Jun 10, 2019

Codecov Report

Merging #26773 into master will decrease coverage by 50.5%.
The diff coverage is 0%.

@@             Coverage Diff             @@
##           master   #26773       +/-   ##
===========================================
- Coverage   91.71%   41.21%   -50.51%     
===========================================
  Files         178      178               
  Lines       50771    50771               
===========================================
- Hits        46567    20926    -25641     
- Misses       4204    29845    +25641
Flag Coverage Δ
#multiple ?
#single 41.21% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/json/json.py 63.17% <0%> (-30.07%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/plotting/_matplotlib/__init__.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
pandas/core/sparse/scipy_sparse.py 10.14% <0%> (-89.86%) ⬇️
... and 133 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d47fc0c...b0fd99e. Read the comment docs.

@alimcmaster1
Member

Nice! @qwhelan, mind taking a look at the test cases? (Looks like this changes the order of the index.) https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=12630

>   raise_assert_detail(obj, msg, lobj, robj)
E   AssertionError: DataFrame.columns are different
E   
E   DataFrame.columns values are different (100.0 %)
E   [left]:  Index(['A', 'B', 'C', 'D'], dtype='object')
E   [right]: Index(['D', 'C', 'B', 'A'], dtype='object')

@qwhelan
Contributor Author

qwhelan commented Jun 10, 2019

@alimcmaster1 Given that this only fails on 3.5, I'm guessing this is a dict-orderedness issue in from_dict()
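
To illustrate the ordering guess above, here is a small sketch. It assumes that `DataFrame.from_dict(..., orient="index")` takes its column order from the iteration order of the inner mappings, which is what I would expect on current Python and pandas but is not confirmed in this thread. The data is invented, and `OrderedDict` is used only to make the intended order explicit (plain dicts did not guarantee order on Python 3.5).

```python
from collections import OrderedDict

import pandas as pd

# Hypothetical records whose inner keys arrive in reverse alphabetical order.
records = OrderedDict([
    ("row0", OrderedDict([("D", 4), ("C", 3), ("B", 2), ("A", 1)])),
    ("row1", OrderedDict([("D", 8), ("C", 7), ("B", 6), ("A", 5)])),
])

frame = pd.DataFrame.from_dict(records, orient="index")

# Column order as produced by the constructor (follows the inner key order here).
column_order = list(frame.columns)

# Selecting columns in sorted order makes comparisons independent of dict ordering.
stable = frame[sorted(frame.columns)]

print(column_order)
print(list(stable.columns))
```

If the mapping's iteration order is not guaranteed, as with plain dicts on 3.5, the first list can vary between runs, while the explicitly sorted selection stays the same.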

@jreback jreback added the IO JSON read_json, to_json, json_normalize label Jun 27, 2019
@qwhelan qwhelan force-pushed the read_json_speedup branch 2 times, most recently from d77a2a2 to 5edd63c Compare July 8, 2019 05:44
@jreback jreback added this to the 0.25.0 milestone Jul 8, 2019
@jreback
Contributor

jreback commented Jul 8, 2019

lgtm, can you add a note in Performance for 0.25.0, ping on green.

@qwhelan qwhelan force-pushed the read_json_speedup branch from 5edd63c to cef3d80 Compare July 8, 2019 14:50
@jreback jreback merged commit a373e0e into pandas-dev:master Jul 17, 2019
@jreback
Contributor

jreback commented Jul 17, 2019

thanks @qwhelan

Labels
IO JSON (read_json, to_json, json_normalize), Performance (Memory or execution speed performance)