
PERF: pandas read_excel perf optimisations #47376

Closed

@Sanix-Darker Sanix-Darker commented Jun 15, 2022

WHAT

Attempts to speed up pandas read_excel
by passing skiprows (when it is an integer) down to get_sheet_data for each engine:
openpyxl, xlrd, pyxlsb or odf...
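
To illustrate the idea only, here is a minimal sketch of skipping rows while iterating instead of loading everything and trimming afterwards. `read_rows_sketch` and `fake_sheet` are hypothetical names used for illustration; this is not the pandas `get_sheet_data` implementation, and it assumes the engine exposes its rows as a lazy iterator.

```python
from itertools import islice
from typing import Iterable, List, Optional, Sequence


def read_rows_sketch(
    rows: Iterable[Sequence[object]],
    skiprows: int = 0,
    nrows: Optional[int] = None,
) -> List[List[object]]:
    """Hypothetical helper, not pandas code: keep only the requested window of rows."""
    stop = None if nrows is None else skiprows + nrows
    # islice still advances past the skipped rows, but they are never
    # copied or converted, which is where any saving would come from.
    return [list(row) for row in islice(rows, skiprows, stop)]


# Stand-in for an engine's lazy row iterator (an assumption for this sketch).
fake_sheet = ([i, i * 2] for i in range(10))
window = read_rows_sketch(fake_sheet, skiprows=3, nrows=2)
print(window)
```

Whether this helps in practice depends on how much work each engine does per row before the rows reach this point.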

STATUS: Work in progress...

BENCHMARKS

BENCH SPECS

OS: Ubuntu 20.04.4 LTS x86_64
Host: 20TH0010FR ThinkPad P1 Gen 3
Kernel: 5.14.0-1038-oem
CPU: Intel i7-10750H (12) @ 5.000GHz
GPU: NVIDIA Quadro T1000 Mobile
GPU: Intel UHD Graphics
Memory: 47849MiB
Python 3.8.6 - [GCC 9.4.0]

BENCH SCRIPT
import pandas as pd
from timeit import default_timer

def bench_mark_func():
    print(f">>> {pd.__version__}")
    for ext in ["xls", "xlsx", "xlsb", "ods"]:
        print(f"\n[{ext}] no nrows, nor skiprows :")
        start = default_timer()
        for i in range(100):
            pd.read_excel(f"./fixtures/benchmark_5000.{ext}")
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print(f"\n[{ext}] with nrows and skiprows (reading top lines):")
        start = default_timer()
        for i in range(100):
            pd.read_excel(
                f"./fixtures/benchmark_5000.{ext}", nrows=50 * i, skiprows=100 + i
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print(f"\n[{ext}] with nrows and skiprows (reading middle lines):")
        start = default_timer()
        for i in range(100):
            pd.read_excel(
                f"./fixtures/benchmark_5000.{ext}", nrows=50 * i, skiprows=2000 + i
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print(f"\n[{ext}] with nrows and skiprows (reading bottom lines):")
        start = default_timer()
        for i in range(100):
            pd.read_excel(
                f"./fixtures/benchmark_5000.{ext}", nrows=50 * i, skiprows=4000 + i
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print("==" * 30)


if __name__ == "__main__":
    bench_mark_func()

Fixtures are available here: fixtures
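
Because each scenario above is timed once as a single 100-iteration loop, run-to-run variation is not visible. A possible complement, sketched here with a hypothetical `time_scenario` helper (not part of the script above), is to collect several samples with `timeit.repeat` and report their spread.

```python
import statistics
import timeit
from typing import Callable, Dict


def time_scenario(
    func: Callable[[], object], repeat: int = 5, number: int = 10
) -> Dict[str, float]:
    """Hypothetical helper: time `func` `repeat` times, `number` calls per sample."""
    samples = timeit.repeat(func, repeat=repeat, number=number)
    return {
        "min": min(samples),
        "mean": statistics.mean(samples),
        # stdev needs at least two samples; report 0.0 otherwise.
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


# Cheap stand-in workload so the sketch runs without the fixture files.
result = time_scenario(lambda: sum(range(1000)), repeat=3, number=5)
print(result)
```

With the fixtures in place, the callable could be something like `lambda: pd.read_excel("./fixtures/benchmark_5000.xlsx", nrows=50, skiprows=100)`. A difference between branches that is smaller than the reported `stdev` is hard to tell apart from noise.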

BENCH REPORTS
xls format - - -

| scenario | main branch | perf branch | diff (seconds) |
|---|---|---|---|
| no nrows, nor skiprows | 9.355613674968481 | 9.48072951193899 | -0.1251158369705081 |
| nrows and skiprows (top lines) | 8.50429745996371 | 8.566174849052913 | -0.06187738908920437 |
| nrows and skiprows (middle lines) | 9.003354059066623 | 9.093110702931881 | -0.0897566438652575 |
| nrows and skiprows (bottom lines) | 9.281007815967314 | 9.35020213900134 | -0.06919432303402573 |
| AVERAGE | | | -0.08648604823974892 |

xlsx format + + +

| scenario | main branch | perf branch | diff (seconds) |
|---|---|---|---|
| no nrows, nor skiprows | 47.444979660911486 | 47.119721286930144 | +0.3252583739813417 |
| nrows and skiprows (top lines) | 25.616206042002887 | 25.10350460803602 | +0.5127014339668676 |
| nrows and skiprows (middle lines) | 39.18539171805605 | 38.66645851393696 | +0.5189332041190937 |
| nrows and skiprows (bottom lines) | 46.56151840998791 | 45.87342134909704 | +0.6880970608908683 |
| AVERAGE | | | +0.5112475182395428 |

xlsb format - - -

| scenario | main branch | perf branch | diff (seconds) |
|---|---|---|---|
| no nrows, nor skiprows | 9.378919824026525 | 9.404310946003534 | -0.02539112197700888 |
| nrows and skiprows (top lines) | 8.567789324908517 | 8.607287417980842 | -0.03949809307232499 |
| nrows and skiprows (middle lines) | 9.089394484995864 | 9.141500656027347 | -0.05210617103148252 |
| nrows and skiprows (bottom lines) | 9.342884571058676 | 9.409172061015852 | -0.06628748995717615 |
| AVERAGE | | | -0.045820719009498134 |

ods format - - -

| scenario | main branch | perf branch | diff (seconds) |
|---|---|---|---|
| no nrows, nor skiprows | 9.38043470599223 | 9.435379554051906 | -0.05494484805967659 |
| nrows and skiprows (top lines) | 8.553725612931885 | 8.641130893025547 | -0.08740528009366244 |
| nrows and skiprows (middle lines) | 8.839795237989165 | 8.97894280392211 | -0.13914756593294442 |
| nrows and skiprows (bottom lines) | 9.031556493951939 | 9.168779775965959 | -0.1372232820140198 |
| AVERAGE | | | -0.10468024402507581 |

NOTE

While the openpyxl engine (xlsx) gets slightly faster, the other reader engines do not... but the differences are 'really low'.
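
To put those differences in proportion, the sketch below expresses the diff as a percentage of the main-branch time. The timings are copied from the "no nrows, nor skiprows" rows of BENCH REPORTS; `relative_change` is a hypothetical helper written for this illustration.

```python
# (main branch, perf branch) timings in seconds for 100 reads,
# copied from the "no nrows, nor skiprows" rows of BENCH REPORTS.
timings = {
    "xls": (9.355613674968481, 9.48072951193899),
    "xlsx": (47.444979660911486, 47.119721286930144),
    "xlsb": (9.378919824026525, 9.404310946003534),
    "ods": (9.38043470599223, 9.435379554051906),
}


def relative_change(main: float, perf: float) -> float:
    """Percentage of the main-branch time saved (positive) or lost (negative)."""
    return (main - perf) / main * 100


changes = {ext: relative_change(main, perf) for ext, (main, perf) in timings.items()}
for ext, pct in changes.items():
    print(f"{ext}: {pct:+.2f}% relative to main")
```

For these rows every change is below roughly 1.5% in either direction.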

attempts on perf optimisations for pandas read_excel
by descending the skiprows to get_sheet_data depending on the engine
openpyxl, xlrd, pyxlsb or odf...
@phofl
Member

phofl commented Jun 15, 2022

Hi, thanks for trying this. I am not sure I understand your benchmarks correctly, but this looks like noise to me. You can convert your PR to a draft if you are not ready yet.

@datapythonista datapythonista added the Performance (Memory or execution speed performance) and IO Excel (read_excel, to_excel) labels Jun 16, 2022
@Sanix-Darker Sanix-Darker marked this pull request as draft June 16, 2022 07:07
@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 17, 2022
@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Jul 21, 2022
Successfully merging this pull request may close these issues.

ENH: Having pandas.read_excel FASTER (with an available proof of concept)