PERF: pandas read_excel perf optimisations #47376

Sanix-Darker · 2022-06-15T19:55:11Z

WHAT

Attempt at performance optimisations for pandas read_excel
by passing skiprows (as an integer) down to get_sheet_data for each engine:
openpyxl, xlrd, pyxlsb or odf.
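
To illustrate the idea only (this is not the code in this PR): if an engine exposes its rows lazily, an integer skiprows and an nrows limit can be applied while iterating, so rows outside the requested window are never converted. The helper name `iter_sheet_rows` and its parameters are hypothetical, and the sketch ignores header handling.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional


def iter_sheet_rows(
    rows: Iterable[List[object]],
    skiprows: int = 0,
    nrows: Optional[int] = None,
) -> Iterator[List[object]]:
    """Yield rows lazily, dropping the first `skiprows` rows and stopping after `nrows`.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    stop = None if nrows is None else skiprows + nrows
    # islice consumes the skipped rows without storing them and stops early,
    # so rows after the requested window are never pulled from the source.
    return islice(rows, skiprows, stop)


if __name__ == "__main__":
    # Stand-in for an engine's row iterator (assumption: rows arrive one by one).
    fake_sheet = ([i, f"value {i}"] for i in range(5000))
    window = list(iter_sheet_rows(fake_sheet, skiprows=2000, nrows=3))
    print(window)
```

Whether this helps in practice depends on how much of the file the engine must parse before it can yield the first row.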

STATUS: Work in progress...

closes ENH: Having pandas.read_excel FASTER (with an available proof of concept) #47290
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

BENCHMARKS

BENCH SPECS

OS: Ubuntu 20.04.4 LTS x86_64
Host: 20TH0010FR ThinkPad P1 Gen 3
Kernel: 5.14.0-1038-oem
CPU: Intel i7-10750H (12) @ 5.000GHz
GPU: NVIDIA Quadro T1000 Mobile
GPU: Intel UHD Graphics
Memory: 47849MiB
Python 3.8.6 - [GCC 9.4.0]

BENCH SCRIPT

import pandas as pd
from timeit import default_timer

def bench_mark_func():
    print(f">>> {pd.__version__}")
    for ext in ["xls", "xlsx", "xlsb", "ods"]:
        print(f"\n[{ext}] no nrows, nor skiprows :")
        start = default_timer()
        for i in range(100):
            pd.read_excel(f"./fixtures/benchmark_5000.{ext}")
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print(f"\n[{ext}] with nrows and skiprows (reading top lines):")
        start = default_timer()
        for i in range(100):
            pd.read_excel(
                f"./fixtures/benchmark_5000.{ext}", nrows=50 * i, skiprows=100 + i
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print(f"\n[{ext}] with nrows and skiprows (reading middle lines):")
        start = default_timer()
        for i in range(100):
            pd.read_excel(
                f"./fixtures/benchmark_5000.{ext}", nrows=50 * i, skiprows=2000 + i
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print(f"\n[{ext}] with nrows and skiprows (reading bottom lines):")
        start = default_timer()
        for i in range(100):
            pd.read_excel(
                f"./fixtures/benchmark_5000.{ext}", nrows=50 * i, skiprows=4000 + i
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print("==" * 30)


if __name__ == "__main__":
    bench_mark_func()
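
The script above times a single pass of 100 calls per scenario, so it gives no estimate of run-to-run variation. A possible complement, sketched here under assumptions, is to repeat each measurement and report the mean and standard deviation. The helper name `time_call` is hypothetical, and the workload in the usage part is a placeholder rather than a real read_excel call.

```python
import statistics
import timeit
from typing import Callable, Tuple


def time_call(
    func: Callable[[], object], repeat: int = 5, number: int = 10
) -> Tuple[float, float]:
    """Return (mean, stdev) of per-call seconds over `repeat` batches of `number` calls.

    Hypothetical helper for illustration; `repeat` must be at least 2.
    """
    # timeit.repeat returns one total duration per batch.
    batches = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in batches]
    return statistics.mean(per_call), statistics.stdev(per_call)


if __name__ == "__main__":
    # Placeholder workload; swap in a lambda that calls pd.read_excel on a fixture.
    mean, stdev = time_call(lambda: sum(range(10_000)))
    print(f"mean={mean:.6f}s stdev={stdev:.6f}s")
```

A difference between two branches that is smaller than the standard deviation is hard to distinguish from noise.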

Fixtures are available here : fixtures

BENCH REPORTS

xls format - - -

scenario	main branch (s)	perf branch (s)	diff (s, main - perf)
no nrows, nor skiprows	9.355613674968481	9.48072951193899	-0.1251158369705081
nrows and skiprows (top lines)	8.50429745996371	8.566174849052913	-0.06187738908920437
nrows and skiprows (middle lines)	9.003354059066623	9.093110702931881	-0.0897566438652575
nrows and skiprows (bottom lines)	9.281007815967314	9.35020213900134	-0.06919432303402573
		AVERAGE	-0.08648604823974892

xlsx format + + +

scenario	main branch (s)	perf branch (s)	diff (s, main - perf)
no nrows, nor skiprows	47.444979660911486	47.119721286930144	+0.3252583739813417
nrows and skiprows (top lines)	25.616206042002887	25.10350460803602	+0.5127014339668676
nrows and skiprows (middle lines)	39.18539171805605	38.66645851393696	+0.5189332041190937
nrows and skiprows (bottom lines)	46.56151840998791	45.87342134909704	+0.6880970608908683
		AVERAGE	+0.5112475182395428

xlsb format - - -

scenario	main branch (s)	perf branch (s)	diff (s, main - perf)
no nrows, nor skiprows	9.378919824026525	9.404310946003534	-0.02539112197700888
nrows and skiprows (top lines)	8.567789324908517	8.607287417980842	-0.03949809307232499
nrows and skiprows (middle lines)	9.089394484995864	9.141500656027347	-0.05210617103148252
nrows and skiprows (bottom lines)	9.342884571058676	9.409172061015852	-0.06628748995717615
		AVERAGE	-0.045820719009498134

ods format - - -

scenario	main branch (s)	perf branch (s)	diff (s, main - perf)
no nrows, nor skiprows	9.38043470599223	9.435379554051906	-0.05494484805967659
nrows and skiprows (top lines)	8.553725612931885	8.641130893025547	-0.08740528009366244
nrows and skiprows (middle lines)	8.839795237989165	8.97894280392211	-0.13914756593294442
nrows and skiprows (bottom lines)	9.031556493951939	9.168779775965959	-0.1372232820140198
		AVERAGE	-0.10468024402507581

NOTE

openpyxl is being optimized; the other reader engines are not, and come out slightly slower, but with really small differences.
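
To put the reported differences in proportion, the sketch below expresses them relative to the main-branch time. It uses only the "no nrows, nor skiprows" row of each report; the function name `relative_gain` is hypothetical.

```python
from typing import Dict, Tuple

# (main branch seconds, perf branch seconds) for "no nrows, nor skiprows",
# copied from the bench reports.
NO_ARGS_TIMINGS: Dict[str, Tuple[float, float]] = {
    "xls": (9.355613674968481, 9.48072951193899),
    "xlsx": (47.444979660911486, 47.119721286930144),
    "xlsb": (9.378919824026525, 9.404310946003534),
    "ods": (9.38043470599223, 9.435379554051906),
}


def relative_gain(main_seconds: float, perf_seconds: float) -> float:
    """Return the time saved by the perf branch as a percentage of the main-branch time.

    Positive means the perf branch is faster, negative means it is slower.
    """
    return (main_seconds - perf_seconds) / main_seconds * 100.0


if __name__ == "__main__":
    for ext, (main_s, perf_s) in NO_ARGS_TIMINGS.items():
        print(f"{ext}: {relative_gain(main_s, perf_s):+.2f}%")
```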

phofl · 2022-06-15T21:56:01Z

Hi, thanks for trying this. I am not sure I understand your benchmarks correctly, but this looks like noise to me? You can convert your PR to a draft if you are not ready yet.

github-actions · 2022-07-17T00:05:40Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2022-07-21T17:40:54Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

feat: pandas read_excel perf optimisations

b5999a4

attempts on perf optimisations for pandas read_excel by descending the skiprows to get_sheet_data depending on the engine openpyxl, xlrd, pyxlsb or odf...

datapythonista added the Performance (memory or execution speed performance) and IO Excel (read_excel, to_excel) labels Jun 16, 2022

Sanix-Darker marked this pull request as draft June 16, 2022 07:07

github-actions bot added the Stale label Jul 17, 2022

mroeschke closed this Jul 21, 2022
