ENH: Optimize nrows in read_excel #33281

mproszewska · 2020-04-04T01:10:43Z

closes read_excel opimize nrows #32727
tests passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
If header, skiprows and nrows are all integers, the get_sheet_data function no longer loads the rows that would be skipped.
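
A minimal sketch of the idea, not the code in this PR: when all three arguments are integers, the range of sheet rows the parser needs is known up front, so reading can stop right after the last needed row. The names `rows_to_load` and `get_sheet_data_sketch` are hypothetical, and the sketch assumes `skiprows` counts leading rows to drop and `header` is the header row's offset after that.

```python
from itertools import islice
from typing import Iterable, List, Optional


def rows_to_load(header: int, skiprows: int, nrows: int) -> range:
    """Sheet row indices needed: the header row plus nrows data rows."""
    first = skiprows + header  # index of the header row within the sheet
    return range(first, first + 1 + nrows)


def get_sheet_data_sketch(
    sheet_rows: Iterable[List[object]],
    header: Optional[int] = 0,
    skiprows: Optional[int] = 0,
    nrows: Optional[int] = None,
) -> List[List[object]]:
    """Hypothetical stand-in for a reader's get_sheet_data."""
    # Without three integers the needed range is unknown, so load everything.
    if not all(isinstance(v, int) for v in (header, skiprows, nrows)):
        return [list(row) for row in sheet_rows]
    wanted = rows_to_load(header, skiprows, nrows)
    # islice stops pulling rows once wanted.stop is reached.
    return [list(row) for row in islice(sheet_rows, wanted.start, wanted.stop)]


sheet = [["junk"], ["a", "b"], [1, 2], [3, 4], [5, 6]]
consumed = []


def counting_rows():
    # Records which rows were actually pulled from the sheet.
    for row in sheet:
        consumed.append(row)
        yield row


loaded = get_sheet_data_sketch(counting_rows(), header=0, skiprows=1, nrows=2)
```

Here `loaded` holds the header row and two data rows, and the last sheet row is never pulled from the generator.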

mproszewska · 2020-04-04T13:04:37Z

The tests that fail are unrelated to this PR.

WillAyd

Thanks for the PR. Can you add tests?

mproszewska · 2020-04-06T13:49:10Z

But tests for readers already exist and include header, skiprows and nrows variables. Should I add something more?

alimcmaster1 · 2020-04-11T01:50:14Z

@mproszewska, mind fixing up the code checks? I also agree with @mroeschke's comments: we should find a way to get rid of the duplicate code.

jbrockmendel · 2020-04-16T17:30:03Z

The CI is failing because of an unused import of _validate_integer

pep8speaks · 2020-04-16T23:43:17Z

Hello @mproszewska! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-08 17:08:14 UTC

jreback · 2020-05-31T22:17:45Z

pandas/io/excel/_odfreader.py

@@ -80,6 +87,16 @@ def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]:
        table: List[List[Scalar]] = []

        for i, sheet_row in enumerate(sheet_rows):
+
+            should_continue, should_break = self.should_skip_row(


I am not sure this interface is obvious at all. I'd prefer simple routines, e.g.

if should_skip_rows(...):
    ....

When is there a break condition? Can you just handle it here?

mproszewska · Jun 1, 2020

I changed it. The break condition is now handled differently.
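
For readers following the thread, here is one way the simpler interface suggested above could look. This is a sketch under assumptions, not the code that was pushed: `should_skip_row` is given a hypothetical signature, `read_rows` is a made-up driver, and the break condition lives in the loop rather than in the helper.

```python
from typing import Iterable, List, Optional


def should_skip_row(index: int, first_row: int) -> bool:
    """Single-purpose predicate: True for rows before the first needed row."""
    return index < first_row


def read_rows(
    sheet_rows: Iterable[List[object]],
    first_row: int,
    last_row: Optional[int] = None,
) -> List[List[object]]:
    """Hypothetical reader loop; last_row is inclusive, None means no limit."""
    table: List[List[object]] = []
    for i, sheet_row in enumerate(sheet_rows):
        # The break condition is checked here, not returned by the helper.
        if last_row is not None and i > last_row:
            break
        if should_skip_row(i, first_row):
            continue
        table.append(list(sheet_row))
    return table


rows = read_rows([[0], [1], [2], [3], [4]], first_row=1, last_row=3)
```

The helper answers one yes/no question, so the call site reads as a plain `if` with no tuple to unpack.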

WillAyd · 2020-06-03T23:40:30Z

Can you add a benchmark for this to show performance improvement?

WillAyd · 2020-06-07T21:30:19Z

@mproszewska can you post the ASV results here as a comment?

mproszewska · 2020-06-08T17:08:10Z

[ 75.00%] ··· io.excel.ReadExcel.time_read_excel_nrows                                   ok
[ 75.00%] ··· ========== ===========
                engine              
              ---------- -----------
                 xlrd     1.92±0.1s 
               openpyxl   4.69±0.5s 
                 odf       11.3±1s  
              ========== ===========

[ 75.00%] · For pandas commit f0a4e8e5 <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython0.29.16-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Setting up io/excel.py:62                                                  ok
[100.00%] ··· io.excel.ReadExcel.time_read_excel_nrows                                   ok
[100.00%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd     2.11±0.08s 
               openpyxl   4.65±0.5s  
                 odf      14.5±0.1s  
              ========== ============
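
To read the two tables side by side, the ratios can be worked out as below. This assumes the first table, whose header is cut off in the snippet, comes from the PR branch; the second is labelled as master. The dictionary names are only illustrative.

```python
# Mean timings in seconds, copied from the two tables above.
first_table = {"xlrd": 1.92, "openpyxl": 4.69, "odf": 11.3}  # assumed: PR branch
master = {"xlrd": 2.11, "openpyxl": 4.65, "odf": 14.5}

# Ratio below 1.0 means the first table is faster than master.
ratios = {
    engine: round(first_table[engine] / master[engine], 2) for engine in master
}
for engine, ratio in ratios.items():
    print(f"{engine:<9} {ratio:.2f}")
```

The xlrd and openpyxl differences are about the same size as the reported spread (±0.08s to ±0.5s), so only the odf difference clearly exceeds the noise in this single run.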

WillAyd · 2020-07-08T16:20:27Z

Can you merge master and fix the merge conflicts? That may also fix CI.

Can you post the output of a continuous asv run instead of just the snippet? It should provide results across multiple runs at the end

simonjayhawkins · 2020-07-24T12:34:47Z

@mproszewska can you address comments?

WillAyd · 2020-08-18T19:44:28Z

Closing as I think this is stale but @mproszewska ping if you'd like to pick back up and can fix merge conflicts

mproszewska added 7 commits March 27, 2020 19:56

ENH: Skip rows while reading excel file with engine=openpyxl

900afff

ENH: Skiping rows with odf engine

df55b51

ENH: Optimize nrows in read_excel

8177024

Reformatted

79b34c3

Fix linting

f0a2b8d

Add annotation to variable

70ac234

Add imports

27cae3a

WillAyd requested changes Apr 4, 2020

pandas/io/excel/_base.py (outdated)

WillAyd added the IO Excel read_excel, to_excel label Apr 4, 2020

Add types

4248f8c

mproszewska requested a review from WillAyd April 4, 2020 23:22

mproszewska added 2 commits April 9, 2020 20:37

ENH: Fix

70f46b3

ENH: Mark variables as optional

cdfc05d

mroeschke reviewed Apr 9, 2020

pandas/io/excel/_xlrd.py (outdated)

mroeschke reviewed Apr 9, 2020

pandas/io/excel/_xlrd.py (outdated)

mproszewska added 3 commits April 9, 2020 22:35

Merge branch 'master' into excel

502b5e3

ENH: Move nrows variable check

4c8a42a

ENH: Remove unused imports

19bb927

alimcmaster1 added the Performance Memory or execution speed performance label Apr 11, 2020

mproszewska added 2 commits April 17, 2020 01:36

ENH: Move repeated code to base

6c2a3b5

ENH: Remove import

b865c88

mproszewska added 2 commits April 17, 2020 01:43

ENH: Lint

49276da

ENH: Lint

393a622

mroeschke reviewed Apr 17, 2020

pandas/io/excel/_base.py (outdated)

ENH: Add docstring to should_read_row

e00fff1

mproszewska added 5 commits May 23, 2020 00:28

Remove asv

2766270

Merge branch 'perf'

91176ca

Merge remote-tracking branch 'upstream/master'

f748b78

Resolve conflict

ac823f5

Resolve conflict

f4a805d

jreback requested changes May 31, 2020

mproszewska added 7 commits June 1, 2020 01:40

Revert change

6f188fe

Change should_skip_row function

ba314fe

Fix return type

f923bfd

Remove import

008add5

Merge remote-tracking branch 'upstream/master'

c04c494

Resolve conflict

596806c

Run tests

2226050

mproszewska added 6 commits June 5, 2020 02:54

Add asv

9216210

Add asv

d9aa319

Merge branch 'master' into excel

094d5f7

Resolve conflict

234dcc6

Fix

0afb1b1

Merge branch 'master' into excel

33cb733

mproszewska added 3 commits June 8, 2020 17:22

Fix asv

06003a8

Fix asv

c08709b

Fix asv

c9a2c75

WillAyd closed this Aug 18, 2020

MarcoGorelli mentioned this pull request Aug 29, 2020

ENH: Optimize nrows in read_excel #35974

Merged

5 tasks
