Skip to content

ENH: Optimize nrows in read_excel #33281

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 79 commits into from

Conversation

mproszewska
Copy link
Contributor

  • closes read_excel opimize nrows #32727
  • tests passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • If header, skiprows and nrows are integers, rows that will be skipped are not loaded in get_sheet_data function.

@mproszewska
Copy link
Contributor Author

Tests which are not connected to that PR fail.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Can you add tests?

@WillAyd WillAyd added the IO Excel read_excel, to_excel label Apr 4, 2020
@mproszewska mproszewska requested a review from WillAyd April 4, 2020 23:22
@mproszewska
Copy link
Contributor Author

But tests for readers already exist and include header, skiprows and nrows variables. Should I add something more?

@alimcmaster1 alimcmaster1 added the Performance Memory or execution speed performance label Apr 11, 2020
@alimcmaster1
Copy link
Member

Mind fixing up the code checks @mproszewska and I agree with @mroeschke comments we should find a way to get rid of duplicate code

@jbrockmendel
Copy link
Member

The CI is failing because of an unused import of _validate_integer

@pep8speaks
Copy link

pep8speaks commented Apr 16, 2020

Hello @mproszewska! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-08 17:08:14 UTC

@@ -80,6 +87,16 @@ def get_sheet_data(self, sheet, convert_float: bool) -> List[List[Scalar]]:
table: List[List[Scalar]] = []

for i, sheet_row in enumerate(sheet_rows):

should_continue, should_break = self.should_skip_row(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this interface is obvious at all. prefer simple routines

if should_skip_rows(...):
....

when is a break condition? can you just do it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it. Break condition is now handled differently

@WillAyd
Copy link
Member

WillAyd commented Jun 3, 2020

Can you add a benchmark for this to show performance improvement?

@WillAyd
Copy link
Member

WillAyd commented Jun 7, 2020

@mproszewska can you post the ASV results here as a comment?

@mproszewska
Copy link
Contributor Author

[ 75.00%] ··· io.excel.ReadExcel.time_read_excel_nrows                                   ok
[ 75.00%] ··· ========== ===========
                engine              
              ---------- -----------
                 xlrd     1.92±0.1s 
               openpyxl   4.69±0.5s 
                 odf       11.3±1s  
              ========== ===========

[ 75.00%] · For pandas commit f0a4e8e5 <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython0.29.16-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 75.00%] ·· Benchmarking conda-py3.6-Cython0.29.16-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· Setting up io/excel.py:62                                                  ok
[100.00%] ··· io.excel.ReadExcel.time_read_excel_nrows                                   ok
[100.00%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd     2.11±0.08s 
               openpyxl   4.65±0.5s  
                 odf      14.5±0.1s  
              ========== ============

@WillAyd
Copy link
Member

WillAyd commented Jul 8, 2020

Can you fix merge master and fix merge conflicts? That may also fix CI

Can you post the output of a continuous asv run instead of just the snippet? It should provide results across multiple runs at the end

@simonjayhawkins
Copy link
Member

@mproszewska can you address comments?

@WillAyd
Copy link
Member

WillAyd commented Aug 18, 2020

Closing as I think this is stale but @mproszewska ping if you'd like to pick back up and can fix merge conflicts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_excel opimize nrows
9 participants