BUG: read_excel trailing blank rows and columns #41227

Merged
merged 5 commits into from
May 6, 2021

ahawryluk
Contributor

@ahawryluk ahawryluk marked this pull request as ready for review April 30, 2021 01:36
@phofl
Member

phofl commented May 1, 2021

Not an expert here, but could you run asvs for read_excel?

@ahawryluk
Contributor Author

@phofl here are the asvs on my machine (asv run -E existing --bench ReadExcel)

engine      this branch    master
xlrd        35.8±0.1ms     36.1±0.1ms
openpyxl    166±3ms        165±0.4ms
odf         678±5ms        692±4ms

The asv test data has no trailing cells, so we don't see a measurable impact. I also tested both branches on a sample .xlsx file with 1000 rows × 2 columns and a single formatted cell on row 0, column 2**14.
blank_cell_XFD1.xlsx

master 2.84 s
this branch 48.9 ms
(both times best of 5)
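
A rough sketch of how a file like blank_cell_XFD1.xlsx could be rebuilt to reproduce the comparison above. The helper names (`column_letter`, `make_sample_file`) and the bold formatting are hypothetical and not taken from the PR. The sketch assumes openpyxl is installed and treats 2**14 as the 1-based column number, which corresponds to XFD, the last column an .xlsx sheet allows.

```python
def column_letter(index: int) -> str:
    """Convert a 1-based column number to its Excel letter name."""
    if index < 1:
        raise ValueError("column numbers are 1-based")
    letters = ""
    while index > 0:
        # Excel column names are bijective base 26: A..Z, AA..AZ, ...
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def make_sample_file(path: str, n_rows: int = 1000) -> None:
    """Write n_rows x 2 columns of data plus one formatted, empty cell.

    Hypothetical helper: openpyxl is imported lazily so that
    column_letter stays usable without it.
    """
    from openpyxl import Workbook
    from openpyxl.styles import Font

    workbook = Workbook()
    sheet = workbook.active
    for i in range(n_rows):
        sheet.append([i, i * 2])
    # Style a cell far to the right without giving it a value.
    # openpyxl rows and columns are 1-based, so this is cell XFD1.
    sheet.cell(row=1, column=2**14).font = Font(bold=True)
    workbook.save(path)


print(column_letter(2**14))  # XFD
```

Reading the resulting file with `pandas.read_excel(path, engine="openpyxl")` on each branch should show the gap, although absolute timings will differ by machine.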

Contributor

@jreback jreback left a comment

small comment.

cc @rhshadrach ok here?

@@ -799,6 +799,7 @@ I/O
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
- Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
- Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)
- Bug in :func:`read_excel` loading trailing empty rows/columns for some filetypes (:issue:`41167`)
Contributor

can you put this next to this one (or combine them, as i think they are basically the same)

Bug in :func:`read_excel` dropping empty values from single-column spreadsheets (:issue:`39808`)

Contributor Author

I moved the whatsnew line, but left the two items separate since one bug dropped NaNs within the data and the other bug loaded extra NaNs outside the data. Thanks for reviewing this.
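
To illustrate the distinction, here is a minimal sketch of trimming blank cells that lie outside the data while keeping blanks inside it. It is not necessarily the code in this PR: the function name `trim_trailing_blanks` is hypothetical, and it assumes a sheet has already been read into a list of rows with blank cells represented as empty strings.

```python
from __future__ import annotations

from typing import Any

EMPTY = ""  # assumption: blank cells arrive as empty strings


def trim_trailing_blanks(rows: list[list[Any]]) -> list[list[Any]]:
    """Drop trailing blank rows and columns, keeping interior blanks."""
    trimmed: list[list[Any]] = []
    last_row_with_data = -1
    for row_number, row in enumerate(rows):
        row = list(row)  # copy so the caller's data is not mutated
        while row and row[-1] == EMPTY:
            row.pop()  # remove blanks at the end of this row
        if row:
            last_row_with_data = row_number
        trimmed.append(row)

    # Discard every row after the last one that holds data.
    trimmed = trimmed[: last_row_with_data + 1]

    if trimmed:
        # Pad rows back out to a common width so the result is rectangular.
        width = max(len(row) for row in trimmed)
        trimmed = [row + [EMPTY] * (width - len(row)) for row in trimmed]
    return trimmed


sheet = [
    ["a", "b", "", ""],
    ["1", "", "", ""],
    ["", "", "", ""],   # blank row inside the data: kept
    ["3", "4", "", ""],
    ["", "", "", ""],   # blank row after the data: dropped
]
print(trim_trailing_blanks(sheet))
# [['a', 'b'], ['1', ''], ['', ''], ['3', '4']]
```

The blank row in the middle and the blank cell next to "1" survive, which is the behaviour the other whatsnew entry is about, while the two trailing columns and the final row are removed.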

@jreback jreback added the IO Excel read_excel, to_excel label May 5, 2021
@jreback jreback added this to the 1.3 milestone May 5, 2021
@jreback jreback added the Bug label May 5, 2021
@jreback jreback merged commit ae5fe34 into pandas-dev:master May 6, 2021
@jreback
Contributor

jreback commented May 6, 2021

thanks @ahawryluk

@rhshadrach
Member

lgtm thanks @ahawryluk

Successfully merging this pull request may close these issues.

BUG: some read_excel engines still load trailing blank cells