BUG: read_excel trailing blank rows and columns #41227

ahawryluk · 2021-04-29T23:13:08Z

closes BUG: some read_excel engines still load trailing blank cells #41167
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

phofl · 2021-05-01T15:01:56Z

Not an expert here, but could you run asvs for read_excel?

ahawryluk · 2021-05-03T22:44:21Z

@phofl here are the asvs on my machine (asv run -E existing --bench ReadExcel)

engine	this branch	master
xlrd	35.8±0.1ms	36.1±0.1ms
openpyxl	166±3ms	165±0.4ms
odf	678±5ms	692±4ms

The asv test data has no trailing cells, so we don't see a measurable impact. I also tested both branches on a sample .xlsx file with 1000 rows × 2 columns and a single formatted cell on row 0, column 2**14.
blank_cell_XFD1.xlsx

master 2.84 s
this branch 48.9 ms
(both times best of 5)
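
For readers following along, here is a minimal sketch of the trimming idea in plain Python. It is not the code merged in this PR: `trim_trailing_blanks` and `is_blank` are hypothetical names, and treating `None` and `""` as blank is an assumption made for illustration. It drops trailing blank cells from each row, drops trailing blank rows, and pads the remaining rows back to a common width so that blanks inside the data survive.

```python
from typing import Any, List


def is_blank(cell: Any) -> bool:
    # Assumption for this sketch: a cell is blank if it is None or "".
    return cell is None or cell == ""


def trim_trailing_blanks(rows: List[List[Any]]) -> List[List[Any]]:
    trimmed = []
    for row in rows:
        row = list(row)  # copy so the caller's data is not mutated
        # Drop blank cells at the end of the row only.
        while row and is_blank(row[-1]):
            row.pop()
        trimmed.append(row)

    # Drop rows at the end of the sheet that are now empty.
    while trimmed and not trimmed[-1]:
        trimmed.pop()

    # Pad back to a rectangle; interior blank rows and cells are preserved.
    width = max((len(row) for row in trimmed), default=0)
    return [row + [None] * (width - len(row)) for row in trimmed]


sheet = [
    ["a", "b", None, None],
    [1, None, None, ""],
    [None, None, None, None],  # blank row inside the data: kept
    [3, 4, None, None],
    [None, None, None, None],  # trailing blank rows: dropped
    [None, "", None, None],
]
print(trim_trailing_blanks(sheet))
```

A sheet like the `blank_cell_XFD1.xlsx` sample is the extreme case of this: a single formatted but empty cell far to the right widens every row, so trimming before the DataFrame is built avoids materializing thousands of empty columns.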

jreback

small comment.

cc @rhshadrach ok here?

jreback · 2021-05-05T13:05:19Z

doc/source/whatsnew/v1.3.0.rst

@@ -799,6 +799,7 @@ I/O
 - Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
 - Bug in :func:`read_orc` always raising ``AttributeError`` (:issue:`40918`)
 - Bug in the conversion from pyarrow to pandas (e.g. for reading Parquet) with nullable dtypes and a pyarrow array whose data buffer size is not a multiple of dtype size (:issue:`40896`)
+- Bug in :func:`read_excel` loading trailing empty rows/columns for some filetypes (:issue:`41167`)


can you put this next to this one (or combine them, as i think they are basically the same)

- Bug in :func:`read_excel` dropping empty values from single-column spreadsheets (:issue:`39808`)

ahawryluk · 2021-05-05

I moved the whatsnew line, but left the two items separate since one bug dropped NaNs within the data and the other bug loaded extra NaNs outside the data. Thanks for reviewing this.

jreback · 2021-05-06T01:41:30Z

thanks @ahawryluk

rhshadrach · 2021-05-06T02:17:49Z

lgtm thanks @ahawryluk

ahawryluk added 4 commits April 26, 2021 10:31

Add test_trailing_blanks, which currently fails

1e91282

Trim trailing blank cells from xlsx/m and xlsb

8cf8d94

whatsnew entry

6218039

Merge branch 'master' into trailing_blanks

254aa45

ahawryluk marked this pull request as ready for review April 30, 2021 01:36

jreback requested changes May 5, 2021

jreback added the IO Excel read_excel, to_excel label May 5, 2021

jreback added this to the 1.3 milestone May 5, 2021

jreback added the Bug label May 5, 2021

Move whatsnew item

0962d21

jreback approved these changes May 6, 2021

jreback merged commit ae5fe34 into pandas-dev:master May 6, 2021

ahawryluk deleted the trailing_blanks branch May 6, 2021 04:00

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

BUG: read_excel trailing blank rows and columns (pandas-dev#41227)

9436a60
