BUG: some read_excel engines still load trailing blank cells #41167

ahawryluk · 2021-04-26T18:26:31Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

pd.read_excel('trailing_blanks.xlsx').shape

Test files available here.

I loaded this sample spreadsheet in all five filetypes to see what shape is returned:

filetype	pandas 1.2.x	master
xls	(3, 3)	(3, 3)
xlsx	(11, 6)	(3, 6)
xlsm	(11, 6)	(3, 6)
xlsb	(11, 6)	(11, 6)
ods	(3, 3)	(3, 3)

Expected Output

(3, 3)

Problem description

Spreadsheet files can contain cells with no values, such as the formatted blank cells shown below. In the worst cases, the cell used to contain a value or formatting but currently contains neither and is thus invisible to the user of the spreadsheet application. The trailing rows of NaNs in xls[x|m] files were recently resolved in #39547 by @rhshadrach but the additional columns still remain. In a case of the size pictured here, it's a minor annoyance, but I've come across spreadsheets in the wild where one more cells were unintentionally created on row 2**20 or column 2**14, which can create a severe performance issues. To fix the performance issue, the empty rows/columns must be trimmed before the data is passed to TextParser.

I suspect that these bug reports are related:
#40569
#40976

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 1e91282
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-50-generic
Version : #56-Ubuntu SMP Mon Apr 12 17:18:36 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8

pandas : 1.3.0.dev0+1412.g1e912821c0
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 49.6.0.post20210108
Cython : 0.29.21
pytest : 6.2.2
hypothesis : 6.1.1
sphinx : 3.4.3
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.20.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.5
fastparquet : 0.5.0
gcsfs : 0.7.1
matplotlib : 3.3.4
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : 1.0.8
s3fs : 0.5.2
scipy : 1.6.0
sqlalchemy : 1.3.23
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.52.0

The text was updated successfully, but these errors were encountered:

ahawryluk · 2021-04-26T19:16:59Z

take

ahawryluk added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 26, 2021

github-actions bot assigned ahawryluk Apr 26, 2021

lithomas1 added IO Excel read_excel, to_excel Performance Memory or execution speed performance and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 27, 2021

ahawryluk mentioned this issue Apr 29, 2021

BUG: read_excel trailing blank rows and columns #41227

Merged

4 tasks

jreback added this to the 1.3 milestone May 5, 2021

jreback closed this as completed in #41227 May 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: some read_excel engines still load trailing blank cells #41167

BUG: some read_excel engines still load trailing blank cells #41167

ahawryluk commented Apr 26, 2021

INSTALLED VERSIONS

ahawryluk commented Apr 26, 2021

BUG: some read_excel engines still load trailing blank cells #41167

BUG: some read_excel engines still load trailing blank cells #41167

Comments

ahawryluk commented Apr 26, 2021

Code Sample, a copy-pastable example

Expected Output

Problem description

Output of pd.show_versions()

INSTALLED VERSIONS

ahawryluk commented Apr 26, 2021

Output of `pd.show_versions()`