Skip to content

BUG: some read_excel engines still load trailing blank cells #41167

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
ahawryluk opened this issue Apr 26, 2021 · 1 comment · Fixed by #41227
Closed
3 tasks done

BUG: some read_excel engines still load trailing blank cells #41167

ahawryluk opened this issue Apr 26, 2021 · 1 comment · Fixed by #41227
Assignees
Labels
Bug IO Excel read_excel, to_excel Performance Memory or execution speed performance
Milestone

Comments

@ahawryluk
Copy link
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

pd.read_excel('trailing_blanks.xlsx').shape

Test files available here.

trailing_blanks

I loaded this sample spreadsheet in all five filetypes to see what shape is returned:

filetype pandas 1.2.x master
xls (3, 3) (3, 3)
xlsx (11, 6) (3, 6)
xlsm (11, 6) (3, 6)
xlsb (11, 6) (11, 6)
ods (3, 3) (3, 3)

Expected Output

(3, 3)

Problem description

Spreadsheet files can contain cells with no values, such as the formatted blank cells shown below. In the worst cases, the cell used to contain a value or formatting but currently contains neither and is thus invisible to the user of the spreadsheet application. The trailing rows of NaNs in xls[x|m] files were recently resolved in #39547 by @rhshadrach but the additional columns still remain. In a case of the size pictured here, it's a minor annoyance, but I've come across spreadsheets in the wild where one more cells were unintentionally created on row 2**20 or column 2**14, which can create a severe performance issues. To fix the performance issue, the empty rows/columns must be trimmed before the data is passed to TextParser.

I suspect that these bug reports are related:
#40569
#40976

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 1e91282
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-50-generic
Version : #56-Ubuntu SMP Mon Apr 12 17:18:36 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8

pandas : 1.3.0.dev0+1412.g1e912821c0
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 49.6.0.post20210108
Cython : 0.29.21
pytest : 6.2.2
hypothesis : 6.1.1
sphinx : 3.4.3
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.20.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.5
fastparquet : 0.5.0
gcsfs : 0.7.1
matplotlib : 3.3.4
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : 1.0.8
s3fs : 0.5.2
scipy : 1.6.0
sqlalchemy : 1.3.23
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.52.0

@ahawryluk ahawryluk added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 26, 2021
@ahawryluk
Copy link
Contributor Author

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants