You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I loaded this sample spreadsheet in all five filetypes to see what shape is returned:
filetype
pandas 1.2.x
master
xls
(3, 3)
(3, 3)
xlsx
(11, 6)
(3, 6)
xlsm
(11, 6)
(3, 6)
xlsb
(11, 6)
(11, 6)
ods
(3, 3)
(3, 3)
Expected Output
(3, 3)
Problem description
Spreadsheet files can contain cells with no values, such as the formatted blank cells shown below. In the worst cases, the cell used to contain a value or formatting but currently contains neither and is thus invisible to the user of the spreadsheet application. The trailing rows of NaNs in xls[x|m] files were recently resolved in #39547 by @rhshadrach but the additional columns still remain. In a case of the size pictured here, it's a minor annoyance, but I've come across spreadsheets in the wild where one more cells were unintentionally created on row 2**20 or column 2**14, which can create a severe performance issues. To fix the performance issue, the empty rows/columns must be trimmed before the data is passed to TextParser.
I suspect that these bug reports are related: #40569 #40976
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 1e91282
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-50-generic
Version : #56-Ubuntu SMP Mon Apr 12 17:18:36 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Test files available here.
I loaded this sample spreadsheet in all five filetypes to see what shape is returned:
Expected Output
(3, 3)
Problem description
Spreadsheet files can contain cells with no values, such as the formatted blank cells shown below. In the worst cases, the cell used to contain a value or formatting but currently contains neither and is thus invisible to the user of the spreadsheet application. The trailing rows of NaNs in xls[x|m] files were recently resolved in #39547 by @rhshadrach but the additional columns still remain. In a case of the size pictured here, it's a minor annoyance, but I've come across spreadsheets in the wild where one more cells were unintentionally created on row
2**20
or column2**14
, which can create a severe performance issues. To fix the performance issue, the empty rows/columns must be trimmed before the data is passed to TextParser.I suspect that these bug reports are related:
#40569
#40976
Output of
pd.show_versions()
INSTALLED VERSIONS
commit : 1e91282
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-50-generic
Version : #56-Ubuntu SMP Mon Apr 12 17:18:36 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8
pandas : 1.3.0.dev0+1412.g1e912821c0
numpy : 1.19.5
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 49.6.0.post20210108
Cython : 0.29.21
pytest : 6.2.2
hypothesis : 6.1.1
sphinx : 3.4.3
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : 4.6.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.20.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.8.5
fastparquet : 0.5.0
gcsfs : 0.7.1
matplotlib : 3.3.4
numexpr : 2.7.2
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : 1.0.8
s3fs : 0.5.2
scipy : 1.6.0
sqlalchemy : 1.3.23
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.52.0
The text was updated successfully, but these errors were encountered: