Description
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame({
"A": [.1, .2, .3],
"B": ["a", "b", "c"]
})
df.to_excel("test.xlsx", index=False)
df.to_csv("test.csv", index=False)
with open("test.xlsx", "rb") as f:
df_excel_1 = pd.read_excel(f)
# the below line should fail or return an empty dataframe but doesn't
df_excel_2 = pd.read_excel(f)
# instead, we have that pd.testing.assert_frame_equal(df_excel_1, df_excel_2)
with open("test.csv", "r") as f:
df_csv_1 = pd.read_csv(f)
# the below line will fail with EmptyDataError, as expected
df_csv_2 = pd.read_csv(f)
Issue Description
read_excel does not honor the starting position of the stream.
The problem is likely in pandas.io.excel._base.py:580
which calls .seek(0)
576: if isinstance(self.handles.handle, self._workbook_class):
577: self.book = self.handles.handle
578: elif hasattr(self.handles.handle, "read"):
579: # N.B. xlrd.Book has a read attribute too
580: self.handles.handle.seek(0)
581: try:
582: self.book = self.load_workbook(self.handles.handle, engine_kwargs)
583: except Exception:
584: self.close()
585: raise
586: else:
587: raise ValueError(
588: "Must explicitly set engine if not passing in buffer or path for io."
589: )
Why this is incorrect behaviour
I think calling seek(0) is anti-pattern in how streams are treated. For one, this ignores the current position of the stream for no apparent good reason. As second, a stream is not even required to implement a seek function to be considered a valid stream.
Hence, for example, you see other reports where people are passing a stream to read_excel that does not implement seek, such as, e.g. Requests response (#28825). The latter issue appears to be resolved, but looks like it's been resolved by considering the request Response as a special case that is handled differently.
Path to solution
First would be to remove seek(0) call altogether.
If load_workbook requires the stream to be seekable, however, the handle passed to load_workbook should be wrapped into a BytesIO wrapper. In either case, the stream should be read till the end and/or until a terminating condition is met (I don't know enough about Excel internals) and then left there, and left to the subsequent consumer to either continue or to the caller to close the stream.
As an aside, the call to close the stream should only happen if the ExcelReader is the one that opened it. Otherwise, it should be the caller's responsibility to close it.
Expected Behavior
read_excel should start reading bytes from the current position of the stream and should not reset it.
Installed Versions
INSTALLED VERSIONS
------------------
commit : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673
python : 3.11.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United Kingdom.1252
pandas : 2.1.4
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.3.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None