I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

df = pd.DataFrame({
    "A": [.1, .2, .3],
    "B": ["a", "b", "c"]
})
df.to_excel("test.xlsx", index=False)
df.to_csv("test.csv", index=False)

with open("test.xlsx", "rb") as f:
    df_excel_1 = pd.read_excel(f)
    # the below line should fail or return an empty dataframe but doesn't
    df_excel_2 = pd.read_excel(f)
    # instead, we have that
    pd.testing.assert_frame_equal(df_excel_1, df_excel_2)

with open("test.csv", "r") as f:
    df_csv_1 = pd.read_csv(f)
    # the below line will fail with EmptyDataError, as expected
    df_csv_2 = pd.read_csv(f)
Issue Description
read_excel does not honor the starting position of the stream.
The problem is likely in pandas.io.excel._base.py:580 which calls .seek(0)
576: if isinstance(self.handles.handle, self._workbook_class):
577:     self.book = self.handles.handle
578: elif hasattr(self.handles.handle, "read"):
579:     # N.B. xlrd.Book has a read attribute too
580:     self.handles.handle.seek(0)
581:     try:
582:         self.book = self.load_workbook(self.handles.handle, engine_kwargs)
583:     except Exception:
584:         self.close()
585:         raise
586: else:
587:     raise ValueError(
588:         "Must explicitly set engine if not passing in buffer or path for io."
589:     )
Why this is incorrect behaviour
I think calling seek(0) is an anti-pattern in how streams are treated. First, it ignores the current position of the stream for no apparent good reason. Second, a stream is not even required to implement seek to be considered a valid stream.
Hence you see other reports where people pass read_excel a stream that does not implement seek, for example a Requests response (#28825). That issue appears to be resolved, but it looks like it was resolved by treating the Requests response as a special case that is handled differently.
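To illustrate the second point, here is a minimal sketch using only the standard library. ForwardOnlyStream is a hypothetical class made up for this example (it is not part of pandas or Requests); it is a perfectly valid readable stream, yet seek(0) on it raises io.UnsupportedOperation.

```python
import io


class ForwardOnlyStream(io.RawIOBase):
    """Hypothetical forward-only stream, similar in spirit to a network response body."""

    def __init__(self, payload: bytes) -> None:
        super().__init__()
        self._source = io.BytesIO(payload)

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Copy as many bytes as fit into the caller-supplied buffer.
        chunk = self._source.read(len(buffer))
        buffer[: len(chunk)] = chunk
        return len(chunk)


stream = ForwardOnlyStream(b"header|payload")
first = stream.read(7)  # an earlier consumer reads the first 7 bytes

try:
    stream.seek(0)  # what an unconditional rewind would attempt
    rewound = True
except io.UnsupportedOperation:
    rewound = False

rest = stream.read()  # reading simply continues from the current position
```

A consumer that needs random access can check stream.seekable() first instead of assuming that seek exists.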
Path to solution
The first step would be to remove the seek(0) call altogether.
If load_workbook requires a seekable stream, however, the contents of the handle should be copied into a BytesIO before being passed to load_workbook. In either case, the stream should be read to the end, or until a terminating condition is met (I don't know enough about Excel internals), and then left at that position, so that a subsequent consumer can continue reading or the caller can close the stream.
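A rough sketch of the BytesIO idea, not the actual pandas implementation: buffer_remaining is a hypothetical helper name, and a small in-memory ZIP archive stands in for an .xlsx file (which is itself a ZIP container). The helper reads from the handle's current position onward and never rewinds the original handle.

```python
import io
import zipfile


def buffer_remaining(handle) -> io.BytesIO:
    """Hypothetical helper: copy everything from the current position into a seekable buffer."""
    return io.BytesIO(handle.read())


# Build a tiny ZIP archive in memory as a stand-in for a workbook.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("sheet1.xml", "<data/>")
zip_bytes = archive.getvalue()

# The caller's stream carries unrelated bytes before the archive
# and is already positioned past them.
prefix = b"unrelated-bytes"
handle = io.BytesIO(prefix + zip_bytes)
handle.seek(len(prefix))

buffered = buffer_remaining(handle)
with zipfile.ZipFile(buffered) as zf:
    names = zf.namelist()

end_position = handle.tell()  # the original handle is left at the end, not rewound
```

The trade-off is memory: the remainder of the stream is held in RAM, which may matter for large workbooks.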
As an aside, the call to close the stream should only happen if the ExcelReader is the one that opened it. Otherwise, it should be the caller's responsibility to close it.
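The ownership rule could look roughly like this. HandleOwner and its _owns_handle flag are hypothetical names for illustration, not pandas internals; the hasattr(..., "read") check mirrors the one in the excerpt above.

```python
import io


class HandleOwner:
    """Hypothetical reader that closes only the handles it opened itself."""

    def __init__(self, path_or_handle) -> None:
        if hasattr(path_or_handle, "read"):
            # The caller passed an open stream: borrow it, do not own it.
            self.handle = path_or_handle
            self._owns_handle = False
        else:
            # A path was passed: open the file here and take ownership.
            self.handle = open(path_or_handle, "rb")
            self._owns_handle = True

    def close(self) -> None:
        if self._owns_handle:
            self.handle.close()


caller_stream = io.BytesIO(b"data")
reader = HandleOwner(caller_stream)
reader.close()
still_open = not caller_stream.closed  # the caller's stream is untouched
```

With this split, a failure while loading the workbook can still call close() safely without closing a stream the caller intends to keep using.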
Expected Behavior
read_excel should start reading bytes from the current position of the stream and should not reset it.
Sure - if this can be removed and still pass the test suite I would be OK with removing seek. There might be some historical cruft behind that being in there in the first place.
Hi @WillAyd, I've been playing around (I've even removed the seek line), and this doesn't seem to be a pandas bug. If you use openpyxl to read the file, you can read it as many times as you want without reopening the workbook.
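One possible explanation, offered as an assumption rather than something confirmed in this thread: .xlsx files are ZIP archives, and Python's zipfile module locates the central directory at the end of the file, so a ZIP-based reader repositions a seekable handle on its own. The sketch below uses only the standard library, with an in-memory archive standing in for a workbook.

```python
import io
import zipfile

# Build a tiny ZIP archive in memory as a stand-in for an .xlsx file.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("xl/workbook.xml", "<workbook/>")

archive.seek(0)
with zipfile.ZipFile(archive) as first_pass:
    first_names = first_pass.namelist()
    first_body = first_pass.read("xl/workbook.xml")

# No rewind here: the handle is wherever the first pass left it.
with zipfile.ZipFile(archive) as second_pass:
    second_names = second_pass.namelist()
    second_body = second_pass.read("xl/workbook.xml")
```

If that assumption holds, removing seek(0) from pandas alone would not make a second read_excel call on the same seekable handle fail.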