Skip to content

BUG: Inconsistent stream processing between read_excel and read_csv #56959

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
2 of 3 tasks
jf2 opened this issue Jan 19, 2024 · 2 comments
Open
2 of 3 tasks

BUG: Inconsistent stream processing between read_excel and read_csv #56959

jf2 opened this issue Jan 19, 2024 · 2 comments
Labels
Bug IO Excel read_excel, to_excel

Comments

@jf2
Copy link

jf2 commented Jan 19, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({
    "A": [.1, .2, .3],
    "B": ["a", "b", "c"]
})
df.to_excel("test.xlsx", index=False)
df.to_csv("test.csv", index=False)

with open("test.xlsx", "rb") as f:
    df_excel_1 = pd.read_excel(f)
    # the below line should fail or return an empty dataframe but doesn't
    df_excel_2 = pd.read_excel(f)

# instead, we have that pd.testing.assert_frame_equal(df_excel_1, df_excel_2)

with open("test.csv", "r") as f:
    df_csv_1 = pd.read_csv(f)
    # the below line will fail with EmptyDataError, as expected
    df_csv_2 = pd.read_csv(f)

Issue Description

read_excel does not honor the starting position of the stream.

The problem is likely in pandas.io.excel._base.py:580 which calls .seek(0)

576:    if isinstance(self.handles.handle, self._workbook_class):
577:        self.book = self.handles.handle
578:    elif hasattr(self.handles.handle, "read"):
579:        # N.B. xlrd.Book has a read attribute too
580:        self.handles.handle.seek(0)
581:        try:
582:            self.book = self.load_workbook(self.handles.handle, engine_kwargs)
583:        except Exception:
584:            self.close()
585:            raise
586:    else:
587:        raise ValueError(
588:            "Must explicitly set engine if not passing in buffer or path for io."
589:        )

Why this is incorrect behaviour

I think calling seek(0) is anti-pattern in how streams are treated. For one, this ignores the current position of the stream for no apparent good reason. As second, a stream is not even required to implement a seek function to be considered a valid stream.

Hence, for example, you see other reports where people are passing a stream to read_excel that does not implement seek, such as, e.g. Requests response (#28825). The latter issue appears to be resolved, but looks like it's been resolved by considering the request Response as a special case that is handled differently.

Path to solution

First would be to remove seek(0) call altogether.

If load_workbook requires the stream to be seekable, however, the handle passed to load_workbook should be wrapped into a BytesIO wrapper. In either case, the stream should be read till the end and/or until a terminating condition is met (I don't know enough about Excel internals) and then left there, and left to the subsequent consumer to either continue or to the caller to close the stream.

As an aside, the call to close the stream should only happen if the ExcelReader is the one that opened it. Otherwise, it should be the caller's responsibility to close it.

Expected Behavior

read_excel should start reading bytes from the current position of the stream and should not reset it.

Installed Versions

INSTALLED VERSIONS
------------------
commit              : a671b5a8bf5dd13fb19f0e88edc679bc9e15c673
python              : 3.11.3.final.0
python-bits         : 64
OS                  : Windows
OS-release          : 10
Version             : 10.0.19045
machine             : AMD64
processor           : Intel64 Family 6 Model 165 Stepping 5, GenuineIntel
byteorder           : little
LC_ALL              : None
LANG                : en_US.UTF-8
LOCALE              : English_United Kingdom.1252

pandas              : 2.1.4
numpy               : 1.26.3
pytz                : 2023.3.post1
dateutil            : 2.8.2
setuptools          : 65.5.0
pip                 : 23.3.2
Cython              : None
pytest              : None
hypothesis          : None
sphinx              : None
blosc               : None
feather             : None
xlsxwriter          : None
lxml.etree          : None
html5lib            : None
pymysql             : None
psycopg2            : None
jinja2              : None
IPython             : None
pandas_datareader   : None
bs4                 : None
bottleneck          : None
dataframe-api-compat: None
fastparquet         : None
fsspec              : None
gcsfs               : None
matplotlib          : None
numba               : None
numexpr             : None
odfpy               : None
openpyxl            : 3.1.2
pandas_gbq          : None
pyarrow             : None
pyreadstat          : None
pyxlsb              : None
s3fs                : None
scipy               : None
sqlalchemy          : None
tables              : None
tabulate            : None
xarray              : None
xlrd                : None
zstandard           : None
tzdata              : 2023.4
qtpy                : None
pyqt5               : None
@jf2 jf2 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 19, 2024
@WillAyd
Copy link
Member

WillAyd commented Jan 19, 2024

Sure - if this can be removed and still pass the test suite I would be OK with removing seek. There might be some historical cruft to that being in the there in the first place.

Did you want to try that in a PR @jf2?

@WillAyd WillAyd added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 19, 2024
@jmarintur
Copy link
Contributor

Hi @WillAyd, I've been playing around (I've even removed the seek line), and this doesn't seem to be a pandas bug. If you use openpyx1 to read the file, you get to read it as many times as you want without reopening the workbook:

import openpyxl
import pandas as pd

df = pd.DataFrame({
    "A": [.0, .1],
    "B": ["a", "b"]
})
df.to_excel("test.xlsx", index=False)

wb = openpyxl.load_workbook("test.xlsx")
sheet = wb.active

rows_1 = list(sheet.values)
rows_2 = list(sheet.values)
rows_3 = list(sheet.values)

Could this be a feature on how excel files are handled? Wdyt?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel
Projects
None yet
Development

No branches or pull requests

3 participants