BUG: read_excel throws FileNotFoundError with s3fs objects #38788
Comments
Thank you for your report! Is there a public Excel file on S3 so that I can test it quickly (edit: any public S3 file should be sufficient)? I assume that affects most
@twoertwein you can also test it with the mocked s3 filesystem used in the tests. I can reproduce the error with:
--- a/pandas/tests/io/excel/test_readers.py
+++ b/pandas/tests/io/excel/test_readers.py
@@ -645,7 +645,7 @@ class TestReaders:
         local_table = pd.read_excel("test1" + read_ext)
         tm.assert_frame_equal(url_table, local_table)
-    @td.skip_if_not_us_locale
     def test_read_from_s3_url(self, read_ext, s3_resource, s3so):
         # Bucket "pandas-test" created in tests/io/conftest.py
         with open("test1" + read_ext, "rb") as f:
@@ -657,6 +657,21 @@ class TestReaders:
         local_table = pd.read_excel("test1" + read_ext)
         tm.assert_frame_equal(url_table, local_table)
+    def test_read_from_s3fs_object(self, read_ext, s3_resource, s3so):
+        # Bucket "pandas-test" created in tests/io/conftest.py
+        with open("test1" + read_ext, "rb") as f:
+            s3_resource.Bucket("pandas-test").put_object(Key="test1" + read_ext, Body=f)
+
+        import s3fs
+        s3 = s3fs.S3FileSystem(**s3so)
+
+        with s3.open("s3://pandas-test/test1" + read_ext) as f:
+            url_table = pd.read_excel(f)
+
+        local_table = pd.read_excel("test1" + read_ext)
+        tm.assert_frame_equal(url_table, local_table)
+
The issue is: pandas/pandas/io/excel/_base.py Line 1063 in e85d078
This issue should be limited to Excel.
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample
I get a FileNotFoundError when running the following code:
Problem description
I should be able to read in the file-like object from s3fs when using pd.read_excel or pd.ExcelFile. Pandas 1.1.x allows for this, but it looks like changes to pd.io.common.get_handle in 1.2 have made this impossible. The simple workaround for this is to just use the S3 URI instead of using s3fs to open it first, but to my knowledge, the ability to use read_excel with an s3fs object was not intended to be deprecated in 1.2.

My noob guess on what's going wrong
I'm new to contributing to open source projects, so I don't know exactly how to fix this, but it looks like the issue is that the pd.io.common.get_handle method in 1.2 thinks the s3fs object is a file handle rather than a file-like buffer. To solve this, I would think something similar to the need_text_wrapping boolean option from the get_handle method in 1.1.x needs to be added to 1.2's get_handle in order to tell pandas that the s3fs object needs a TextIOWrapper rather than treating it like a local file handle.

If someone could give me a little guidance on how to fix this, I'd be happy to give my first open-source contribution a go, but if that's not really how this works, I understand.
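
In the meantime, a rough sketch of a second workaround, alongside the S3 URI one mentioned in the problem description: copy the s3fs file object into an in-memory io.BytesIO before handing it to pandas, so that pandas only ever sees an ordinary bytes buffer. This is not the fix inside pandas. The helper name to_bytes_buffer and the bucket and key names in the comments are made up for illustration, and it assumes the workbook fits in memory.

```python
import io
from typing import BinaryIO


def to_bytes_buffer(fobj: BinaryIO) -> io.BytesIO:
    """Copy a binary file-like object into an in-memory buffer.

    Hypothetical helper, not part of pandas or s3fs. It reads the whole
    object, so it is only suitable for files that fit in memory.
    """
    buf = io.BytesIO(fobj.read())
    buf.seek(0)  # make sure the reader starts at the beginning
    return buf


# Stand-in for an s3fs file object, so the sketch runs without S3 access.
fake_s3_file = io.BytesIO(b"placeholder bytes, not a real workbook")
buffer = to_bytes_buffer(fake_s3_file)

# With a real bucket (placeholder names), the usage would look like:
#
#   import pandas as pd
#   import s3fs
#
#   fs = s3fs.S3FileSystem()
#   with fs.open("s3://my-bucket/my-file.xlsx", "rb") as f:
#       df = pd.read_excel(to_bytes_buffer(f))
#
#   # or skip s3fs and pass the URI directly:
#   df = pd.read_excel("s3://my-bucket/my-file.xlsx")
```

Buffering costs one extra copy of the file in memory, so passing the S3 URI directly is probably the better choice for large workbooks.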
Expected Output
<class 'pandas.core.frame.DataFrame'>
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 3e89b4c
python : 3.8.0.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-197-generic
Version : #229-Ubuntu SMP Wed Nov 25 11:05:42 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.0
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.3.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : 0.5.2
scipy : 1.5.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None