Mishandled exception when trying to access nonexistent files in a private S3 bucket using several pandas.read_* methods #32470
Labels: Error Reporting, IO Data
Problem description
When trying to access a non-existing file in a private S3 bucket, using appropriate credentials, pandas will throw a ConnectionTimeoutError instead of a FileNotFoundError.
I tested the following read_* methods and got the same behavior in all of them: read_csv, read_excel, read_fwf, read_json, read_msgpack, and read_parquet.
You need the right credentials and a private bucket to reproduce it. The s3fs library implicitly reads credentials from environment variables if they are named properly (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).
Code to reproduce the issue
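A minimal sketch, roughly along these lines (the bucket name and key below are placeholders; any private bucket your credentials can reach, plus a key that does not exist in it, will do):

```python
import pandas as pd

# Assumes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set in the
# environment and grant access to the private bucket named below.
# The key points to a file that does not exist in that bucket.
df = pd.read_csv("s3://my-private-bucket/does-not-exist.csv")
# Expected: a FileNotFoundError, raised quickly.
# Observed: a long wait followed by a connection timeout error.
```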
Root Cause
I checked the source files and found the following excerpt in pandas/io/s3.py:
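The excerpt itself did not carry over here; below is an approximate, from-memory sketch of the relevant logic in pandas/io/s3.py around pandas 1.0.x. Names such as get_file_and_filesystem and _strip_schema, and the simplified helper body, may not match the actual source exactly.

```python
import s3fs
from botocore.exceptions import NoCredentialsError


def _strip_schema(url):
    # Drop the "s3://" prefix, leaving "bucket/key" (simplified here).
    return url[len("s3://"):] if url.startswith("s3://") else url


def get_file_and_filesystem(filepath_or_buffer, mode="rb"):
    fs = s3fs.S3FileSystem(anon=False)
    try:
        file = fs.open(_strip_schema(filepath_or_buffer), mode)
    except (FileNotFoundError, NoCredentialsError):
        # boto3 has trouble accessing public files when credentials are
        # set, so retry the request anonymously.
        fs = s3fs.S3FileSystem(anon=True)
        file = fs.open(_strip_schema(filepath_or_buffer), mode)
    return file, fs
```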
Pandas is trying to work around a boto3 issue: when it receives a FileNotFoundError or NoCredentialsError, it switches the connection credentials to anonymous (note the anon=True argument when the fs object is created from the S3FileSystem class) and tries again.
This helps when accessing public buckets/files while using S3 credentials, as the comment states. However, when anonymously trying to access a non-existing file in a private bucket, S3 will not return a file-not-found response; it will let the connection time out. They probably do that for security reasons, to avoid disclosing whether private files exist to callers without proper credentials.
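To see the anonymous fallback in isolation, one can attempt the same open that pandas retries with; per the report above, against a nonexistent key in a private bucket this call hangs until the connection times out rather than raising FileNotFoundError (bucket and key are the same placeholders as before):

```python
import s3fs

# The anonymous retry that pandas falls back to. For a nonexistent key in
# a private bucket, S3 does not report "not found" to anonymous callers;
# per the report above, the call eventually fails with a timeout.
fs = s3fs.S3FileSystem(anon=True)
fs.open("my-private-bucket/does-not-exist.csv", "rb")
```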
Why this is an issue
Instead of quickly throwing a FileNotFoundError, pandas takes a while and then throws a ConnectionTimeoutError.
Double check the cause
Just to make sure this is the cause, I tried the code below. However, I don't fully understand all the implications, so I can't say whether it is an acceptable solution.
I simply test whether the credentials can access the parent directory. If they can, I conclude this is not a credentials issue and raise a FileNotFoundError instead, as sketched below.
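The code actually tried is not reproduced here; the following is a minimal sketch of the idea, written as a modification of the excerpt above (it reuses the hypothetical _strip_schema helper): before falling back to anonymous access, probe the parent prefix with the original credentials, and if that succeeds, report the file as missing.

```python
import os

import s3fs
from botocore.exceptions import NoCredentialsError


def get_file_and_filesystem(filepath_or_buffer, mode="rb"):
    path = _strip_schema(filepath_or_buffer)
    fs = s3fs.S3FileSystem(anon=False)
    try:
        file = fs.open(path, mode)
    except (FileNotFoundError, NoCredentialsError):
        # If the given credentials can list the parent prefix, this is not
        # a credentials problem: the file simply does not exist, so raise
        # FileNotFoundError right away instead of retrying anonymously.
        try:
            fs.ls(os.path.dirname(path))
        except Exception:
            # Credentials really can't see the bucket; keep the old fallback.
            fs = s3fs.S3FileSystem(anon=True)
            file = fs.open(path, mode)
        else:
            raise FileNotFoundError(filepath_or_buffer)
    return file, fs
```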
Expected Output
Pandas read methods throw a FileNotFoundError when the file doesn't exist in a private S3 bucket.
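For instance, with the same placeholder bucket and key as above, the following should fail fast:

```python
import pandas as pd

try:
    pd.read_csv("s3://my-private-bucket/does-not-exist.csv")
except FileNotFoundError:
    print("missing file reported immediately")  # the desired behavior
```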
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.6.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 39.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : None
pyxlsb : None
s3fs : 0.4.0
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None