Mishandling exception when trying to access inexistent files at private S3 bucket using several pandas.read_ methods #32470

Closed
marlosb opened this issue Mar 5, 2020 · 4 comments
Labels: Error Reporting (incorrect or improved errors from pandas), IO Data (IO issues that don't fit into a more specific label)

marlosb commented Mar 5, 2020

Problem description

When trying to access a non-existent file in a private S3 bucket with appropriate credentials, pandas throws a ConnectionTimeoutError instead of a FileNotFoundError.
I tested the following read_ methods and got the same behavior in all of them: read_csv, read_excel, read_fwf, read_json, read_msgpack, read_parquet.

To reproduce it, you need valid credentials and must access a private bucket. The s3fs library implicitly reads credentials from environment variables if they are named properly (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).
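
As a quick sanity check before reproducing, the sketch below reports which of the two variables are missing. The helper name `missing_aws_env_vars` is hypothetical, and the check only inspects the environment; it does not validate the keys against AWS.

```python
import os

# The two variable names mentioned in the report.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env_vars()
    if missing:
        print("Missing credentials in environment:", ", ".join(missing))
    else:
        print("Both credential variables are set.")
```

Passing a plain dict instead of `os.environ` makes the helper easy to try without touching the real environment.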

Code to reproduce the issue

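import pandas
import s3fs

# s3fs picks up the credentials from the environment variables named above
fs = s3fs.S3FileSystem()
# placeholder path: a private bucket and a file that does not exist in it
file_name = 's3://<my_bucket>/<my_file>'
pandas.read_csv(file_name)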

Root Cause

I checked source files and found the following excerpt from pandas/io/s3.py

    fs = s3fs.S3FileSystem(anon=False)
    try:
        filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
    except (compat.FileNotFoundError, NoCredentialsError):
        # boto3 has troubles when trying to access a public file
        # when credentialed...
        # An OSError is raised if you have credentials, but they
        # aren't valid for that bucket.
        # A NoCredentialsError is raised if you don't have creds
        # for that bucket.
        fs = s3fs.S3FileSystem(anon=True)
        filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
    return filepath_or_buffer, None, compression, True

Pandas is trying to work around a boto3 issue: when it receives a FileNotFoundError or NoCredentialsError, it changes the connection credentials to anonymous (see the anon=True argument when creating the fs object from the S3FileSystem class) and tries again.
As the comment states, this helps when accessing public buckets/files while using S3 credentials. However, when trying to access a non-existent file in a private bucket, S3 will not return a "file not found" error; it lets the connection time out. They probably do that for security reasons, so as not to disclose whether a private file exists to someone without proper credentials.
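
The masking effect described here can be illustrated without any network access. The sketch below uses two hypothetical stand-in classes (`CredentialedFS` and `AnonymousFS`, not the real s3fs classes) and assumes, as the report does, that the anonymous retry ends in a timeout; the built-in `TimeoutError` stands in for whatever timeout exception the real stack raises.

```python
class CredentialedFS:
    """Stand-in for s3fs.S3FileSystem(anon=False); not the real class."""

    def open(self, path, mode="rb"):
        # With valid credentials, a missing key is reported immediately.
        raise FileNotFoundError(path)


class AnonymousFS:
    """Stand-in for s3fs.S3FileSystem(anon=True); not the real class."""

    def open(self, path, mode="rb"):
        # Assumption from the report: the anonymous request against a
        # private bucket ends in a timeout rather than a "not found".
        raise TimeoutError(f"connection timed out for {path}")


def open_like_pandas(path, mode="rb"):
    """Mirror the try/except structure of the excerpt above."""
    try:
        return CredentialedFS().open(path, mode)
    except FileNotFoundError:
        # The first error is set aside and the request is retried
        # anonymously, so the caller only sees the second failure.
        return AnonymousFS().open(path, mode)


def observed_error(path):
    """Return the name of the exception type the caller ends up seeing."""
    try:
        open_like_pandas(path)
    except Exception as exc:
        return type(exc).__name__
    return None


if __name__ == "__main__":
    print(observed_error("my_bucket/missing.csv"))  # TimeoutError
```

The caller never sees the original FileNotFoundError except as the chained context of the timeout, which matches the symptom in the report.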

Why this is an issue

Instead of quickly throwing a FileNotFoundError, it takes a while and then throws a ConnectionTimeoutError.

Double check the cause

Just to make sure this is the cause, I tried the code below. However, I don't understand all the implications well enough to judge whether it is an acceptable solution.

    fs = s3fs.S3FileSystem(anon=False)
    try:
        filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
    except (compat.FileNotFoundError, NoCredentialsError):
        # boto3 has troubles when trying to access a public file
        # when credentialed...
        # An OSError is raised if you have credentials, but they
        # aren't valid for that bucket.
        # A NoCredentialsError is raised if you don't have creds
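        # for that bucket.
        # Check added for this test: if the current credentials can see
        # the parent "directory", treat the failure as a missing file.
        parent = filepath_or_buffer[0:filepath_or_buffer.rfind('/')]
        if fs.exists(parent):
            raise FileNotFoundError(f'{filepath_or_buffer} not found')
        # Otherwise fall back to anonymous access, as before.
        fs = s3fs.S3FileSystem(anon=True)
        filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)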
    return filepath_or_buffer, None, compression, True

I simply test whether the credentials can access the parent directory. If they can, I conclude this is not a credentials issue and throw a FileNotFoundError.
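
The same decision logic can be exercised offline. This is a minimal sketch under stated assumptions: `FakeFS`, `parent_prefix`, and `open_with_parent_check` are hypothetical names, and `FakeFS` only imitates the `exists`/`open` calls used in the patch. Note that `rsplit` returns the whole path when there is no `/`, which differs slightly from the slicing in the patch.

```python
def parent_prefix(path):
    """Return everything before the last '/' (the "directory" of a key)."""
    return path.rsplit("/", 1)[0]


class FakeFS:
    """Hypothetical stand-in exposing the exists/open calls used above."""

    def __init__(self, visible_paths):
        self.visible_paths = set(visible_paths)

    def exists(self, path):
        return path in self.visible_paths

    def open(self, path, mode="rb"):
        if path not in self.visible_paths:
            raise FileNotFoundError(path)
        return f"<handle for {path}>"


def open_with_parent_check(path, cred_fs, anon_fs, mode="rb"):
    """Try credentialed access, then decide between 'missing' and 'retry'."""
    try:
        return cred_fs.open(path, mode)
    except FileNotFoundError:
        if cred_fs.exists(parent_prefix(path)):
            # The credentials can see the parent, so the file is missing.
            raise FileNotFoundError(f"{path} not found") from None
        # Otherwise assume a credentials mismatch and retry anonymously.
        return anon_fs.open(path, mode)
```

With a credentialed `FakeFS` that can see `my_bucket/data`, opening `my_bucket/data/missing.csv` raises FileNotFoundError right away, while a path under a bucket it cannot see still goes through the anonymous fallback.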

Expected Output

Pandas read methods throw a FileNotFoundError when the file doesn't exist in a private S3 bucket.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 39.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : None
pyxlsb : None
s3fs : 0.4.0
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@marlosb marlosb changed the title Mishandling exception when trying to access inexistent private files at S3 using several pandas.read_ methods Mishandling exception when trying to access inexistent files at s private S3 bucket using several pandas.read_ methods Mar 5, 2020
@marlosb marlosb changed the title Mishandling exception when trying to access inexistent files at s private S3 bucket using several pandas.read_ methods Mishandling exception when trying to access inexistent files at private S3 bucket using several pandas.read_ methods Mar 5, 2020

marlosb commented Mar 5, 2020

Just to add a few more details: I see that this commit (d6fe194#diff-a37b395bed03f0404dec864a4529c97d) made FileNotFoundError the error type to be handled. I don't know the reason for that change, so I don't know whether it can be rolled back.

@jreback jreback added the IO Data IO issues that don't fit into a more specific label label Apr 20, 2020
@jreback jreback added this to the 1.1 milestone Apr 20, 2020
@jreback jreback added the Error Reporting Incorrect or improved errors from pandas label Apr 20, 2020
alimcmaster1 (Member) commented

I can't actually reproduce this.

Following your description, I tried a non-existent file on a private S3 bucket that I have appropriate credentials for:

import pandas as pd
pd.read_csv("s3://dev/this_file_doesnt_exist.csv")

This consistently throws a PermissionError for me.

Is there anything else I can do to reproduce it, @marlosb?

pandas==1.0.3
s3fs==0.4.2
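
Since the two environments surface different exceptions, calling code may want to tell them apart. The sketch below is only an illustration: `describe_read_failure` is a hypothetical helper, and the built-in `TimeoutError` is an assumption standing in for the timeout seen in the original report, which may be a library-specific exception type instead.

```python
def describe_read_failure(exc):
    """Map an exception from a read attempt to a short diagnosis."""
    # FileNotFoundError, PermissionError and TimeoutError are all OSError
    # subclasses, so the specific types are tested before OSError itself.
    if isinstance(exc, FileNotFoundError):
        return "missing file"
    if isinstance(exc, PermissionError):
        return "access denied"
    if isinstance(exc, TimeoutError):
        return "network timeout"
    if isinstance(exc, OSError):
        return "other I/O error"
    return "unexpected error"


if __name__ == "__main__":
    for error in (FileNotFoundError("x"), PermissionError("x"), TimeoutError("x")):
        print(type(error).__name__, "->", describe_read_failure(error))
```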

@alimcmaster1 alimcmaster1 self-assigned this Apr 21, 2020

marlosb commented Apr 21, 2020

Thanks for self-assigning and getting in touch, Alistair.

I couldn't reproduce the problem either.
I'm using the same versions of all libraries as in the report. The only difference is that I'm now at home, whereas when I faced the problem I was on a corporate network, behind a proxy server, which is something I can't reproduce now. I tried an external proxy server without success, but I'm not sure it behaves the same way. Anyway, if the problem happens only on a specific network, I would say the network is at fault.
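
When comparing networks, it can help to record which proxy settings Python detects on each machine. The sketch below uses the standard library's `urllib.request.getproxies()`; the wrapper name `detected_proxies` is hypothetical, and whether s3fs and its underlying HTTP stack honor exactly the same settings is not established in this thread.

```python
import urllib.request


def detected_proxies():
    """Return the proxy settings Python's standard library would pick up."""
    # getproxies() reads the *_proxy environment variables and, when none
    # are set, falls back to system proxy settings on Windows and macOS.
    return dict(urllib.request.getproxies())


if __name__ == "__main__":
    proxies = detected_proxies()
    if proxies:
        for scheme, url in sorted(proxies.items()):
            print(f"{scheme}: {url}")
    else:
        print("No proxy settings detected.")
```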

Also, maybe the S3 behavior has changed since the report was submitted, though I'm not sure how likely that is.

I'm sorry for submitting a report that I can't reproduce. Next time I'll test from several networks before submitting.

@marlosb marlosb closed this as completed Apr 21, 2020
alimcmaster1 (Member) commented

No worries @marlosb, thanks for following up, and feel free to reopen if required.
