Mishandling exception when trying to access inexistent files at private S3 bucket using several pandas.read_ methods #32470

marlosb · 2020-03-05T21:06:26Z

Problem description

When trying to access a non-existent file in a private S3 bucket with appropriate credentials, pandas throws a ConnectionTimeoutError instead of a FileNotFoundError.
I tested the following read_ methods and got the same behavior in all of them: read_csv, read_excel, read_fwf, read_json, read_msgpack, read_parquet.

You need valid credentials and a private bucket to reproduce it. The s3fs library implicitly reads credentials from environment variables if they are named properly (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).
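Before attempting to reproduce, it may help to confirm that both variables are visible to the Python process. The snippet below is only a sketch: `missing_aws_env_vars` is a hypothetical helper (not part of pandas or s3fs), and it checks presence only, not whether the credentials are valid for the bucket. Credentials can also come from other sources such as AWS config files, so an unset variable does not by itself mean no credentials are available.

```python
import os

# The two variable names mentioned above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("Both credential variables are set.")
```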

Code to reproduce the issue

import pandas
import s3fs

fs = s3fs.S3FileSystem()
file_name = 's3://<my_bucket>/<my_file>'
pandas.read_csv(file_name)

Root Cause

I checked source files and found the following excerpt from pandas/io/s3.py

    fs = s3fs.S3FileSystem(anon=False)
    try:
        filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
    except (compat.FileNotFoundError, NoCredentialsError):
        # boto3 has troubles when trying to access a public file
        # when credentialed...
        # An OSError is raised if you have credentials, but they
        # aren't valid for that bucket.
        # A NoCredentialsError is raised if you don't have creds
        # for that bucket.
        fs = s3fs.S3FileSystem(anon=True)
        filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
    return filepath_or_buffer, None, compression, True

Pandas is trying to work around a boto3 issue: when it receives a FileNotFoundError or NoCredentialsError, it changes the connection to anonymous (see the anon=True argument when creating the fs object from class S3FileSystem) and tries again.
This helps when accessing public buckets/files while using S3 credentials, as the comment states. However, when trying to access a non-existent file in a private bucket, S3 will not return a file-not-found error; it will let the connection time out. They probably do that for security reasons, so as not to disclose the existence of private files without proper credentials.
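The control flow described above can be simulated without touching S3. This is a sketch under stated assumptions: `FakeCredentialedFS` and `FakeAnonymousFS` are hypothetical stand-ins for `S3FileSystem(anon=False)` and `S3FileSystem(anon=True)`, the anonymous attempt is assumed to end in a timeout as reported, and the built-in `TimeoutError` stands in for whatever exception the real stack raises. The `NoCredentialsError` branch is left out to keep the example self-contained.

```python
class FakeCredentialedFS:
    """Stand-in for S3FileSystem(anon=False): the key does not exist."""

    def open(self, path, mode="rb"):
        raise FileNotFoundError(path)


class FakeAnonymousFS:
    """Stand-in for S3FileSystem(anon=True) against a private bucket.

    Assumption: the anonymous request ends in a timeout, as described above.
    """

    def open(self, path, mode="rb"):
        raise TimeoutError("connection timed out: " + path)


def open_with_fallback(path, mode="rb"):
    """Mirror the try/except shape of the excerpt, using the fakes."""
    try:
        return FakeCredentialedFS().open(path, mode)
    except FileNotFoundError:
        # The FileNotFoundError is swallowed here; whatever the anonymous
        # retry raises is what the caller ends up seeing.
        return FakeAnonymousFS().open(path, mode)


def surfaced_error_name(path):
    """Return the class name of the exception the caller would see."""
    try:
        open_with_fallback(path)
    except Exception as exc:
        return type(exc).__name__
    return None


print(surfaced_error_name("my_bucket/missing.csv"))
```

In this simulation the caller sees the timeout from the retry rather than the more informative FileNotFoundError from the first attempt, which matches the behavior reported here.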

Why this is an issue

Instead of quickly throwing a FileNotFoundError, it takes a while and then throws a ConnectionTimeoutError.

Double check the cause

Just to make sure this is the cause, I tried the code below. However, I don't understand all the implications well enough to judge whether it is an acceptable solution.

    fs = s3fs.S3FileSystem(anon=False)
    try:
        filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
    except (compat.FileNotFoundError, NoCredentialsError):
        # boto3 has troubles when trying to access a public file
        # when credentialed...
        # An OSError is raised if you have credentials, but they
        # aren't valid for that bucket.
        # A NoCredentialsError is raised if you don't have creds
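        # for that bucket.
        # If the parent path is reachable with these credentials, this is
        # not a credentials problem: the file itself is missing.
        if fs.exists(filepath_or_buffer[0:filepath_or_buffer.rfind('/')]):
            raise FileNotFoundError(filepath_or_buffer, 'not found')
        else: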
            fs = s3fs.S3FileSystem(anon=True)
            filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
    return filepath_or_buffer, None, compression, True

I just test whether the credentials can access the parent directory. If they can, it concludes this is not a credentials issue and throws a FileNotFoundError.
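The parent-directory check can be isolated and exercised on its own. The following is a minimal sketch, not the pandas implementation: `parent_of`, `FakeFS`, and `should_raise_not_found` are hypothetical names, and `FakeFS` only mimics an `exists()` method backed by a set of paths. One detail worth noting is that slicing with `rfind('/')` drops the last character when the path contains no '/', because `rfind` returns -1; `rpartition` avoids that edge case.

```python
def parent_of(path):
    """Return everything before the last '/', or '' if there is none."""
    head, sep, _ = path.rpartition("/")
    return head if sep else ""


class FakeFS:
    """Hypothetical stand-in exposing only exists(), backed by a set."""

    def __init__(self, visible_paths):
        self._visible = set(visible_paths)

    def exists(self, path):
        return path in self._visible


def should_raise_not_found(fs, path):
    """True when the parent is reachable, so the file itself is missing."""
    parent = parent_of(path)
    return bool(parent) and fs.exists(parent)


if __name__ == "__main__":
    fs = FakeFS({"s3://my_bucket/data"})
    print(should_raise_not_found(fs, "s3://my_bucket/data/file.csv"))
    print(should_raise_not_found(fs, "s3://other_bucket/data/file.csv"))
```

When the function returns True the caller would raise FileNotFoundError immediately; otherwise it would fall through to the anonymous retry as before.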

Expected Output

Pandas read methods throw FileNotFoundError when the file doesn't exist in a private S3 bucket.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 39.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : None
pyxlsb : None
s3fs : 0.4.0
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


marlosb · 2020-03-05T22:07:27Z

Just to add a few more details: I see that this commit (d6fe194#diff-a37b395bed03f0404dec864a4529c97d) added FileNotFoundError to the error types being handled. I don't know the reason, so I don't know whether it can be rolled back.

alimcmaster1 · 2020-04-21T01:37:05Z

I can't actually reproduce this.

As per your description, with a non-existent file in a private S3 bucket that I have appropriate credentials for:

import pandas as pd
pd.read_csv("s3://dev/this_file_doesnt_exist.csv")

consistently throws a PermissionError for me.

Anything else I can do to repro, @marlosb?

pandas==1.0.3
s3fs==0.4.2

marlosb · 2020-04-21T20:40:17Z

Thanks for self-assigning and getting in touch, Alistair.

I couldn't reproduce the problem either.
I'm using the same versions of all libraries as in the report. The only difference is that I'm now at home, whereas when I hit the problem I was on a corporate network, behind a proxy server, which is something I can't reproduce now. I tried an external proxy server without success, but I'm not sure it behaves the same way. Anyway, if the problem happens only on a specific network, I would say it is a network fault.

Also, maybe the S3 behavior has changed since the report was submitted. I'm not sure how likely that is.

I'm sorry for submitting a report that I can't reproduce. Next time I'll test from several networks before submitting.

alimcmaster1 · 2020-04-21T23:08:35Z

No worries @marlosb - thanks for following up, and feel free to reopen if required.

TomAugspurger mentioned this issue Mar 11, 2020

to_csv fails silently with s3fs #32486

Closed

This was referenced Apr 18, 2020

to_parquet swallows NoCredentialsError #27679

Closed

IO: Fix S3 Error Handling #33645

Merged

jreback added the "IO Data" label (IO issues that don't fit into a more specific label) Apr 20, 2020

jreback added this to the 1.1 milestone Apr 20, 2020

jreback added the "Error Reporting" label (Incorrect or improved errors from pandas) Apr 20, 2020

alimcmaster1 self-assigned this Apr 21, 2020

marlosb closed this as completed Apr 21, 2020
