
Commit eefa29f
ENH: More permissive S3 reading
When calling `get_bucket`, boto will by default try to establish that the S3 bucket exists by listing all of the keys in it. This behavior is controlled by the `validate` keyword, which defaults to `True`. If your access key does not have permission to list everything in the bucket (even if you do have permission to read the specific file you are trying to access), this generates an uninformative exception. This PR sets `validate=False`, which means boto will trust that the bucket exists and not try to check immediately. If the bucket actually does not exist, the `get_contents_as_string` call a couple of lines later raises "S3ResponseError: S3ResponseError: 404 Not Found". One of the test cases expected a failure when reading the file "s3://cant_get_it/tips.csv"; with the changes in this PR, this file is now accessible.
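The access pattern the commit relies on can be illustrated with a minimal sketch using the legacy boto (v2) API. The helper name `fetch_s3_object` and the argument names are placeholders, not part of pandas; the import is guarded so the sketch is loadable even where boto is not installed, and actually calling it requires network access:

```python
from io import BytesIO

try:
    import boto
    import boto.s3.key
except ImportError:  # boto (the legacy v2 SDK) may not be installed
    boto = None


def fetch_s3_object(bucket_name, key_path):
    """Read a single S3 object without listing the whole bucket.

    With validate=False, get_bucket() skips the initial key listing,
    so a key you can read stays accessible even when other keys in
    the same bucket are not readable by your credentials. A missing
    bucket only surfaces later, as a 404 S3ResponseError raised by
    get_contents_as_string().
    """
    if boto is None:
        raise RuntimeError("boto is required to run this sketch")
    try:
        conn = boto.connect_s3()
    except boto.exception.NoAuthHandlerFound:
        conn = boto.connect_s3(anon=True)  # fall back to anonymous access
    b = conn.get_bucket(bucket_name, validate=False)  # trust the bucket exists
    k = boto.s3.key.Key(b)
    k.key = key_path
    return BytesIO(k.get_contents_as_string())
```

For example, `fetch_s3_object('cant_get_it', 'tips.csv')` mirrors the commit's test case: the object itself is public, so the read succeeds even though listing the bucket would be forbidden.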
1 parent d25a9f3 commit eefa29f

3 files changed (+13, −3 lines)

doc/source/whatsnew/v0.17.0.txt (+2)

@@ -389,3 +389,5 @@ Bug Fixes
 - Reading "famafrench" data via ``DataReader`` results in HTTP 404 error because of the website url is changed (:issue:`10591`).

 - Bug in `read_msgpack` where DataFrame to decode has duplicate column names (:issue:`9618`)
+
+- Bug in ``io.common.get_filepath_or_buffer`` which caused reading of valid S3 files to fail if the bucket also contained keys for which the user does not have read permission (:issue:`10604`)

pandas/io/common.py (+1, −1)

@@ -151,7 +151,7 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None):
         except boto.exception.NoAuthHandlerFound:
             conn = boto.connect_s3(anon=True)

-        b = conn.get_bucket(parsed_url.netloc)
+        b = conn.get_bucket(parsed_url.netloc, validate=False)
         k = boto.s3.key.Key(b)
         k.key = parsed_url.path
         filepath_or_buffer = BytesIO(k.get_contents_as_string(

pandas/io/tests/test_parsers.py (+10, −2)

@@ -4075,16 +4075,24 @@ def test_parse_public_s3_bucket(self):
         nt.assert_false(df.empty)
         tm.assert_frame_equal(pd.read_csv(tm.get_data_path('tips.csv')), df)

+        # Read public file from bucket with not-public contents
+        df = pd.read_csv('s3://cant_get_it/tips.csv')
+        nt.assert_true(isinstance(df, pd.DataFrame))
+        nt.assert_false(df.empty)
+        tm.assert_frame_equal(pd.read_csv(tm.get_data_path('tips.csv')), df)
+
     @tm.network
     def test_s3_fails(self):
         import boto
         with tm.assertRaisesRegexp(boto.exception.S3ResponseError,
                                    'S3ResponseError: 404 Not Found'):
             pd.read_csv('s3://nyqpug/asdf.csv')

+        # Receive a permission error when trying to read a private bucket.
+        # It's irrelevant here that this isn't actually a table.
         with tm.assertRaisesRegexp(boto.exception.S3ResponseError,
-                                   'S3ResponseError: 403 Forbidden'):
-            pd.read_csv('s3://cant_get_it/tips.csv')
+                                   'S3ResponseError: 403 Forbidden'):
+            pd.read_csv('s3://cant_get_it/')


 def assert_same_values_and_dtype(res, exp):
