COMPAT/REF: Use s3fs for s3 IO #13137
Closed
@@ -11,7 +11,7 @@ sqlalchemy
 lxml=3.2.1
 scipy
 xlsxwriter
-boto
+s3fs
 bottleneck
 html5lib
 beautiful-soup

@@ -11,7 +11,7 @@ sqlalchemy=0.9.6
 lxml=3.2.1
 scipy
 xlsxwriter=0.4.6
-boto=2.36.0
+s3fs
 bottleneck
 psycopg2=2.5.2
 patsy

@@ -13,7 +13,7 @@ numexpr
 pytables
 sqlalchemy
 lxml
-boto
+s3fs
 bottleneck
 psycopg2
 pymysql

@@ -17,7 +17,7 @@ sqlalchemy
 pymysql
 psycopg2
 xarray
-boto
+s3fs

 # incompat with conda ATM
 # beautiful-soup

@@ -12,7 +12,7 @@ matplotlib
 jinja2
 bottleneck
 xarray
-boto
+s3fs

 # incompat with conda ATM
 # beautiful-soup

@@ -107,12 +107,12 @@ Other enhancements
 - ``pandas.io.json.json_normalize()`` gained the option ``errors='ignore'|'raise'``; the default is ``errors='raise'`` which is backward compatible. (:issue:`14583`)



 .. _whatsnew_0200.api_breaking:

 Backwards incompatible API changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


 .. _whatsnew.api_breaking.index_map

 Map on Index types now return other Index types
@@ -181,8 +181,16 @@ Map on Index types now return other Index types

 s.map(lambda x: x.hour)

+.. _whatsnew_0200.s3:

+S3 File Handling
+^^^^^^^^^^^^^^^^

-.. _whatsnew_0200.api:
+pandas now uses `s3fs <http://s3fs.readthedocs.io/>`_ for handling S3 connections. This shouldn't break
Review comment: Do we need a doc section itself somewhere, so that we can put a URL in various docstrings, e.g. read_csv? (Can always do as a follow-up.)
+any code. However, since s3fs is not a required dependency, you will need to install it separately (like boto
+in prior versions of pandas) (:issue:`11915`).

+.. _whatsnew_0200.api:

 - ``CParserError`` has been renamed to ``ParserError`` in ``pd.read_csv`` and will be removed in the future (:issue:`12665`)
 - ``SparseArray.cumsum()`` and ``SparseSeries.cumsum()`` will now always return ``SparseArray`` and ``SparseSeries`` respectively (:issue:`12855`)
@@ -193,7 +201,6 @@ Map on Index types now return other Index types
 Other API Changes
 ^^^^^^^^^^^^^^^^^


 .. _whatsnew_0200.deprecations:

 Deprecations

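Because the release note above makes s3fs an optional dependency, code that relies on S3 support may want to fail early with a clear message. The following is a minimal sketch of that idea, not the pandas implementation: `require_optional` is a hypothetical helper name, and the only library call assumed is the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def require_optional(module_name, purpose):
    """Raise ImportError if the optional module `module_name` is missing.

    Hypothetical helper for illustration; `purpose` completes the error
    message, e.g. "handle s3 files".
    """
    # find_spec returns None when a top-level module cannot be located,
    # and it does so without importing the module itself.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            "The {} library is required to {}".format(module_name, purpose)
        )
```

Calling `require_optional("s3fs", "handle s3 files")` before touching an `s3://` path would surface the missing dependency up front; installing the package separately (for example with `pip install s3fs`) is one way to satisfy the check.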
@@ -1,110 +1,35 @@
 """ s3 support for remote file interactivity """
-
-import os
 from pandas import compat
-from pandas.compat import BytesIO
-
 try:
-    import boto
-    from boto.s3 import key
+    import s3fs
+    from botocore.exceptions import NoCredentialsError
 except:
-    raise ImportError("boto is required to handle s3 files")
+    raise ImportError("The s3fs library is required to handle s3 files")

 if compat.PY3:
     from urllib.parse import urlparse as parse_url
 else:
     from urlparse import urlparse as parse_url


-class BotoFileLikeReader(key.Key):
-    """boto Key modified to be more file-like
-
-    This modification of the boto Key will read through a supplied
-    S3 key once, then stop. The unmodified boto Key object will repeatedly
-    cycle through a file in S3: after reaching the end of the file,
-    boto will close the file. Then the next call to `read` or `next` will
-    re-open the file and start reading from the beginning.
-
-    Also adds a `readline` function which will split the returned
-    values by the `\n` character.
-    """
-
-    def __init__(self, *args, **kwargs):
-        encoding = kwargs.pop("encoding", None)  # Python 2 compat
-        super(BotoFileLikeReader, self).__init__(*args, **kwargs)
-        # Add a flag to mark the end of the read.
-        self.finished_read = False
-        self.buffer = ""
-        self.lines = []
-        if encoding is None and compat.PY3:
-            encoding = "utf-8"
-        self.encoding = encoding
-        self.lines = []
-
-    def next(self):
-        return self.readline()
-
-    __next__ = next
-
-    def read(self, *args, **kwargs):
-        if self.finished_read:
-            return b'' if compat.PY3 else ''
-        return super(BotoFileLikeReader, self).read(*args, **kwargs)
-
-    def close(self, *args, **kwargs):
-        self.finished_read = True
-        return super(BotoFileLikeReader, self).close(*args, **kwargs)
-
-    def seekable(self):
-        """Needed for reading by bz2"""
-        return False
-
-    def readline(self):
-        """Split the contents of the Key by '\n' characters."""
-        if self.lines:
-            retval = self.lines[0]
-            self.lines = self.lines[1:]
-            return retval
-        if self.finished_read:
-            if self.buffer:
-                retval, self.buffer = self.buffer, ""
-                return retval
-            else:
-                raise StopIteration
-
-        if self.encoding:
-            self.buffer = "{}{}".format(
-                self.buffer, self.read(8192).decode(self.encoding))
-        else:
-            self.buffer = "{}{}".format(self.buffer, self.read(8192))
-
-        split_buffer = self.buffer.split("\n")
-        self.lines.extend(split_buffer[:-1])
-        self.buffer = split_buffer[-1]
-
-        return self.readline()
+def _strip_schema(url):
+    """Returns the url without the s3:// part"""
+    result = parse_url(url)
+    return result.netloc + result.path


 def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
                            compression=None):
-
-    # Assuming AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_S3_HOST
-    # are environment variables
-    parsed_url = parse_url(filepath_or_buffer)
-    s3_host = os.environ.get('AWS_S3_HOST', 's3.amazonaws.com')
-
+    fs = s3fs.S3FileSystem(anon=False)
     try:
-        conn = boto.connect_s3(host=s3_host)
-    except boto.exception.NoAuthHandlerFound:
-        conn = boto.connect_s3(host=s3_host, anon=True)
-
-    b = conn.get_bucket(parsed_url.netloc, validate=False)
-    if compat.PY2 and compression:
-        k = boto.s3.key.Key(b, parsed_url.path)
-        filepath_or_buffer = BytesIO(k.get_contents_as_string(
-            encoding=encoding))
-    else:
-        k = BotoFileLikeReader(b, parsed_url.path, encoding=encoding)
-        k.open('r')  # Expose read errors immediately
-        filepath_or_buffer = k
+        filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer))
+    except (OSError, NoCredentialsError):
+        # boto3 has troubles when trying to access a public file
+        # when credentialed...
+        # An OSError is raised if you have credentials, but they
+        # aren't valid for that bucket.
+        # A NoCredentialsError is raised if you don't have creds
+        # for that bucket.
+        fs = s3fs.S3FileSystem(anon=True)
+        filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer))
     return filepath_or_buffer, None, compression
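
The behaviour introduced by this diff can be sketched in isolation: strip the `s3://` scheme, try a credentialed open, and fall back to anonymous access if that fails. This is a sketch under stated assumptions, not the code from the PR. `strip_schema`, `open_with_anon_fallback`, `make_fs` and `PublicOnlyFS` are hypothetical names, the filesystem is injected so the example runs without s3fs or network access, and only `OSError` is retried by default (the diff also catches botocore's `NoCredentialsError`).

```python
from urllib.parse import urlparse


def strip_schema(url):
    """Return 'bucket/key' for an 's3://bucket/key' URL."""
    result = urlparse(url)
    return result.netloc + result.path


def open_with_anon_fallback(url, make_fs, retry_on=(OSError,)):
    """Open `url` with credentials first, then retry anonymously.

    `make_fs` is a hypothetical factory: make_fs(anon) must return an
    object exposing an `open(path)` method.
    """
    path = strip_schema(url)
    try:
        # First attempt: credentialed access.
        return make_fs(False).open(path)
    except retry_on:
        # Fallback: anonymous access, e.g. for public buckets.
        return make_fs(True).open(path)


class PublicOnlyFS:
    """Stand-in filesystem (assumption): rejects credentialed opens."""

    def __init__(self, anon):
        self.anon = anon

    def open(self, path):
        if not self.anon:
            raise OSError("credentials are not valid for this bucket")
        return ("anon", path)


result = open_with_anon_fallback("s3://my-bucket/data/file.csv", PublicOnlyFS)
```

With s3fs installed, passing `lambda anon: s3fs.S3FileSystem(anon=anon)` as `make_fs` and `(OSError, NoCredentialsError)` as `retry_on` should approximate the two-step open shown in the diff.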
Review comment: maybe add a ref tag