ENH: add fsspec support #34266

Merged
39 commits
94e717f
Add remote file io using fsspec.
Apr 14, 2020
fd7e072
Attempt refactor and clean
May 19, 2020
302ba13
Merge branch 'master' into feature/add-fsspec-support
May 20, 2020
9e6d3b2
readd and adapt s3/gcs tests
May 21, 2020
4564c8d
remove gc from test
May 21, 2020
0654537
Simpler is_fsspec
May 21, 2020
8d45cbb
add test
May 21, 2020
006e736
Answered most points
May 28, 2020
724ebd8
Implemented suggestions
May 28, 2020
9da1689
lint
May 28, 2020
a595411
Add versions info
May 29, 2020
6dd1e92
Update some deps
May 29, 2020
6e13df7
issue link syntax
May 29, 2020
3262063
More specific test versions
Jun 2, 2020
4bc2411
Account for alternate S3 protocols, and ignore type error
Jun 2, 2020
68644ab
Add comment to mypy ignore insrtuction
Jun 2, 2020
32bc586
more mypy
Jun 2, 2020
037ef2c
more black
Jun 2, 2020
c3c3075
Make storage_options a dict rather than swallowing kwargs
Jun 3, 2020
85d6452
More requested changes
Jun 5, 2020
263dd3b
Remove fsspec from locale tests
Jun 10, 2020
d0afbc3
tweak
Jun 10, 2020
6a587a5
Merge branch 'master' into feature/add-fsspec-support
Jun 10, 2020
b2992c1
Merge branch 'master' into feature/add-fsspec-support
Jun 11, 2020
9c03745
requested changes
Jun 11, 2020
7982e7b
add gcsfs to environment.yml
Jun 12, 2020
946297b
rerun deps script
Jun 12, 2020
145306e
Merge branch 'master' into feature/add-fsspec-support
Jun 12, 2020
06e5a3a
account for passed filesystem again
Jun 12, 2020
8f3854c
specify should_close
Jun 12, 2020
50c08c8
lint
Jun 12, 2020
9b20dc6
Except http passed to fsspec in parquet
Jun 12, 2020
eb90fe8
lint
Jun 12, 2020
b3e2cd2
Merge branch 'master' into feature/add-fsspec-support
Jun 16, 2020
4977a00
redo whatsnew
Jun 16, 2020
29a9785
simplify parquet write
Jun 18, 2020
565031b
Retry S3 file probe with timeout, in test_to_s3
Jun 18, 2020
606ce11
expand user in non-fsspec paths for parquet; add test for this
Jun 19, 2020
60b80a6
reorder imports!
Jun 19, 2020
2 changes: 0 additions & 2 deletions ci/deps/azure-36-locale.yaml
@@ -15,7 +15,6 @@ dependencies:

   # pandas dependencies
   - beautifulsoup4
-  - gcsfs
   - html5lib
   - ipython
   - jinja2
@@ -31,7 +30,6 @@ dependencies:
   - pytables
   - python-dateutil
   - pytz
-  - s3fs
   - scipy
   - xarray
   - xlrd
1 change: 0 additions & 1 deletion ci/deps/azure-37-locale.yaml
@@ -27,7 +27,6 @@ dependencies:
   - pytables
   - python-dateutil
   - pytz
-  - s3fs
   - scipy
   - xarray
   - xlrd
5 changes: 3 additions & 2 deletions ci/deps/azure-windows-37.yaml
@@ -15,7 +15,8 @@ dependencies:
   # pandas dependencies
   - beautifulsoup4
   - bottleneck
-  - gcsfs
+  - fsspec>=0.7.4
+  - gcsfs>=0.6.0
   - html5lib
   - jinja2
   - lxml
@@ -28,7 +29,7 @@ dependencies:
   - pytables
   - python-dateutil
   - pytz
-  - s3fs
+  - s3fs>=0.4.0
   - scipy
   - sqlalchemy
   - xlrd
5 changes: 3 additions & 2 deletions ci/deps/travis-36-cov.yaml
@@ -18,7 +18,8 @@ dependencies:
   - cython>=0.29.16
   - dask
   - fastparquet>=0.3.2
-  - gcsfs
+  - fsspec>=0.7.4
+  - gcsfs>=0.6.0
   - geopandas
   - html5lib
   - matplotlib
@@ -35,7 +36,7 @@ dependencies:
   - pytables
   - python-snappy
   - pytz
-  - s3fs
+  - s3fs>=0.4.0
   - scikit-learn
   - scipy
   - sqlalchemy
2 changes: 0 additions & 2 deletions ci/deps/travis-36-locale.yaml
@@ -16,7 +16,6 @@ dependencies:
   - blosc=1.14.3
   - python-blosc
   - fastparquet=0.3.2
-  - gcsfs=0.2.2
   - html5lib
   - ipython
   - jinja2
@@ -33,7 +32,6 @@ dependencies:
   - pytables
   - python-dateutil
   - pytz
-  - s3fs=0.3.0
   - scipy
   - sqlalchemy=1.1.4
   - xarray=0.10
3 changes: 2 additions & 1 deletion ci/deps/travis-36-slow.yaml
@@ -13,6 +13,7 @@ dependencies:

   # pandas dependencies
   - beautifulsoup4
+  - fsspec>=0.7.4
   - html5lib
   - lxml
   - matplotlib
@@ -25,7 +26,7 @@ dependencies:
   - pytables
   - python-dateutil
   - pytz
-  - s3fs
+  - s3fs>=0.4.0
   - scipy
   - sqlalchemy
   - xlrd
3 changes: 2 additions & 1 deletion ci/deps/travis-37.yaml
@@ -13,12 +13,13 @@ dependencies:

   # pandas dependencies
   - botocore>=1.11
+  - fsspec>=0.7.4
   - numpy
   - python-dateutil
   - nomkl
   - pyarrow
   - pytz
-  - s3fs
+  - s3fs>=0.4.0
   - tabulate
   - pyreadstat
   - pip
5 changes: 3 additions & 2 deletions doc/source/getting_started/install.rst
@@ -267,8 +267,9 @@ SQLAlchemy 1.1.4 SQL support for databases other tha
 SciPy                     0.19.0             Miscellaneous statistical functions
 XLsxWriter                0.9.8              Excel writing
 blosc                                        Compression for HDF5
+fsspec                    0.7.4              Handling files aside from local and HTTP
 fastparquet               0.3.2              Parquet reading / writing
-gcsfs                     0.2.2              Google Cloud Storage access
+gcsfs                     0.6.0              Google Cloud Storage access
 html5lib                                     HTML parser for read_html (see :ref:`note <optional_html>`)
 lxml                      3.8.0              HTML parser for read_html (see :ref:`note <optional_html>`)
 matplotlib                2.2.2              Visualization
@@ -282,7 +283,7 @@ pyreadstat SPSS files (.sav) reading
 pytables                  3.4.3              HDF5 reading / writing
 pyxlsb                    1.0.6              Reading for xlsb files
 qtpy                                         Clipboard I/O
-s3fs                      0.3.0              Amazon S3 access
+s3fs                      0.4.0              Amazon S3 access
 tabulate                  0.8.3              Printing in Markdown-friendly format (see `tabulate`_)
 xarray                    0.8.2              pandas-like API for N-dimensional data
 xclip                                        Clipboard I/O on linux
22 changes: 20 additions & 2 deletions doc/source/whatsnew/v1.1.0.rst
@@ -245,6 +245,22 @@ If needed you can adjust the bins with the argument ``offset`` (a Timedelta) tha

 For a full example, see: :ref:`timeseries.adjust-the-start-of-the-bins`.

+fsspec now used for filesystem handling
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For reading and writing to filesystems other than local and reading from HTTP(S),
+the optional dependency ``fsspec`` will be used to dispatch operations (:issue:`33452`).
+This will give unchanged
+functionality for S3 and GCS storage, which were already supported, but also add
+support for several other storage implementations such as `Azure Datalake and Blob`_,
+SSH, FTP, dropbox and github. For docs and capabilities, see the `fsspec docs`_.
+
+The existing capability to interface with S3 and GCS will be unaffected by this
+change, as ``fsspec`` will still bring in the same packages as before.
+
+.. _Azure Datalake and Blob: https://github.com/dask/adlfs
+
+.. _fsspec docs: https://filesystem-spec.readthedocs.io/en/latest/

 .. _whatsnew_110.enhancements.other:

@@ -696,7 +712,9 @@ Optional libraries below the lowest tested version may still work, but are not c
 +-----------------+-----------------+---------+
 | fastparquet     | 0.3.2           |         |
 +-----------------+-----------------+---------+
-| gcsfs           | 0.2.2           |         |
+| fsspec          | 0.7.4           |         |
++-----------------+-----------------+---------+
+| gcsfs           | 0.6.0           |    X    |
 +-----------------+-----------------+---------+
 | lxml            | 3.8.0           |         |
 +-----------------+-----------------+---------+
@@ -712,7 +730,7 @@ Optional libraries below the lowest tested version may still work, but are not c
 +-----------------+-----------------+---------+
 | pytables        | 3.4.3           |    X    |
 +-----------------+-----------------+---------+
-| s3fs            | 0.3.0           |         |
+| s3fs            | 0.4.0           |    X    |
 +-----------------+-----------------+---------+
 | scipy           | 1.2.0           |    X    |
 +-----------------+-----------------+---------+
4 changes: 3 additions & 1 deletion environment.yml
@@ -98,7 +98,9 @@ dependencies:

   - pyqt>=5.9.2  # pandas.read_clipboard
   - pytables>=3.4.3  # pandas.read_hdf, DataFrame.to_hdf
-  - s3fs  # pandas.read_csv... when using 's3://...' path
+  - s3fs>=0.4.0  # file IO when using 's3://...' path
+  - fsspec>=0.7.4  # for generic remote file operations
+  - gcsfs>=0.6.0  # file IO when using 'gcs://...' path
   - sqlalchemy  # pandas.read_sql, DataFrame.to_sql
   - xarray  # DataFrame.to_xarray
   - cftime  # Needed for downstream xarray.CFTimeIndex test
5 changes: 3 additions & 2 deletions pandas/compat/_optional.py
@@ -8,8 +8,9 @@
 VERSIONS = {
     "bs4": "4.6.0",
     "bottleneck": "1.2.1",
+    "fsspec": "0.7.4",
     "fastparquet": "0.3.2",
-    "gcsfs": "0.2.2",
+    "gcsfs": "0.6.0",
     "lxml.etree": "3.8.0",
     "matplotlib": "2.2.2",
     "numexpr": "2.6.2",
@@ -20,7 +21,7 @@
     "pytables": "3.4.3",
     "pytest": "5.0.1",
     "pyxlsb": "1.0.6",
-    "s3fs": "0.3.0",
+    "s3fs": "0.4.0",
     "scipy": "1.2.0",
     "sqlalchemy": "1.1.4",
     "tables": "3.4.3",
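The minimum versions raised in this diff (fsspec 0.7.4, gcsfs 0.6.0, s3fs 0.4.0) are only useful if something compares them against what is installed. The following is a minimal sketch of such a comparison, not the logic pandas itself uses: `MIN_VERSIONS`, `parse_version` and `meets_minimum` are hypothetical names introduced here, and the parser assumes plain dotted numeric versions.

```python
from typing import Dict, Tuple

# Minimum versions copied from the VERSIONS entries touched in this diff.
# The dict name and the helpers below are illustrative, not pandas API.
MIN_VERSIONS: Dict[str, str] = {
    "fsspec": "0.7.4",
    "gcsfs": "0.6.0",
    "s3fs": "0.4.0",
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a plain dotted version such as '0.7.4' into (0, 7, 4)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(name: str, installed: str) -> bool:
    """Return True if `installed` is at least the pinned minimum for `name`."""
    return parse_version(installed) >= parse_version(MIN_VERSIONS[name])


print(meets_minimum("gcsfs", "0.2.2"))    # old minimum, now too low -> False
print(meets_minimum("s3fs", "0.4.0"))     # exactly the new minimum -> True
print(meets_minimum("fsspec", "0.10.0"))  # compared numerically -> True
```

Comparing integer tuples rather than raw strings matters for a case like `0.10.0` versus `0.7.4`, where a string comparison would give the wrong answer.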
80 changes: 30 additions & 50 deletions pandas/io/common.py
@@ -31,6 +31,7 @@

 from pandas._typing import FilePathOrBuffer
 from pandas.compat import _get_lzma_file, _import_lzma
+from pandas.compat._optional import import_optional_dependency

 from pandas.core.dtypes.common import is_file_like

@@ -126,20 +127,6 @@ def stringify_path(
     return _expand_user(filepath_or_buffer)


-def is_s3_url(url) -> bool:
-    """Check for an s3, s3n, or s3a url"""
-    if not isinstance(url, str):
-        return False
-    return parse_url(url).scheme in ["s3", "s3n", "s3a"]
-
-
-def is_gcs_url(url) -> bool:
-    """Check for a gcs url"""
-    if not isinstance(url, str):
-        return False
-    return parse_url(url).scheme in ["gcs", "gs"]
-
-
 def urlopen(*args, **kwargs):
     """
     Lazy-import wrapper for stdlib urlopen, as that imports a big chunk of
@@ -150,38 +137,24 @@ def urlopen(*args, **kwargs):
     return urllib.request.urlopen(*args, **kwargs)


-def get_fs_for_path(filepath: str):
+def is_fsspec_url(url: FilePathOrBuffer) -> bool:
     """
-    Get appropriate filesystem given a filepath.
-    Supports s3fs, gcs and local file system.
-
-    Parameters
-    ----------
-    filepath : str
-        File path. e.g s3://bucket/object, /local/path, gcs://pandas/obj
-
-    Returns
-    -------
-    s3fs.S3FileSystem, gcsfs.GCSFileSystem, None
-        Appropriate FileSystem to use. None for local filesystem.
+    Returns true if the given URL looks like
+    something fsspec can handle
     """
-    if is_s3_url(filepath):
-        from pandas.io import s3
-
-        return s3.get_fs()
-    elif is_gcs_url(filepath):
-        from pandas.io import gcs
-
-        return gcs.get_fs()
-    else:
-        return None
+    return (
+        isinstance(url, str)
+        and "://" in url
+        and not url.startswith(("http://", "https://"))
+    )


 def get_filepath_or_buffer(
     filepath_or_buffer: FilePathOrBuffer,
     encoding: Optional[str] = None,
     compression: Optional[str] = None,
     mode: Optional[str] = None,
+    storage_options: Optional[Dict[str, Any]] = None,
 ):
     """
     If the filepath_or_buffer is a url, translate and return the buffer.
@@ -194,6 +167,8 @@ def get_filepath_or_buffer(
     compression : {{'gzip', 'bz2', 'zip', 'xz', None}}, optional
     encoding : the encoding to use to decode bytes, default is 'utf-8'
     mode : str, optional
+    storage_options: dict, optional
+        passed on to fsspec, if using it; this is not yet accessed by the public API

     Returns
     -------
@@ -204,6 +179,7 @@ def get_filepath_or_buffer(
     filepath_or_buffer = stringify_path(filepath_or_buffer)

     if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer):
+        # TODO: fsspec can also handle HTTP via requests, but leaving this unchanged
         req = urlopen(filepath_or_buffer)
         content_encoding = req.headers.get("Content-Encoding", None)
         if content_encoding == "gzip":
@@ -213,19 +189,23 @@ def get_filepath_or_buffer(
         req.close()
         return reader, encoding, compression, True

-    if is_s3_url(filepath_or_buffer):
-        from pandas.io import s3
-
-        return s3.get_filepath_or_buffer(
-            filepath_or_buffer, encoding=encoding, compression=compression, mode=mode
-        )
-
-    if is_gcs_url(filepath_or_buffer):
-        from pandas.io import gcs
-
-        return gcs.get_filepath_or_buffer(
-            filepath_or_buffer, encoding=encoding, compression=compression, mode=mode
-        )
+    if is_fsspec_url(filepath_or_buffer):
+        assert isinstance(
+            filepath_or_buffer, str
+        )  # just to appease mypy for this branch
+        # two special-case s3-like protocols; these have special meaning in Hadoop,
+        # but are equivalent to just "s3" from fsspec's point of view
+        # cc #11071
+        if filepath_or_buffer.startswith("s3a://"):
+            filepath_or_buffer = filepath_or_buffer.replace("s3a://", "s3://")
+        if filepath_or_buffer.startswith("s3n://"):
+            filepath_or_buffer = filepath_or_buffer.replace("s3n://", "s3://")
+        fsspec = import_optional_dependency("fsspec")
+
+        file_obj = fsspec.open(
+            filepath_or_buffer, mode=mode or "rb", **(storage_options or {})
+        ).open()
+        return file_obj, encoding, compression, True

     if isinstance(filepath_or_buffer, (str, bytes, mmap.mmap)):
         return _expand_user(filepath_or_buffer), None, compression, False
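To see which inputs the new fsspec branch in `get_filepath_or_buffer` would pick up, the routing can be exercised without installing fsspec. This is a standalone sketch: `is_fsspec_url` repeats the check added in the diff, while `normalize_s3_scheme` is a hypothetical helper that condenses the two `s3a://` / `s3n://` special cases into one function. The bucket names and URLs are made up for illustration.

```python
def is_fsspec_url(url) -> bool:
    """Same check as in the diff: a string with a scheme that is not HTTP(S)."""
    return (
        isinstance(url, str)
        and "://" in url
        and not url.startswith(("http://", "https://"))
    )


def normalize_s3_scheme(url: str) -> str:
    """Hypothetical helper: map the Hadoop-style s3a/s3n prefixes to plain s3."""
    for alias in ("s3a://", "s3n://"):
        if url.startswith(alias):
            return "s3://" + url[len(alias):]
    return url


candidates = [
    "s3://bucket/key.csv",
    "s3a://bucket/key.csv",
    "gcs://pandas/obj",
    "https://example.com/data.csv",
    "/local/path.csv",
    None,
]

for candidate in candidates:
    if is_fsspec_url(candidate):
        # These would be handed to fsspec.open(...) in the real function.
        print(repr(candidate), "-> fsspec, as", normalize_s3_scheme(candidate))
    else:
        # HTTP(S) keeps using urlopen; local paths and buffers are left alone.
        print(repr(candidate), "-> not handled by the fsspec branch")
```

Note that the check is deliberately broad: any string containing `://` that is not HTTP(S) is treated as an fsspec URL, so an unknown protocol would presumably surface as an error from fsspec rather than from pandas.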
22 changes: 0 additions & 22 deletions pandas/io/gcs.py

This file was deleted.