Commit 38f4af9

martindurant and Julian de Ruiter authored
ENH: add fsspec support (#34266)
Co-authored-by: Julian de Ruiter <[email protected]>
1 parent 506eb54 commit 38f4af9

23 files changed: +279 -250 lines changed

ci/deps/azure-36-locale.yaml

-2

@@ -15,7 +15,6 @@ dependencies:
 
   # pandas dependencies
   - beautifulsoup4
-  - gcsfs
   - html5lib
   - ipython
   - jinja2
@@ -31,7 +30,6 @@ dependencies:
   - pytables
   - python-dateutil
   - pytz
-  - s3fs
   - scipy
   - xarray
   - xlrd

ci/deps/azure-37-locale.yaml

-1

@@ -27,7 +27,6 @@ dependencies:
   - pytables
   - python-dateutil
   - pytz
-  - s3fs
   - scipy
   - xarray
   - xlrd

ci/deps/azure-windows-37.yaml

+3 -2

@@ -15,7 +15,8 @@ dependencies:
   # pandas dependencies
   - beautifulsoup4
   - bottleneck
-  - gcsfs
+  - fsspec>=0.7.4
+  - gcsfs>=0.6.0
   - html5lib
   - jinja2
   - lxml
@@ -28,7 +29,7 @@ dependencies:
   - pytables
   - python-dateutil
   - pytz
-  - s3fs
+  - s3fs>=0.4.0
   - scipy
   - sqlalchemy
   - xlrd

ci/deps/travis-36-cov.yaml

+3 -2

@@ -18,7 +18,8 @@ dependencies:
   - cython>=0.29.16
   - dask
   - fastparquet>=0.3.2
-  - gcsfs
+  - fsspec>=0.7.4
+  - gcsfs>=0.6.0
   - geopandas
   - html5lib
   - matplotlib
@@ -35,7 +36,7 @@ dependencies:
   - pytables
   - python-snappy
   - pytz
-  - s3fs
+  - s3fs>=0.4.0
   - scikit-learn
   - scipy
   - sqlalchemy

ci/deps/travis-36-locale.yaml

-2

@@ -16,7 +16,6 @@ dependencies:
   - blosc=1.14.3
   - python-blosc
   - fastparquet=0.3.2
-  - gcsfs=0.2.2
   - html5lib
   - ipython
   - jinja2
@@ -33,7 +32,6 @@ dependencies:
   - pytables
   - python-dateutil
   - pytz
-  - s3fs=0.3.0
   - scipy
   - sqlalchemy=1.1.4
   - xarray=0.10

ci/deps/travis-36-slow.yaml

+2 -1

@@ -13,6 +13,7 @@ dependencies:
 
   # pandas dependencies
   - beautifulsoup4
+  - fsspec>=0.7.4
   - html5lib
   - lxml
   - matplotlib
@@ -25,7 +26,7 @@ dependencies:
   - pytables
   - python-dateutil
   - pytz
-  - s3fs
+  - s3fs>=0.4.0
   - scipy
   - sqlalchemy
   - xlrd

ci/deps/travis-37.yaml

+2 -1

@@ -13,12 +13,13 @@ dependencies:
 
   # pandas dependencies
   - botocore>=1.11
+  - fsspec>=0.7.4
   - numpy
   - python-dateutil
   - nomkl
   - pyarrow
   - pytz
-  - s3fs
+  - s3fs>=0.4.0
   - tabulate
   - pyreadstat
   - pip

doc/source/getting_started/install.rst

+3 -2

@@ -267,8 +267,9 @@ SQLAlchemy 1.1.4 SQL support for databases other tha
 SciPy                     0.19.0             Miscellaneous statistical functions
 XLsxWriter                0.9.8              Excel writing
 blosc                                        Compression for HDF5
+fsspec                    0.7.4              Handling files aside from local and HTTP
 fastparquet               0.3.2              Parquet reading / writing
-gcsfs                     0.2.2              Google Cloud Storage access
+gcsfs                     0.6.0              Google Cloud Storage access
 html5lib                                     HTML parser for read_html (see :ref:`note <optional_html>`)
 lxml                      3.8.0              HTML parser for read_html (see :ref:`note <optional_html>`)
 matplotlib                2.2.2              Visualization
@@ -282,7 +283,7 @@ pyreadstat SPSS files (.sav) reading
 pytables                  3.4.3              HDF5 reading / writing
 pyxlsb                    1.0.6              Reading for xlsb files
 qtpy                                         Clipboard I/O
-s3fs                      0.3.0              Amazon S3 access
+s3fs                      0.4.0              Amazon S3 access
 tabulate                  0.8.3              Printing in Markdown-friendly format (see `tabulate`_)
 xarray                    0.8.2              pandas-like API for N-dimensional data
 xclip                                        Clipboard I/O on linux

doc/source/whatsnew/v1.1.0.rst

+20 -2

@@ -245,6 +245,22 @@ If needed you can adjust the bins with the argument ``offset`` (a Timedelta) tha
 
 For a full example, see: :ref:`timeseries.adjust-the-start-of-the-bins`.
 
+fsspec now used for filesystem handling
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For reading and writing to filesystems other than local and reading from HTTP(S),
+the optional dependency ``fsspec`` will be used to dispatch operations (:issue:`33452`).
+This will give unchanged
+functionality for S3 and GCS storage, which were already supported, but also add
+support for several other storage implementations such as `Azure Datalake and Blob`_,
+SSH, FTP, dropbox and github. For docs and capabilities, see the `fsspec docs`_.
+
+The existing capability to interface with S3 and GCS will be unaffected by this
+change, as ``fsspec`` will still bring in the same packages as before.
+
+.. _Azure Datalake and Blob: https://github.com/dask/adlfs
+
+.. _fsspec docs: https://filesystem-spec.readthedocs.io/en/latest/
 
 .. _whatsnew_110.enhancements.other:
 
@@ -701,7 +717,9 @@ Optional libraries below the lowest tested version may still work, but are not c
 +-----------------+-----------------+---------+
 | fastparquet     | 0.3.2           |         |
 +-----------------+-----------------+---------+
-| gcsfs           | 0.2.2           |         |
+| fsspec          | 0.7.4           |         |
++-----------------+-----------------+---------+
+| gcsfs           | 0.6.0           |    X    |
 +-----------------+-----------------+---------+
 | lxml            | 3.8.0           |         |
 +-----------------+-----------------+---------+
@@ -717,7 +735,7 @@ Optional libraries below the lowest tested version may still work, but are not c
 +-----------------+-----------------+---------+
 | pytables        | 3.4.3           |    X    |
 +-----------------+-----------------+---------+
-| s3fs            | 0.3.0           |         |
+| s3fs            | 0.4.0           |    X    |
 +-----------------+-----------------+---------+
 | scipy           | 1.2.0           |    X    |
 +-----------------+-----------------+---------+

environment.yml

+3 -1

@@ -98,7 +98,9 @@ dependencies:
 
   - pyqt>=5.9.2  # pandas.read_clipboard
   - pytables>=3.4.3  # pandas.read_hdf, DataFrame.to_hdf
-  - s3fs  # pandas.read_csv... when using 's3://...' path
+  - s3fs>=0.4.0  # file IO when using 's3://...' path
+  - fsspec>=0.7.4  # for generic remote file operations
+  - gcsfs>=0.6.0  # file IO when using 'gcs://...' path
   - sqlalchemy  # pandas.read_sql, DataFrame.to_sql
   - xarray  # DataFrame.to_xarray
   - cftime  # Needed for downstream xarray.CFTimeIndex test

pandas/compat/_optional.py

+3 -2

@@ -8,8 +8,9 @@
 VERSIONS = {
     "bs4": "4.6.0",
     "bottleneck": "1.2.1",
+    "fsspec": "0.7.4",
     "fastparquet": "0.3.2",
-    "gcsfs": "0.2.2",
+    "gcsfs": "0.6.0",
     "lxml.etree": "3.8.0",
     "matplotlib": "2.2.2",
     "numexpr": "2.6.2",
@@ -20,7 +21,7 @@
     "pytables": "3.4.3",
     "pytest": "5.0.1",
     "pyxlsb": "1.0.6",
-    "s3fs": "0.3.0",
+    "s3fs": "0.4.0",
     "scipy": "1.2.0",
     "sqlalchemy": "1.1.4",
     "tables": "3.4.3",
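
The raised minimums above are easier to reason about with a small comparison helper. The following is only a sketch of a minimum-version gate using the standard library; `MINIMUM_VERSIONS`, `parse_version`, and `meets_minimum` are hypothetical names for illustration and are not taken from pandas, and the parser assumes purely numeric dotted versions.

```python
# Sketch only: a stdlib-only approximation of a minimum-version gate.
# All names here are hypothetical; this is not the pandas implementation.
from typing import Dict, Tuple

# Minimums copied from the VERSIONS changes in this commit.
MINIMUM_VERSIONS: Dict[str, str] = {
    "fsspec": "0.7.4",
    "gcsfs": "0.6.0",
    "s3fs": "0.4.0",
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a purely numeric dotted version such as '0.7.4' into (0, 7, 4)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(name: str, installed: str) -> bool:
    """Return True if `installed` is at least the minimum recorded for `name`."""
    return parse_version(installed) >= parse_version(MINIMUM_VERSIONS[name])
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.10.0" would sort before "0.7.4", while as tuples it correctly counts as newer.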

pandas/io/common.py

+30 -50

@@ -31,6 +31,7 @@
 
 from pandas._typing import FilePathOrBuffer
 from pandas.compat import _get_lzma_file, _import_lzma
+from pandas.compat._optional import import_optional_dependency
 
 from pandas.core.dtypes.common import is_file_like
 
@@ -126,20 +127,6 @@ def stringify_path(
     return _expand_user(filepath_or_buffer)
 
 
-def is_s3_url(url) -> bool:
-    """Check for an s3, s3n, or s3a url"""
-    if not isinstance(url, str):
-        return False
-    return parse_url(url).scheme in ["s3", "s3n", "s3a"]
-
-
-def is_gcs_url(url) -> bool:
-    """Check for a gcs url"""
-    if not isinstance(url, str):
-        return False
-    return parse_url(url).scheme in ["gcs", "gs"]
-
-
 def urlopen(*args, **kwargs):
     """
     Lazy-import wrapper for stdlib urlopen, as that imports a big chunk of
@@ -150,38 +137,24 @@ def urlopen(*args, **kwargs):
     return urllib.request.urlopen(*args, **kwargs)
 
 
-def get_fs_for_path(filepath: str):
+def is_fsspec_url(url: FilePathOrBuffer) -> bool:
     """
-    Get appropriate filesystem given a filepath.
-    Supports s3fs, gcs and local file system.
-
-    Parameters
-    ----------
-    filepath : str
-        File path. e.g s3://bucket/object, /local/path, gcs://pandas/obj
-
-    Returns
-    -------
-    s3fs.S3FileSystem, gcsfs.GCSFileSystem, None
-        Appropriate FileSystem to use. None for local filesystem.
+    Returns true if the given URL looks like
+    something fsspec can handle
     """
-    if is_s3_url(filepath):
-        from pandas.io import s3
-
-        return s3.get_fs()
-    elif is_gcs_url(filepath):
-        from pandas.io import gcs
-
-        return gcs.get_fs()
-    else:
-        return None
+    return (
+        isinstance(url, str)
+        and "://" in url
+        and not url.startswith(("http://", "https://"))
+    )
 
 
 def get_filepath_or_buffer(
     filepath_or_buffer: FilePathOrBuffer,
     encoding: Optional[str] = None,
     compression: Optional[str] = None,
     mode: Optional[str] = None,
+    storage_options: Optional[Dict[str, Any]] = None,
 ):
     """
     If the filepath_or_buffer is a url, translate and return the buffer.
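
To make the new predicate concrete, here is a standalone copy of the `is_fsspec_url` logic from the hunk above, applied to a few inputs. The parameter is annotated as `object` instead of `FilePathOrBuffer` so the snippet runs without pandas, and the bucket names and paths are made-up examples.

```python
# Standalone copy of the predicate added in this commit, for experimentation.
# The `object` annotation replaces pandas' FilePathOrBuffer so no import is needed.
def is_fsspec_url(url: object) -> bool:
    """Return True if `url` is a string with a scheme other than http(s)."""
    return (
        isinstance(url, str)
        and "://" in url
        and not url.startswith(("http://", "https://"))
    )


# Hypothetical inputs; none of these locations need to exist.
examples = [
    "s3://bucket/key.csv",           # remote object store: handled by fsspec
    "gcs://bucket/key.csv",          # remote object store: handled by fsspec
    "https://example.com/data.csv",  # HTTP(S) is explicitly excluded
    "/local/path.csv",               # no "://", so treated as a local path
    b"s3://bucket/key.csv",          # bytes are not str, so rejected
]
results = [is_fsspec_url(example) for example in examples]
```

The check is purely syntactic: it does not verify that a backend for the scheme is installed, so an unknown scheme would only fail later, when fsspec tries to open it.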
@@ -194,6 +167,8 @@ def get_filepath_or_buffer(
     compression : {{'gzip', 'bz2', 'zip', 'xz', None}}, optional
     encoding : the encoding to use to decode bytes, default is 'utf-8'
     mode : str, optional
+    storage_options: dict, optional
+        passed on to fsspec, if using it; this is not yet accessed by the public API
 
     Returns
     -------
@@ -204,6 +179,7 @@ def get_filepath_or_buffer(
     filepath_or_buffer = stringify_path(filepath_or_buffer)
 
     if isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer):
+        # TODO: fsspec can also handle HTTP via requests, but leaving this unchanged
         req = urlopen(filepath_or_buffer)
         content_encoding = req.headers.get("Content-Encoding", None)
         if content_encoding == "gzip":
@@ -213,19 +189,23 @@ def get_filepath_or_buffer(
         req.close()
         return reader, encoding, compression, True
 
-    if is_s3_url(filepath_or_buffer):
-        from pandas.io import s3
-
-        return s3.get_filepath_or_buffer(
-            filepath_or_buffer, encoding=encoding, compression=compression, mode=mode
-        )
-
-    if is_gcs_url(filepath_or_buffer):
-        from pandas.io import gcs
-
-        return gcs.get_filepath_or_buffer(
-            filepath_or_buffer, encoding=encoding, compression=compression, mode=mode
-        )
+    if is_fsspec_url(filepath_or_buffer):
+        assert isinstance(
+            filepath_or_buffer, str
+        )  # just to appease mypy for this branch
+        # two special-case s3-like protocols; these have special meaning in Hadoop,
+        # but are equivalent to just "s3" from fsspec's point of view
+        # cc #11071
+        if filepath_or_buffer.startswith("s3a://"):
+            filepath_or_buffer = filepath_or_buffer.replace("s3a://", "s3://")
+        if filepath_or_buffer.startswith("s3n://"):
+            filepath_or_buffer = filepath_or_buffer.replace("s3n://", "s3://")
+        fsspec = import_optional_dependency("fsspec")
+
+        file_obj = fsspec.open(
+            filepath_or_buffer, mode=mode or "rb", **(storage_options or {})
+        ).open()
+        return file_obj, encoding, compression, True
 
     if isinstance(filepath_or_buffer, (str, bytes, mmap.mmap)):
         return _expand_user(filepath_or_buffer), None, compression, False
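
The argument preparation in the new fsspec branch can be tried in isolation. The sketch below mirrors how the branch rewrites `s3a://` and `s3n://` to `s3://` and falls back to mode `"rb"` and empty storage options; `normalize_s3_scheme` and `build_open_args` are hypothetical helper names, not functions from this commit. One deliberate difference: the sketch rewrites only the leading scheme, whereas the committed code calls `str.replace`, which would also rewrite a later occurrence of the same text in the path.

```python
# Sketch only: helper names are hypothetical and not part of pandas.
from typing import Any, Dict, Optional, Tuple


def normalize_s3_scheme(url: str) -> str:
    """Map the Hadoop-style s3a:// and s3n:// schemes onto plain s3://."""
    for alias in ("s3a://", "s3n://"):
        if url.startswith(alias):
            # Rewrite the leading scheme only; the rest of the URL is untouched.
            return "s3://" + url[len(alias):]
    return url


def build_open_args(
    url: str,
    mode: Optional[str] = None,
    storage_options: Optional[Dict[str, Any]] = None,
) -> Tuple[str, str, Dict[str, Any]]:
    """Return the (url, mode, options) triple the fsspec branch would pass on."""
    return normalize_s3_scheme(url), mode or "rb", dict(storage_options or {})


# Example with a made-up bucket; "anon" is shown only as a sample option.
args = build_open_args("s3n://bucket/data.csv", storage_options={"anon": True})
```

With fsspec installed, the resulting triple corresponds to the call `fsspec.open(url, mode=mode, **options).open()` made in the diff.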

pandas/io/gcs.py

-22
This file was deleted.
