
ENH: add fsspec support #34266


Merged

39 commits
94e717f
Add remote file io using fsspec.
Apr 14, 2020
fd7e072
Attempt refactor and clean
May 19, 2020
302ba13
Merge branch 'master' into feature/add-fsspec-support
May 20, 2020
9e6d3b2
readd and adapt s3/gcs tests
May 21, 2020
4564c8d
remove gc from test
May 21, 2020
0654537
Simpler is_fsspec
May 21, 2020
8d45cbb
add test
May 21, 2020
006e736
Answered most points
May 28, 2020
724ebd8
Implemented suggestions
May 28, 2020
9da1689
lint
May 28, 2020
a595411
Add versions info
May 29, 2020
6dd1e92
Update some deps
May 29, 2020
6e13df7
issue link syntax
May 29, 2020
3262063
More specific test versions
Jun 2, 2020
4bc2411
Account for alternate S3 protocols, and ignore type error
Jun 2, 2020
68644ab
Add comment to mypy ignore instruction
Jun 2, 2020
32bc586
more mypy
Jun 2, 2020
037ef2c
more black
Jun 2, 2020
c3c3075
Make storage_options a dict rather than swallowing kwargs
Jun 3, 2020
85d6452
More requested changes
Jun 5, 2020
263dd3b
Remove fsspec from locale tests
Jun 10, 2020
d0afbc3
tweak
Jun 10, 2020
6a587a5
Merge branch 'master' into feature/add-fsspec-support
Jun 10, 2020
b2992c1
Merge branch 'master' into feature/add-fsspec-support
Jun 11, 2020
9c03745
requested changes
Jun 11, 2020
7982e7b
add gcsfs to environment.yml
Jun 12, 2020
946297b
rerun deps script
Jun 12, 2020
145306e
Merge branch 'master' into feature/add-fsspec-support
Jun 12, 2020
06e5a3a
account for passed filesystem again
Jun 12, 2020
8f3854c
specify should_close
Jun 12, 2020
50c08c8
lint
Jun 12, 2020
9b20dc6
Except http passed to fsspec in parquet
Jun 12, 2020
eb90fe8
lint
Jun 12, 2020
b3e2cd2
Merge branch 'master' into feature/add-fsspec-support
Jun 16, 2020
4977a00
redo whatsnew
Jun 16, 2020
29a9785
simplify parquet write
Jun 18, 2020
565031b
Retry S3 file probe with timeout, in test_to_s3
Jun 18, 2020
606ce11
expand user in non-fsspec paths for parquet; add test for this
Jun 19, 2020
60b80a6
reorder imports!
Jun 19, 2020
5 changes: 3 additions & 2 deletions doc/source/getting_started/install.rst
@@ -267,8 +267,9 @@ SQLAlchemy 1.1.4 SQL support for databases other than sqlite
SciPy 0.19.0 Miscellaneous statistical functions
XLsxWriter 0.9.8 Excel writing
blosc Compression for HDF5
fsspec 0.7.4 File operations handling
Member:

I would clarify the note a bit further, as it now sounds like you need this for any kind of file operation. Something like "for remote filesystems"?

Contributor Author:

Hm, it handles other things too, just not "local" or "http(s)". That would make for a long comment.

Member:

Then something like "filesystems other than local or http(s)" isn't too long, I would say.

fastparquet 0.3.2 Parquet reading / writing
gcsfs 0.2.2 Google Cloud Storage access
gcsfs 0.6.0 Google Cloud Storage access
html5lib HTML parser for read_html (see :ref:`note <optional_html>`)
lxml 3.8.0 HTML parser for read_html (see :ref:`note <optional_html>`)
matplotlib 2.2.2 Visualization
@@ -282,7 +283,7 @@ pyreadstat SPSS files (.sav) reading
pytables 3.4.3 HDF5 reading / writing
pyxlsb 1.0.6 Reading for xlsb files
qtpy Clipboard I/O
s3fs 0.3.0 Amazon S3 access
s3fs 0.4.0 Amazon S3 access
tabulate 0.8.3 Printing in Markdown-friendly format (see `tabulate`_)
xarray 0.8.2 pandas-like API for N-dimensional data
xclip Clipboard I/O on linux
14 changes: 14 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -197,6 +197,20 @@ For a full example, see: :ref:`timeseries.adjust-the-start-of-the-bins`.

.. _whatsnew_110.enhancements.other:

fsspec now used for filesystem handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For reading and writing to filesystems other than local and reading from HTTP(S),
the optional dependency ``fsspec`` will be used to dispatch operations. This will give unchanged
functionality for S3 and GCS storage, which were already supported, but also add
support for several other storage implementations such as Azure Datalake and Blob,
SSH, FTP, Dropbox and GitHub. For docs and capabilities, see the `fsspec docs`_.

In the future, we will implement a way to pass parameters to the invoked
filesystem instances.

.. _fsspec docs: https://filesystem-spec.readthedocs.io/en/latest/

Contributor:

Link to the original issue at the end of the first sentence.

Contributor:

Also not fixed yet.
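As a quick illustration of the dispatch described in the note above (a sketch, not text from the PR; the bucket and container names are placeholders, and the matching fsspec backend such as gcsfs or adlfs must be installed):

import pandas as pd

# any fsspec-recognized URL scheme now routes through fsspec
df = pd.read_csv("gcs://my-bucket/data.csv")  # dispatched to gcsfs
df.to_csv("abfs://my-container/out.csv")  # Azure Blob Storage via adlfs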

Other enhancements
^^^^^^^^^^^^^^^^^^

3 changes: 2 additions & 1 deletion environment.yml
@@ -98,7 +98,8 @@ dependencies:

- pyqt>=5.9.2 # pandas.read_clipboard
- pytables>=3.4.3 # pandas.read_hdf, DataFrame.to_hdf
- s3fs>=0.4.0 # pandas.read_csv... when using 's3://...' path (also brings in fsspec)
- s3fs>=0.4.0 # pandas.read_csv... when using 's3://...' path
- fsspec>=0.7.4 # for generic remote file operations
- sqlalchemy # pandas.read_sql, DataFrame.to_sql
- xarray # DataFrame.to_xarray
- cftime # Needed for downstream xarray.CFTimeIndex test
5 changes: 3 additions & 2 deletions pandas/compat/_optional.py
@@ -8,8 +8,9 @@
VERSIONS = {
"bs4": "4.6.0",
"bottleneck": "1.2.1",
"fsspec": "0.7.4",
"fastparquet": "0.3.2",
"gcsfs": "0.2.2",
"gcsfs": "0.6.0",
"lxml.etree": "3.8.0",
"matplotlib": "2.2.2",
"numexpr": "2.6.2",
@@ -20,7 +21,7 @@
"pytables": "3.4.3",
"pytest": "5.0.1",
"pyxlsb": "1.0.6",
"s3fs": "0.3.0",
"s3fs": "0.4.0",
"scipy": "1.2.0",
"sqlalchemy": "1.1.4",
"tables": "3.4.3",
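For context, a simplified sketch of how this VERSIONS table is consumed; the real helper in pandas/compat/_optional.py handles more cases (module renames, extra error-handling options), so treat this as an approximation:

import importlib

from distutils.version import LooseVersion

VERSIONS = {"fsspec": "0.7.4", "gcsfs": "0.6.0", "s3fs": "0.4.0"}  # excerpt


def import_optional_dependency(name: str):
    # import the module, raising a helpful error if it is absent or too old
    try:
        module = importlib.import_module(name)
    except ImportError:
        raise ImportError(f"Missing optional dependency '{name}'.") from None
    minimum = VERSIONS.get(name)
    version = getattr(module, "__version__", None)
    if (
        minimum is not None
        and version is not None
        and LooseVersion(version) < LooseVersion(minimum)
    ):
        raise ImportError(f"pandas requires version '{minimum}' or newer of '{name}'.")
    return module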
4 changes: 2 additions & 2 deletions pandas/io/common.py
@@ -31,6 +31,7 @@

from pandas._typing import FilePathOrBuffer
from pandas.compat import _get_lzma_file, _import_lzma
from pandas.compat._optional import import_optional_dependency

from pandas.core.dtypes.common import is_file_like

@@ -185,12 +186,11 @@ def get_filepath_or_buffer(
return reader, encoding, compression, True

if is_fsspec_url(filepath_or_buffer):
import fsspec
fsspec = import_optional_dependency("fsspec")

file_obj = fsspec.open(
filepath_or_buffer, mode=mode or "rb", **storage_options
).open()
# TODO: both fsspec and pandas handle compression and encoding
return file_obj, encoding, compression, True

if isinstance(filepath_or_buffer, (str, bytes, mmap.mmap)):
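For reference, the is_fsspec_url helper these hunks rely on boils down to a scheme check. A minimal sketch consistent with the discussion in this PR (the actual implementation lives in pandas/io/common.py and may differ in detail):

def is_fsspec_url(url) -> bool:
    # anything with a URL scheme other than http(s) is handed to fsspec,
    # e.g. "s3://", "gcs://", "memory://"; plain local paths have no "://"
    return (
        isinstance(url, str)
        and "://" in url
        and not url.startswith(("http://", "https://"))
    )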
10 changes: 8 additions & 2 deletions pandas/io/parquet.py
@@ -102,7 +102,10 @@ def write(
# write_to_dataset does not support a file-like object when
# a directory path is used, so just pass the path string.
if partition_cols is not None:
# user may provide filesystem= with an instance, in which case it takes priority
# and fsspec need not analyse the path
if is_fsspec_url(path) and "filesystem" not in kwargs:
Contributor:

Can you leave a comment explaining this "filesystem" not in kwargs check? It's not obvious to me why it's needed.

Contributor Author:

In fsspec, you can specify the exact protocol you would like beyond that inferred from the URL. Given that we don't pass storage_options through yet, perhaps this gives more flexibility than required and I can remove it.

Contributor Author:

Sorry, edit on that: this is the filesystem parameter (i.e., an actual instance) to pyarrow. I have no idea if people might currently be using that.

Contributor:

Ah, you're saying the user could pass a filesystem like

df.to_parquet(..., filesystem=filesystem)

That certainly seems possible. Could you ensure that we have a test for that?
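For concreteness, a sketch of that explicit-filesystem call shape (editorial illustration; the bucket name is a placeholder, s3fs must be installed, and the test added below in test_parquet.py exercises this path):

import pandas as pd
import s3fs

df = pd.DataFrame({"a": [1, 2, 3]})
fs = s3fs.S3FileSystem()
# the instance bypasses fsspec URL inference and is handed to pyarrow as-is
df.to_parquet("pandas-test/out.parquet", filesystem=fs)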

import_optional_dependency("fsspec")
import fsspec.core

fs, path = fsspec.core.url_to_fs(path)
@@ -123,12 +126,15 @@

def read(self, path, columns=None, **kwargs):
if is_fsspec_url(path) and "filesystem" not in kwargs:
Member:

Can you additionally check that use_legacy_dataset=False is not in the kwargs? As long as fsspec/filesystem_spec#295 is not solved, converting a string URI into a path + fsspec filesystem would make that option unusable.

Member:

Can you check this comment?
(I will see whether I can write a test that would catch it.)

Member:

ping for this one

Member:

(I am also fine with doing this as a follow-up myself)

Contributor Author:

I thought you meant that you intended to handle it; and yes please, you are in the best place to check the finer details of the calls to pyarrow.
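A sketch of the extra guard being requested (hypothetical; the thread leaves this to a follow-up, so the exact form is not settled):

# only convert the URI ourselves on pyarrow's legacy dataset path; with
# use_legacy_dataset=False, pass the string through and let pyarrow resolve it
if (
    is_fsspec_url(path)
    and "filesystem" not in kwargs
    and kwargs.get("use_legacy_dataset", True)
):
    ...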

import_optional_dependency("fsspec")
import fsspec.core

fs, path = fsspec.core.url_to_fs(path)
parquet_ds = self.api.parquet.ParquetDataset(path, filesystem=fs, **kwargs)
else:
parquet_ds = self.api.parquet.ParquetDataset(path, **kwargs)
# this key is valid for ParquetDataset but not read_pandas
kwargs.pop("filesystem", None)

kwargs["columns"] = columns
result = parquet_ds.read_pandas(**kwargs).to_pandas()
@@ -170,7 +176,7 @@ def write(
kwargs["file_scheme"] = "hive"

if is_fsspec_url(path):
import fsspec
fsspec = import_optional_dependency("fsspec")

# if filesystem is provided by fsspec, file must be opened in 'wb' mode.
kwargs["open_with"] = lambda path, _: fsspec.open(path, "wb").open()
@@ -189,7 +195,7 @@

def read(self, path, columns=None, **kwargs):
if is_fsspec_url(path):
import fsspec
fsspec = import_optional_dependency("fsspec")

open_with = lambda path, _: fsspec.open(path, "rb").open()
parquet_file = self.api.ParquetFile(path, open_with=open_with)
16 changes: 5 additions & 11 deletions pandas/tests/io/test_fsspec.py
@@ -18,16 +18,13 @@

@pytest.fixture
def cleared_fs():
import fsspec
fsspec = pytest.importorskip("fsspec")

memfs = fsspec.filesystem("memory")
try:
yield memfs
finally:
memfs.store.clear()
yield memfs
memfs.store.clear()
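For anyone unfamiliar with the backend these tests use: fsspec's "memory" filesystem is a process-global in-memory store, which is why the fixture clears it between tests. A quick sketch of its behavior (assumes fsspec is installed):

import fsspec

fs = fsspec.filesystem("memory")  # global in-memory filesystem
with fs.open("/test/hello.txt", "wb") as f:
    f.write(b"hi")
assert fs.cat("/test/hello.txt") == b"hi"
fs.store.clear()  # state persists across instances, so tests must reset it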


@td.skip_if_no("fsspec")
def test_read_csv(cleared_fs):
from fsspec.implementations.memory import MemoryFile

@@ -37,8 +34,7 @@ def test_read_csv(cleared_fs):
tm.assert_frame_equal(df1, df2)


@td.skip_if_no("fsspec")
def test_reasonable_error(monkeypatch):
def test_reasonable_error(monkeypatch, cleared_fs):
from fsspec.registry import known_implementations
from fsspec import registry

@@ -57,7 +53,6 @@ def test_reasonable_error(monkeypatch):
assert err_mgs in str(e.value)


@td.skip_if_no("fsspec")
def test_to_csv(cleared_fs):
df1.to_csv("memory://test/test.csv", index=True)
df2 = read_csv("memory://test/test.csv", parse_dates=["dt"], index_col=0)
@@ -66,8 +61,7 @@ def test_to_csv(cleared_fs):


@td.skip_if_no("fastparquet")
@td.skip_if_no("fsspec")
def test_to_parquet_new_file(monkeypatch):
def test_to_parquet_new_file(monkeypatch, cleared_fs):
"""Regression test for writing to a not-yet-existent GCS Parquet file."""
df1.to_parquet(
"memory://test/test.csv", index=True, engine="fastparquet", compression=None
7 changes: 7 additions & 0 deletions pandas/tests/io/test_parquet.py
@@ -537,6 +537,13 @@ def test_categorical(self, pa):
check_round_trip(df, pa, expected=expected)

def test_s3_roundtrip(self, df_compat, s3_resource, pa):
s3fs = pytest.importorskip("s3fs")
s3 = s3fs.S3FileSystem()
kw = dict(filesystem=s3)
check_round_trip(df_compat, pa, path="pandas-test/pyarrow.parquet",
read_kwargs=kw, write_kwargs=kw)

def test_s3_roundtrip_explicit_fs(self, df_compat, s3_resource, pa):
Member:

Naming nitpick: I think it's the test above that has the explicit filesystem passed?

# GH #19134
check_round_trip(df_compat, pa, path="s3://pandas-test/pyarrow.parquet")
