ENH: add fsspec support #34266

@@ -197,6 +197,20 @@ For a full example, see: :ref:`timeseries.adjust-the-start-of-the-bins`.

.. _whatsnew_110.enhancements.other:

fsspec now used for filesystem handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For reading and writing to filesystems other than local and reading from HTTP(S),
the optional dependency ``fsspec`` will be used to dispatch operations. This will give unchanged
Review comment: Link to the original issue at the end of the first sentence.
Review comment: Also not fixed yet.
functionality for S3 and GCS storage, which were already supported, but also add
support for several other storage implementations such as Azure Datalake and Blob,
SSH, FTP, dropbox and github. For docs and capabilities, see the `fsspec docs`_.

In the future, we will implement a way to pass parameters to the invoked
filesystem instances.

.. _fsspec docs: https://filesystem-spec.readthedocs.io/en/latest/

Other enhancements
^^^^^^^^^^^^^^^^^^

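The dispatch rule in the release note ("filesystems other than local and reading from HTTP(S)") can be pictured as a predicate on the path string. This is a minimal sketch, not necessarily the code in this PR: the body of `is_fsspec_url` is not shown in the diff, so the check below (a `://` separator and a scheme other than `http`/`https`) is an assumption.

```python
def is_fsspec_url(path) -> bool:
    """Guess whether a path should be handed to fsspec (assumed logic)."""
    return (
        isinstance(path, str)
        and "://" in path
        and not path.startswith(("http://", "https://"))
    )


# A remote, non-HTTP URL would be dispatched to fsspec.
s3_result = is_fsspec_url("s3://pandas-test/pyarrow.parquet")
# HTTP(S) URLs and local paths would keep their existing handling.
http_result = is_fsspec_url("https://example.com/data.csv")
local_result = is_fsspec_url("/tmp/data.parquet")
```

Under this assumption only the first call returns `True`; non-string inputs such as open file handles are also rejected.
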

@@ -102,7 +102,10 @@ def write(
# write_to_dataset does not support a file-like object when
# a directory path is used, so just pass the path string.
if partition_cols is not None:
# user may provide filesystem= with an instance, in which case it takes priority
# and fsspec need not analyse the path
if is_fsspec_url(path) and "filesystem" not in kwargs:
Review comment: Can you leave a comment explaining this
Review comment: In
Review comment: Sorry, edit on that: this is the filesystem parameter (i.e., an actual instance) to pyarrow. I have no idea if people might currently be using that.
Review comment: Ah, you're saying the user could pass a filesystem like
That certainly seems possible. Could you ensure that we have a test for that?
import_optional_dependency('fsspec')
import fsspec.core

fs, path = fsspec.core.url_to_fs(path)

@@ -123,12 +126,15 @@ def write(

def read(self, path, columns=None, **kwargs):
if is_fsspec_url(path) and "filesystem" not in kwargs:
Review comment: Can you additionally check that
Review comment: Can you check this comment?
Review comment: ping for this one
Review comment: (I am also fine with doing this as a follow-up myself)
Review comment: I thought you meant that you intended to handle it; and yes please, you are in the best place to check the finer details of the calls to pyarrow.
import_optional_dependency('fsspec')
import fsspec.core

fs, path = fsspec.core.url_to_fs(path)
parquet_ds = self.api.parquet.ParquetDataset(path, filesystem=fs, **kwargs)
else:
parquet_ds = self.api.parquet.ParquetDataset(path, **kwargs)
# this key valid for ParquetDataset but not read_pandas
kwargs.pop('filesystem', None)

kwargs["columns"] = columns
result = parquet_ds.read_pandas(**kwargs).to_pandas()

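The guard `is_fsspec_url(path) and "filesystem" not in kwargs` means that a user-supplied `filesystem=` instance wins, and the URL is resolved by fsspec only when none was given. Below is a dependency-free sketch of that decision. The names `pick_filesystem` and `fake_url_to_fs`, and the strings standing in for filesystem instances, are hypothetical; the URL predicate is also an assumption, since its body is not shown in the diff.

```python
def is_fsspec_url(path) -> bool:
    # Assumed predicate: a URL with a scheme other than http(s).
    return (
        isinstance(path, str)
        and "://" in path
        and not path.startswith(("http://", "https://"))
    )


def fake_url_to_fs(url):
    """Hypothetical stand-in for fsspec.core.url_to_fs: split 'proto://rest'."""
    protocol, _, stripped = url.partition("://")
    return f"<{protocol} filesystem>", stripped


def pick_filesystem(path, kwargs, url_to_fs=fake_url_to_fs):
    """Return (filesystem, path); an explicit filesystem= takes priority."""
    if is_fsspec_url(path) and "filesystem" not in kwargs:
        return url_to_fs(path)
    # Either the path is not an fsspec URL or the caller chose a filesystem.
    return kwargs.get("filesystem"), path


inferred = pick_filesystem("s3://pandas-test/pyarrow.parquet", {})
explicit = pick_filesystem(
    "pandas-test/pyarrow.parquet", {"filesystem": "<user s3fs>"}
)
both = pick_filesystem(
    "s3://pandas-test/pyarrow.parquet", {"filesystem": "<user s3fs>"}
)
```

In the `both` case the path is left untouched and the caller's instance is used, which is the situation the reviewers asked to have covered by a test.
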
@@ -170,7 +176,7 @@ def write(
kwargs["file_scheme"] = "hive"

if is_fsspec_url(path):
-import fsspec
+fsspec = import_optional_dependency('fsspec')

# if filesystem is provided by fsspec, file must be opened in 'wb' mode.
kwargs["open_with"] = lambda path, _: fsspec.open(path, "wb").open()

@@ -189,7 +195,7 @@ def write(

def read(self, path, columns=None, **kwargs):
if is_fsspec_url(path):
-import fsspec
+fsspec = import_optional_dependency('fsspec')

open_with = lambda path, _: fsspec.open(path, "rb").open()
parquet_file = self.api.ParquetFile(path, open_with=open_with)

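Both fastparquet hunks pass an `open_with` callable that takes `(path, mode)` but ignores the requested mode and always opens in a fixed one (`"wb"` for writing, `"rb"` for reading). The sketch below illustrates that adapter pattern using only the standard library. The built-in `open` stands in for `fsspec.open(path, mode).open()`, and `make_open_with` is a hypothetical helper, not part of pandas, fastparquet, or fsspec.

```python
import os
import tempfile


def make_open_with(opener, forced_mode):
    """Build an open_with-style callable that ignores the requested mode."""
    def open_with(path, _requested_mode):
        return opener(path, forced_mode)
    return open_with


# Stand-ins for the two lambdas in the diff.
write_opener = make_open_with(open, "wb")
read_opener = make_open_with(open, "rb")

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.bin")
    # The caller asks for "rb+", but the file is opened with "wb".
    with write_opener(target, "rb+") as handle:
        handle.write(b"PAR1")
    # The caller asks for text mode, but the file is opened with "rb".
    with read_opener(target, "r") as handle:
        payload = handle.read()
```

Pinning the mode inside the callable keeps the behaviour the same whatever mode the caller requests, which matches the diff's comment that an fsspec-provided file must be opened in `'wb'` mode.
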

@@ -537,6 +537,13 @@ def test_categorical(self, pa):
check_round_trip(df, pa, expected=expected)

def test_s3_roundtrip(self, df_compat, s3_resource, pa):
s3fs = pytest.importorskip("s3fs")
s3 = s3fs.S3FileSystem()
kw = dict(filesystem=s3)
check_round_trip(df_compat, pa, path="pandas-test/pyarrow.parquet",
read_kwargs=kw, write_kwargs=kw)

def test_s3_roundtrip_explicit_fs(self, df_compat, s3_resource, pa):
Review comment: Naming nitpick: I think it's the test above that has the explicit filesystem passed?
# GH #19134
check_round_trip(df_compat, pa, path="s3://pandas-test/pyarrow.parquet")

Review comment: I would clarify the note a bit further, as now it sounds you need this for any kind of file operations, something like "for remote filesystems" ?
Review comment: Hm, it handles other things too, just not "local" or "http(s)". That would make for a long comment.
Review comment: Then something like "filesystems other than local or http(s)" isn't too long I would say