[WIP] Add remote file io using fsspec. #33549

Closed

44 changes: 34 additions & 10 deletions pandas/io/common.py
@@ -158,6 +158,23 @@ def urlopen(*args, **kwargs):
return urllib.request.urlopen(*args, **kwargs)


def is_fsspec_url(url) -> bool:
Contributor: Can you add a type annotation for url?

"""
Returns true if fsspec is installed and the URL references a known
fsspec filesystem.
"""

if not isinstance(url, str):
Contributor: You don't need this check.

return False

try:
from fsspec.registry import known_implementations
scheme = parse_url(url).scheme
Contributor: Perhaps use split_protocol, or even just check for "://" in url or "::" in url. (A sketch along these lines follows this function.)

return scheme != "file" and scheme in known_implementations
except ImportError:
return False
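
To illustrate the review comments above, here is a minimal sketch of how is_fsspec_url might look with a type annotation on url and a plain string check instead of URL parsing. This is only an illustration of the suggestions, not the code in this PR; the exact exclusion of plain http/https URLs is an assumption, since those are handled by a separate branch in pandas.

    from pandas._typing import FilePathOrBuffer

    def is_fsspec_url(url: FilePathOrBuffer) -> bool:
        """
        Return True if the URL looks like something fsspec can handle,
        e.g. "s3://bucket/key" or "filecache::s3://bucket/key".
        """
        # A plain string check avoids URL parsing and does not require fsspec
        # to be importable just to answer the question.  The isinstance check
        # stays because FilePathOrBuffer may also be an open buffer object.
        return (
            isinstance(url, str)
            and ("://" in url or "::" in url)
            and not url.startswith(("http://", "https://"))
        )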


def get_filepath_or_buffer(
filepath_or_buffer: FilePathOrBuffer,
encoding: Optional[str] = None,
@@ -194,19 +211,26 @@ def get_filepath_or_buffer(
req.close()
return reader, encoding, compression, True

if is_s3_url(filepath_or_buffer):
from pandas.io import s3
if is_fsspec_url(filepath_or_buffer):
import fsspec
Contributor: This can be import_optional_dependency("fsspec"). Make sure we have a nice error message on failure.

scheme = parse_url(filepath_or_buffer).scheme
Contributor: These three lines can be done with fsspec.open, which should take care of encoding and compression too, except for the garbage-collection issue.

filesystem = fsspec.filesystem(scheme)
file_obj = filesystem.open(filepath_or_buffer, mode=mode or "rb")
Member: Do we need to pass the encoding through to .open()? If you do, you will potentially also fix #26124 :)

Contributor: If using filesystem.open directly, I would for now always open binary and use the existing encoding handling within pandas. (A combined sketch follows this branch.)

return file_obj, encoding, compression, True
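
Putting the comments above together, here is a rough sketch of how the fsspec branch could look: import_optional_dependency("fsspec") gives an informative ImportError when fsspec is missing, fsspec.open resolves the filesystem from the URL, and opening in binary mode leaves encoding and compression to pandas' existing handling. The helper name below is hypothetical and the whole snippet is an assumption about how the suggestions fit together, not the code in this PR.

    from pandas.compat._optional import import_optional_dependency

    def _fsspec_get_filepath_or_buffer(
        filepath_or_buffer, encoding=None, compression=None, mode=None
    ):
        # Hypothetical helper, shown only to illustrate the shape of the branch.
        # import_optional_dependency raises an informative ImportError if
        # fsspec is not installed.
        fsspec = import_optional_dependency("fsspec")

        # fsspec.open resolves the protocol (s3://, gcs://, ...) and returns an
        # OpenFile; calling .open() yields the underlying file object.  Opening
        # in binary mode sidesteps the encoding question for now and leaves
        # compression to the rest of get_filepath_or_buffer, per the review
        # comments above.
        file_obj = fsspec.open(filepath_or_buffer, mode=mode or "rb").open()
        return file_obj, encoding, compression, True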

return s3.get_filepath_or_buffer(
filepath_or_buffer, encoding=encoding, compression=compression, mode=mode
)
# if is_s3_url(filepath_or_buffer):
Contributor: You can delete all of these commented-out lines.

# from pandas.io import s3

if is_gcs_url(filepath_or_buffer):
from pandas.io import gcs
# return s3.get_filepath_or_buffer(
# filepath_or_buffer, encoding=encoding, compression=compression, mode=mode
# )

return gcs.get_filepath_or_buffer(
filepath_or_buffer, encoding=encoding, compression=compression, mode=mode
)
# if is_gcs_url(filepath_or_buffer):
# from pandas.io import gcs

# return gcs.get_filepath_or_buffer(
# filepath_or_buffer, encoding=encoding, compression=compression, mode=mode
# )

if isinstance(filepath_or_buffer, (str, bytes, mmap.mmap)):
return _expand_user(filepath_or_buffer), None, compression, False
4 changes: 2 additions & 2 deletions pandas/tests/io/test_gcs.py
@@ -29,7 +29,7 @@ def test_read_csv_gcs(monkeypatch):
)

class MockGCSFileSystem:
def open(*args):
def open(self, path, mode, *args):
return StringIO(df1.to_csv(index=False))

monkeypatch.setattr("gcsfs.GCSFileSystem", MockGCSFileSystem)
@@ -51,7 +51,7 @@ def test_to_csv_gcs(monkeypatch):
s = StringIO()

class MockGCSFileSystem:
def open(*args):
def open(self, path, mode, *args):
return s

monkeypatch.setattr("gcsfs.GCSFileSystem", MockGCSFileSystem)
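
For context on the test changes: the new code path calls filesystem.open(filepath_or_buffer, mode=mode or "rb"), i.e. an instance method that receives an explicit path and mode, so the mock can no longer swallow everything in a bare *args. Below is essentially the mock from the diff above with the reasoning spelled out in a comment; StringIO and df1 come from the surrounding test module.

    class MockGCSFileSystem:
        # pandas now calls filesystem.open(path, mode=...), so the mock's open
        # must accept self, path and mode rather than a bare *args.
        def open(self, path, mode, *args):
            return StringIO(df1.to_csv(index=False))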