
ENH: Use fsspec for reading/writing from/to S3, GCS, Azure Blob, etc. #33452


Closed
jrderuiter opened this issue Apr 10, 2020 · 26 comments · Fixed by #34266
Labels: Enhancement, IO (Data IO issues that don't fit into a more specific label)

Comments

@jrderuiter

jrderuiter commented Apr 10, 2020

Is your feature request related to a problem?

Currently pandas has some support for S3 and GCS using the pandas.io.{gcs,s3} modules, which are based on s3fs and gcsfs.

It seems like we could easily broaden the support for different filesystems by leveraging the fsspec library (https://pypi.org/project/fsspec/) and its interface implementations (see https://github.com/intake/filesystem_spec/blob/master/fsspec/registry.py for some examples) to read/write files in pandas.

This way, we would also be able to use filesystems such as Azure-based storage systems directly from pandas.

Describe the solution you'd like

I'd like to be able to use the different file systems supported by fsspec in pandas with something like:

import pandas as pd

df1 = pd.read_csv("abfs://my_container/my_file.csv")
df1.to_json("file:///some/local/path.json") # Also works without file:// prefix.

df2 = pd.read_csv("s3://my_bucket/my_file.csv")
...

API breaking implications

In principle, it looks as if we could cover most of the work by adapting get_filepath_or_buffer in pandas/io/common.py to use fsspec. We would of course have to verify that fsspec doesn't break anything compared to the current implementations.
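As a rough illustration, the adapted helper could look something like the sketch below (illustrative only, not the final API; it assumes fsspec is installed and keeps the current behaviour for local paths and buffers):

import fsspec

def get_filepath_or_buffer(filepath_or_buffer, mode="rb", storage_options=None):
    # Sketch: route anything with an explicit protocol (s3://, gs://, abfs://, ...)
    # through fsspec; everything else falls back to the existing handling.
    if isinstance(filepath_or_buffer, str) and "://" in filepath_or_buffer:
        open_file = fsspec.open(filepath_or_buffer, mode=mode,
                                **(storage_options or {}))
        # Note the lifetime caveat discussed further down in this thread: the
        # OpenFile must stay alive for as long as the returned handle is used.
        return open_file.open()
    return filepath_or_buffer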

One challenge is that some storage systems require extra arguments (called storage options in fsspec). For example, Azure blob requires the user to pass two storage options (account_name and account_key) to be able to access the storage. We would need to consider how to pass these options to the correct methods, either by (a) setting these options globally for a given type of storage or (b) passing the options through the pd.read_* functions and pd.DataFrame.to_* methods.
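For example, option (b) could look roughly like this (the storage_options keyword follows the fsspec/dask convention and is only a proposal here; the account values are placeholders):

import pandas as pd

# Backend-specific credentials passed per call and forwarded to fsspec.
df = pd.read_csv(
    "abfs://my_container/my_file.csv",
    storage_options={"account_name": "myaccount", "account_key": "..."},
)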

Describe alternatives you've considered

This seems like a change with a relatively small footprint, as pandas already uses s3fs and gcsfs, which are both implementations of the broader fsspec interface. It should also provide support for a great number of different filesystems with minimal changes, compared to adding support for each filesystem ourselves. As such, I would think it is the preferred approach.

Another approach would be to add support for each additional filesystem as it comes along. This however would require adding and maintaining code for each filesystem, which seems less preferable.

Additional context

I recently implemented similar wrappers for pandas code at one of our clients, and am therefore somewhat familiar with fsspec etc. I would be happy to see if we can contribute these ideas + code to the pandas project.

@jrderuiter added the Enhancement and Needs Triage labels on Apr 10, 2020
@charlesdong1991 added the IO label and removed the Needs Triage label on Apr 10, 2020
@jrderuiter
Author

Keen to hear any thoughts!

@TomAugspurger
Contributor

TomAugspurger commented Apr 10, 2020

One challenge is that pandas wouldn't be able to take a hard dependency on fsspec. Whatever solution we end up with we'll need to keep that in mind.

IIRC, the s3- and gcs-specific bits of code are around URL discovery (taking a url like s3:// and deciding to treat it specially) and the requirement that the buffer returned by _get_filepath_or_buffer needs to be wrapped in a TextIOWrapper, since s3fs / gcsfs / fsspec only deal with bytes.
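For illustration, this is roughly the wrapping involved today (bucket and key are placeholders; assumes s3fs is installed):

import io
import s3fs

fs = s3fs.S3FileSystem(anon=True)
raw = fs.open("my-bucket/my_file.csv", mode="rb")   # bytes-only handle
text = io.TextIOWrapper(raw, encoding="utf-8")      # the text layer pandas adds
print(text.readline())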

cc @martindurant just for your awareness.

@martindurant
Contributor

Thanks for posting @jrderuiter, and for letting me know @TomAugspurger.

Sorry, what does TestIOWrapper do?

Note the existence of fsspec.open, which allows you to get a file-like object from a target in a number of ways (e.g., with local caching, decompression, text decoding).
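For example (the URL, compression and encoding here are placeholders, not tied to any real dataset):

import fsspec

# mode="rt" returns a decoded text handle; compression="infer" lets fsspec
# decompress based on the file extension.
with fsspec.open("s3://my-bucket/data.csv.gz", mode="rt",
                 compression="infer", encoding="utf-8") as f:
    header = f.readline()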

I can well understand pandas not wanting to depend upon fsspec, although it is small and has no deps of its own. It would be good to massively increase the fsspec test suite, e.g., for all encoding options.

@TomAugspurger
Contributor

TomAugspurger commented Apr 10, 2020 via email

@martindurant
Contributor

You wrote test, not text :)

So using fsspec.open with a text mode does this too.

@jrderuiter
Author

jrderuiter commented Apr 10, 2020

@TomAugspurger What do you mean exactly by your comment

One challenge is that pandas wouldn't be able to take a hard dependency on fsspec.

Currently pandas already relies on s3fs and gcsfs to open files on these file systems, right? Or do you mean that pandas should be able to work normally with local files without having fsspec installed? Or does it have to do with the quality of the fsspec package? (Which is closely related to gcsfs and s3fs.)

@jrderuiter
Author

In essence, most of the integration would indeed involve URL handling (identifying which file system should be used and then calling open on the fsspec file system). We could also use fsspec.open directly, which lets fsspec figure out the file system and also lets you specify options for reading in text mode and handling things like compression.

@TomAugspurger
Contributor

TomAugspurger commented Apr 10, 2020 via email

@martindurant
Contributor

The inclusion of any protocol specifier, "://", could be taken to mean "use fsspec", although this would mean that "path" and "file://" don't use the same code.
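For illustration, that rule is roughly the following (a hypothetical helper, not existing pandas code):

def _should_use_fsspec(path_or_url) -> bool:
    # Any explicit protocol prefix routes through fsspec. This is the
    # asymmetry noted above: "file:///tmp/x.csv" would take the fsspec
    # branch while "/tmp/x.csv" would not, despite naming the same file.
    return isinstance(path_or_url, str) and "://" in path_or_url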

@jreback
Contributor

jreback commented Apr 10, 2020

I don’t think it would be hard to integrate this

could start by using it instead of gcsfs and s3fs and see what needs changing

@TomAugspurger
Contributor

although this would mean that "path" and "file://" don't use the same code.

I'd hope that we get back on the same code path pretty quickly though, right? We'd take a detour through fsspec to get the file object, but once we have that we're back to where we would have been.

@martindurant
Contributor

Right, hope so :)

@jrderuiter
Author

I don’t think it would be hard to integrate this

could start by using it instead of gcsfs and s3fs and see what needs changing

I’ll see if I can start drafting a PR :)

@jrderuiter
Author

Just added an initial PR that changes get_filepath_or_buffer to use fsspec for non-file URLs supported by (known) fsspec implementations.

Unfortunately we couldn't use fsspec.open for opening files, as the method returns an OpenFile instance which is not a file-like object. We could get an underlying file-like object by calling open on the OpenFile, but this file handle would be closed before we could ever read/write from/to the file, as the OpenFile parent object gets collected when it goes out of scope.

One thing that's still missing is how to handle the storage_options arguments to fsspec. For some filesystems we need to be able to pass storage_options to provide required arguments (such as account name/key for Azure Blob). Can we add this argument to the read_* functions and to_* methods and pass it through to get_filepath_or_buffer? Or would another approach be preferable? (Setting the options globally for a given protocol could be possible, but maybe also not very elegant.)

Some tests in pandas/tests/io/test_gcs.py seem to be failing due to the monkey patching done in the tests; running the tests individually seems to work fine. Any other tests I should focus on? (The test_s3.py tests seem very superficial, btw; we could consider adding some 'real' tests using moto, as sketched below.)
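For example, a moto-based roundtrip test could look roughly like this (a sketch only; the bucket name and assertions are made up, and it assumes moto's mock_s3 decorator, boto3 and s3fs are available, with dummy AWS credentials set in the environment):

import boto3
import pandas as pd
from moto import mock_s3

@mock_s3
def test_read_csv_from_mocked_s3():
    # Put a small CSV into a mocked bucket and read it back through pandas.
    client = boto3.client("s3", region_name="us-east-1")
    client.create_bucket(Bucket="pandas-test")
    client.put_object(Bucket="pandas-test", Key="simple.csv", Body=b"a,b\n1,2\n")

    df = pd.read_csv("s3://pandas-test/simple.csv")
    assert df.shape == (1, 2)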

@martindurant
Contributor

as the method returns an OpenFile instance which is not a file-like object

In our examples, you typically see

with fsspec.open(url, mode, compression, ...) as f:
    df = pd.read_csv(f, ...)

(i.e., the context makes the OpenFile into a concrete file-like)

For some filesystems we need to be able to pass storage_options to provide required arguments (such as account name/key for Azure Blob).

How is this done now for s3fs?

@jrderuiter
Author

jrderuiter commented Apr 14, 2020

Yeah, but pandas just expects a file object to be returned by get_filepath_or_buffer, which the calling code will close for us. This means we can’t use the context manager approach.

Currently no storage options are passed to s3fs, I guess because boto3 allows you to specify things like access keys etc using env variables and aws configs.

@martindurant
Contributor

So you would need to keep hold of the OpenFile object, or have the generated file-like keep a (circular) reference to it? That's a bit annoying.

True, the defaults for s3 will often work... GCS also has ways to define the default, but I'm not sure about azure. Some backends (like ssh) will always need parameters, although some of these can be encoded into the URL.
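For example (host, user and password are placeholders; the ssh/sftp backend needs paramiko installed):

import fsspec

# Connection details encoded directly in the URL.
with fsspec.open("ssh://user:secret@host.example.com:22/data/file.csv", mode="rt") as f:
    first_line = f.readline()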

@jrderuiter
Author

jrderuiter commented Apr 15, 2020

Yeah, I was also a bit disappointed. Adding a circular reference might work around the issue, but it's not really elegant :(

Still, it would be nice to be able to explicitly pass these options somewhere. I now also have some code that interfaces with multiple Azure Blob stores or multiple S3 buckets with different credentials. Setting these things globally is not really sufficient in this case.

@martindurant
Contributor

In dask we have a single optional storage_options for the various IO functions, and fsspec itself refers to this convention in places.
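For illustration, the dask convention looks like this (bucket and options are placeholders):

import dask.dataframe as dd

# One dict, forwarded verbatim to the fsspec filesystem constructor.
ddf = dd.read_csv("s3://my-bucket/*.csv", storage_options={"anon": True})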

@martindurant
Contributor

martindurant commented Apr 16, 2020

Well... the simplest thing you could do is

def read_csv(url, storage_options, ...):
    if isinstance(url, str) and "://" in url:
        with fsspec.open(url, **storage_options) as f:
            return read_csv(f, ...)

but unfortunately there are many read functions.

So I am thinking that we could change OpenFile.open to:

    def open(self):
        f = self.__enter__()
        f.close = self.close
        return f

which would keep things alive. This does make a circular reference, but OpenFile.close removes the reference to f from the OpenFile instance (and calling OpenFile.close again will have no effect).
This code will cause infinite recursion as written, however, since you can no longer reach the original version of f.close - but I think it's nearly there!

@TomAugspurger
Contributor

cc @gfyoung if you have thoughts on the overall flow here.

@gfyoung
Member

gfyoung commented Apr 16, 2020

Couple of points:

  • Allowing storage_options isn't super problematic here. We could easily add another parameter called storage_options to read_csv that accepts a dict. Perhaps there's a better way so that we don't add yet another parameter to read_csv, but this would be the simplest of course.

  • The issue of operating on an OpenFile object is a slightly more problematic one here for some of the reasons described above.

One option: could we make OpenFile file-like as well, e.g. we call OpenFile.open() and essentially have the OpenFile object then dispatch IO methods to the underlying file object(s)? You sort of do this already with __fspath__.
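For illustration, the kind of dispatch being described might look roughly like this (a hypothetical sketch, not fsspec's actual OpenFile):

class DelegatingOpenFile:
    # Hypothetical wrapper: forward read/write/seek/etc. to the concrete
    # file object created on open(), while keeping the OpenFile alive.
    def __init__(self, open_file):
        self._open_file = open_file
        self._f = open_file.open()

    def __getattr__(self, name):
        return getattr(self._f, name)

    def close(self):
        self._f.close()
        self._open_file.close()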

@martindurant
Contributor

Previous experience of magically/silently passing method calls through to another object has proven error-prone for me; I would much rather put all the logic into `OpenFile.open`, if possible.

You sort of do this already with __fspath__.

That is exclusive to local files (makes no sense otherwise)

@martindurant
Contributor

We could easily add another parameter called storage_options to read_csv that accepts a dict

... and all the other functions/methods that deal with file IO?

@gfyoung
Member

gfyoung commented Apr 16, 2020

I would much rather put all the logic into `OpenFile.open`, if possible.

Not sure I follow you here. pandas would need to operate on a file-like (and special-casing for fsspec is not ideal for us). How would putting all the logic into open address this?

That is exclusive to local files (makes no sense otherwise)

Hence why I said "sort of"

... and all the other functions/methods that deal with file IO?

Yes. Not sure how we would pass them otherwise.

@martindurant
Contributor

martindurant commented Apr 17, 2020

I meant like this, which does seem to successfully clean up the file-like and the OpenFile, so long as the file-like has close() called. You would need to use a weakref (to f.close?) to break the reference cycle to clean up when f is garbage collected instead.

--- a/fsspec/core.py
+++ b/fsspec/core.py
@@ -72,6 +73,7 @@ class OpenFile(object):
         self.errors = errors
         self.newline = newline
         self.fobjects = []
+        self.ref_close = None

     def __reduce__(self):
         return (
@@ -126,10 +128,16 @@ class OpenFile(object):
         The file should be explicitly closed to avoid enclosed open file
         instances persisting
         """
-        return self.__enter__()
+        f = self.__enter__()
+        self.ref_close = f.close
+        f.close = self.close
+        return f

     def close(self):
         """Close all encapsulated file objects"""
+        if self.ref_close:
+            self.fobjects[-1].close = self.ref_close
+            self.ref_close = None
