
ENH: Use fsspec for reading/writing from/to S3, GCS, Azure Blob, etc. #33452


Closed
jrderuiter opened this issue Apr 10, 2020 · 26 comments · Fixed by #34266
Labels: Enhancement, IO (Data IO issues that don't fit into a more specific label)

Comments

@jrderuiter

jrderuiter commented Apr 10, 2020

Is your feature request related to a problem?

Currently pandas has some support for S3 and GCS using the pandas.io.{gcs,s3} modules, which are based on s3fs and gcsfs.

It seems like we could easily broaden the support for different filesystems by leveraging the fsspec library (https://pypi.org/project/fsspec/) and its interface implementations (see https://github.com/intake/filesystem_spec/blob/master/fsspec/registry.py for some examples) to read/write files in pandas.

This way, we would also be able to use filesystems such as Azure-based storage systems directly from pandas.

Describe the solution you'd like

I'd like to be able to use the different file systems supported by fsspec in pandas with something like:

import pandas as pd

df1 = pd.read_csv("abfs://my_container/my_file.csv")
df1.to_json("file:///some/local/path.json") # Also works without file:// prefix.

df2 = pd.read_csv("s3://my_bucket/my_file.csv")
...

API breaking implications

In principle, it looks as if we could cover most of the work by adapting get_filepath_or_buffer in pandas/io/common.py to use fsspec. We would of course have to verify that fsspec doesn't break anything compared to the current implementations.
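As a rough illustration, the adapted helper could look something like the sketch below (illustrative only, not the final API; it assumes fsspec is installed and keeps the current behaviour for local paths and buffers):

import fsspec

def get_filepath_or_buffer(filepath_or_buffer, mode="rb", storage_options=None):
    # Sketch: route anything with an explicit protocol (s3://, gs://, abfs://, ...)
    # through fsspec; everything else falls back to the existing handling.
    if isinstance(filepath_or_buffer, str) and "://" in filepath_or_buffer:
        open_file = fsspec.open(filepath_or_buffer, mode=mode,
                                **(storage_options or {}))
        # Note the lifetime caveat discussed further down in this thread: the
        # OpenFile must stay alive for as long as the returned handle is used.
        return open_file.open()
    return filepath_or_buffer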

One challenge is that some storage systems require extra arguments (called storage options in fsspec). For example, Azure blob requires the user to pass two storage options (account_name and account_key) to be able to access the storage. We would need to consider how to pass these options to the correct methods, either by (a) setting these options globally for a given type of storage or (b) passing the options through the pd.read_* functions and pd.DataFrame.to_* methods.
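For example, option (b) could look roughly like this (the storage_options keyword follows the fsspec/dask convention and is only a proposal here; the account values are placeholders):

import pandas as pd

# Backend-specific credentials passed per call and forwarded to fsspec.
df = pd.read_csv(
    "abfs://my_container/my_file.csv",
    storage_options={"account_name": "myaccount", "account_key": "..."},
)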

Describe alternatives you've considered

This seems like a change with a relatively small footprint, as pandas already uses s3fs and gcsfs, which are both implementations of the broader fsspec interface. It should also provide support for a great number of different filesystems with minimal changes, compared to adding support for each filesystem ourselves. As such, I would think it is the preferred approach.

Another approach would be to add support for each additional filesystem as it comes along. This however would require adding and maintaining code for each filesystem, which seems less preferable.

Additional context

I recently implemented similar wrappers for pandas code at one of our clients, and am therefore somewhat familiar with fsspec etc. I would be happy to see if we can contribute these ideas + code to the pandas project.

@jrderuiter added the Enhancement and Needs Triage labels on Apr 10, 2020
@charlesdong1991 added the IO label and removed the Needs Triage label on Apr 10, 2020
@jrderuiter
Author

Keen to hear any thoughts!

@TomAugspurger
Contributor

TomAugspurger commented Apr 10, 2020

One challenge is that pandas wouldn't be able to take a hard dependency on fsspec. Whatever solution we end up with we'll need to keep that in mind.

IIRC, the s3- and gcs-specific bits of code are around URL discovery (taking a url like s3:// and deciding to treat it specially) and the requirement that the buffer returned by _get_filepath_or_buffer needs to be wrapped in a TextIOWrapper, since s3fs / gcsfs / fsspec only deal with bytes.
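For illustration, this is roughly the wrapping involved today (bucket and key are placeholders; assumes s3fs is installed):

import io
import s3fs

fs = s3fs.S3FileSystem(anon=True)
raw = fs.open("my-bucket/my_file.csv", mode="rb")   # bytes-only handle
text = io.TextIOWrapper(raw, encoding="utf-8")      # the text layer pandas adds
print(text.readline())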

cc @martindurant just for your awareness.

@martindurant
Contributor

Thanks for posting @jrderuiter, and for letting me know @TomAugspurger.

Sorry, what does TestIOWrapper do?

Note the existence of fsspec.open, which allows you to get a file-like object from a target in a number of ways (e.g., with local caching, decompression, text decoding).
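For example (the URL, compression and encoding here are placeholders, not tied to any real dataset):

import fsspec

# mode="rt" returns a decoded text handle; compression="infer" lets fsspec
# decompress based on the file extension.
with fsspec.open("s3://my-bucket/data.csv.gz", mode="rt",
                 compression="infer", encoding="utf-8") as f:
    header = f.readline()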

I can well understand pandas not wanting to depend upon fsspec, although it is small and has no deps of its own. It would be good to massively increase the fsspec test suite, e.g., for all encoding options.

@TomAugspurger
Contributor

TomAugspurger commented Apr 10, 2020 via email

@martindurant
Contributor

You wrote test, not text :)

So using fsspec.open with a text mode does this too.

@jrderuiter
Author

jrderuiter commented Apr 10, 2020

@TomAugspurger What do you mean exactly by your comment

One challenge is that pandas wouldn't be able to take a hard dependency on fsspec.

Currently pandas already relies on s3fs and gcsfs to open files on these file systems, right? Or do you mean that pandas should be able to work normally with local files without having fsspec installed? Or does it have to do with the quality of the fsspec package? (Which is closely related to gcsfs and s3fs.)

@jrderuiter
Author

In essence, most of the integration would indeed involve URL handling (identifying which file system should be used and then calling open on the fsspec file system). We could also use fsspec.open directly, which lets fsspec figure out the file system and also lets you specify options for reading in text mode and handling things like compression.

@TomAugspurger
Contributor

TomAugspurger commented Apr 10, 2020 via email

@martindurant
Contributor

The inclusion of any protocol specifier, "://", could be taken to mean "use fsspec", although this would mean that "path" and "file://" don't use the same code.
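For illustration, that rule is roughly the following (a hypothetical helper, not existing pandas code):

def _should_use_fsspec(path_or_url) -> bool:
    # Any explicit protocol prefix routes through fsspec. This is the
    # asymmetry noted above: "file:///tmp/x.csv" would take the fsspec
    # branch while "/tmp/x.csv" would not, despite naming the same file.
    return isinstance(path_or_url, str) and "://" in path_or_url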

@jreback
Contributor

jreback commented Apr 10, 2020

I don’t think it would be hard to integrate this

could start by using it instead of gcsfs and s3fs and see what needs changing

@TomAugspurger
Contributor

although this would mean that "path" and "file://" don't use the same code.

I'd hope that we get back on the same code path pretty quickly though, right? We'd take a detour through fsspec to get the file object, but once we have that we're back to where we would have been.

@martindurant
Contributor

Right, hope so :)

@jrderuiter
Author

I don’t think it would be hard to integrate this

could start by using it instead of gcsfs and s3fs and see what needs changing

I’ll see if I can start drafting a PR :)

@jrderuiter
Author

Just added an initial PR that changes get_filepath_or_buffer to use fsspec for non-file URLs supported by (known) fsspec implementations.

Unfortunately we couldn't use fsspec.open for opening files, as the method returns an OpenFile instance which is not a file-like object. We could get an underlying file-like object by calling open on the OpenFile, but this file handle would be closed before we could ever read/write from/to the file, as the OpenFile parent object gets collected when it goes out of scope.

One thing that's still missing is how to handle the storage_options arguments to fsspec. For some filesystems we need to be able to pass storage_options to provide required arguments (such as account name/key for Azure Blob). Can we add this argument to the read_* functions and to_* methods and pass it through to get_filepath_or_buffer? Or would another approach be preferable? (Setting the options globally for a given protocol could be possible, but maybe also not very elegant.)

Some tests in pandas/tests/io/test_gcs.py seem to be failing due to the monkey patching done in the tests; running the tests individually seems to work fine. Any other tests I should focus on? (The test_s3.py tests seem very superficial, btw; we could consider adding some 'real' tests using moto, as sketched below.)
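For example, a moto-based roundtrip test could look roughly like this (a sketch only; the bucket name and assertions are made up, and it assumes moto's mock_s3 decorator, boto3 and s3fs are available, with dummy AWS credentials set in the environment):

import boto3
import pandas as pd
from moto import mock_s3

@mock_s3
def test_read_csv_from_mocked_s3():
    # Put a small CSV into a mocked bucket and read it back through pandas.
    client = boto3.client("s3", region_name="us-east-1")
    client.create_bucket(Bucket="pandas-test")
    client.put_object(Bucket="pandas-test", Key="simple.csv", Body=b"a,b\n1,2\n")

    df = pd.read_csv("s3://pandas-test/simple.csv")
    assert df.shape == (1, 2)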

@martindurant
Contributor

as the method returns an OpenFile instance which is not a file-like object

In our examples, you typically see

with fsspec.open(url, mode, compression, ...) as f:
    df = pd.read_csv(f, ...)

(i.e., the context makes the OpenFile into a concrete file-like)

For some filesystems we need to be able to pass storage_options to provide required arguments (such as account name/key for Azure Blob).

How is this done now for s3fs?

@jrderuiter
Author

jrderuiter commented Apr 14, 2020

Yeah, but pandas just expects a file object to be returned by get_filepath_or_buffer, which the calling code will close for us. This means we can’t use the context manager approach.

Currently no storage options are passed to s3fs, I guess because boto3 allows you to specify things like access keys etc using env variables and aws configs.

@martindurant
Contributor

So you would need to keep hold of the OpenFile object, or have the generated file-like keep a (circular) reference to it? That's a bit annoying.

True, the defaults for s3 will often work... GCS also has ways to define the default, but I'm not sure about azure. Some backends (like ssh) will always need parameters, although some of these can be encoded into the URL.
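For example (host, user and password are placeholders; the ssh/sftp backend needs paramiko installed):

import fsspec

# Connection details encoded directly in the URL.
with fsspec.open("ssh://user:secret@host.example.com:22/data/file.csv", mode="rt") as f:
    first_line = f.readline()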

@jrderuiter
Author

jrderuiter commented Apr 15, 2020

Yeah, I was also a bit disappointed. Adding a circular reference might work around the issue, but it's not really elegant :(

Still, it would be nice to be able to explicitly pass these options somewhere. I now also have some code that interfaces with multiple Azure Blob stores or multiple S3 buckets with different credentials. Setting these things globally is not really sufficient in this case.

@martindurant
Contributor

In dask we have a single optional storage_options for the various IO functions, and fsspec itself refers to this convention in places.
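For illustration, the dask convention looks like this (bucket and options are placeholders):

import dask.dataframe as dd

# One dict, forwarded verbatim to the fsspec filesystem constructor.
ddf = dd.read_csv("s3://my-bucket/*.csv", storage_options={"anon": True})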

@martindurant
Contributor

martindurant commented Apr 16, 2020

Well... the simplest thing you could do is

def read_csv(url, storage_options, ...):
    if isinstance(url, str) and "://" in url:
        with fsspec.open(url, **storage_options) as f:
            return read_csv(f, ...)

but unfortunately there are many read functions.

So I am thinking that we could change OpenFile.open to:

    def open(self):
        f = self.__enter__()
        f.close = self.close
        return f

which would keep things alive. This does make a circular reference, but OpenFile.close removes the reference to f from the OpenFile instance (and calling OpenFile.close again will have no effect).
This code will cause infinite recursion as written, however, since you can no longer reach the original version of f.close - but I think it's nearly there!

@TomAugspurger
Contributor

cc @gfyoung if you have thoughts on the overall flow here.

@gfyoung
Member

gfyoung commented Apr 16, 2020

Couple of points:

  • Allowing storage_options isn't super problematic here. We could easily add another parameter called storage_options to read_csv that accepts a dict. Perhaps there's a better way so that we don't add yet another parameter to read_csv, but this would be the simplest of course.

  • The issue of operating on an OpenFile object is a slightly more problematic one here for some of the reasons described above.

One option: could we make OpenFile file-like as well, e.g. we call OpenFile.open() and essentially have the OpenFile object then dispatch IO methods to the underlying file object(s)? You sort of do this already with __fspath__.
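For illustration, the kind of dispatch being described might look roughly like this (a hypothetical sketch, not fsspec's actual OpenFile):

class DelegatingOpenFile:
    # Hypothetical wrapper: forward read/write/seek/etc. to the concrete
    # file object created on open(), while keeping the OpenFile alive.
    def __init__(self, open_file):
        self._open_file = open_file
        self._f = open_file.open()

    def __getattr__(self, name):
        return getattr(self._f, name)

    def close(self):
        self._f.close()
        self._open_file.close()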

@martindurant
Contributor

Previous experience of magically/silently passing method calls through to another object has proven error-prone for me; I would much rather put all the logic into `OpenFile.open`, if possible.

You sort of do this already with __fspath__.

That is exclusive to local files (makes no sense otherwise)

@martindurant
Contributor

We could easily add another parameter called storage_options to read_csv that accepts a dict

... and all the other functions/methods that deal with file IO?

@gfyoung
Member

gfyoung commented Apr 16, 2020

I would much rather put all the logic into `OpenFile.open`, if possible.

Not sure I follow you here. pandas would need to operate on a file-like (and special-casing for fsspec is not ideal for us). How would putting all the logic into open address this?

That is exclusive to local files (makes no sense otherwise)

Hence why I said "sort of"

... and all the other functions/methods that deal with file IO?

Yes. Not sure how we would pass them otherwise.

@martindurant
Contributor

martindurant commented Apr 17, 2020

I meant like this, which does seem to successfully clean up the file-like and the OpenFile, so long as the file-like has close() called. You would need to use a weakref (to f.close?) to break the reference cycle to clean up when f is garbage collected instead.

--- a/fsspec/core.py
+++ b/fsspec/core.py
@@ -72,6 +73,7 @@ class OpenFile(object):
         self.errors = errors
         self.newline = newline
         self.fobjects = []
+        self.ref_close = None

     def __reduce__(self):
         return (
@@ -126,10 +128,16 @@ class OpenFile(object):
         The file should be explicitly closed to avoid enclosed open file
         instances persisting
         """
-        return self.__enter__()
+        f = self.__enter__()
+        self.ref_close = f.close
+        f.close = self.close
+        return f

     def close(self):
         """Close all encapsulated file objects"""
+        if self.ref_close:
+            self.fobjects[-1].close = self.ref_close
+            self.ref_close = None
