# ENH: Use fsspec for reading/writing from/to S3, GCS, Azure Blob, etc. #33452
Keen to hear any thoughts!

One challenge is that pandas wouldn't be able to take a hard dependency on fsspec. Whatever solution we end up with, we'll need to keep that in mind. IIRC, the s3- and gcs-specific bits of code are around URL discovery (taking a URL like `s3://bucket/key` and turning it into a file-like object). cc @martindurant just for your awareness.

Thanks for posting @jrderuiter, and for letting me know @TomAugspurger. Sorry, what does `TestIOWrapper` do? Note the existence of `fsspec.open`, which allows you to get a file-like object from a target in a number of ways (e.g., with local caching, decompression, text decoding). I can well understand pandas not wanting to depend upon fsspec, although it is small and has no deps of its own. It would be good to massively increase the fsspec test suite, e.g., for all encoding options.

Pandas parsers work on text streams, I think. So when we're given a byte stream we wrap it in `TextIOWrapper`: https://docs.python.org/3/library/io.html#io.TextIOWrapper
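
For illustration, a minimal sketch of that wrapping (the S3 path is made up, and `s3fs` would need to be installed to handle the `s3://` protocol):

```python
import io

import fsspec

# fsspec hands us a binary file-like object for the remote key...
with fsspec.open("s3://bucket/data.csv", mode="rb") as raw:
    # ...and TextIOWrapper turns it into the decoded text stream
    # that the pandas parsers expect.
    text = io.TextIOWrapper(raw, encoding="utf-8")
    header = text.readline()
```
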
You wrote test, not text :) So using `fsspec.open` in text mode would take care of that for you.

@TomAugspurger What do you mean exactly with your comment "One challenge is that pandas wouldn't be able to take a hard dependency on fsspec"? Currently pandas is also relying on s3fs and gcsfs to open files on these file systems, right? Or do you mean that pandas should be able to work normally with local files without having fsspec installed? Or does it have to do with the quality of the fsspec package? (Which is closely related to gcsfs and s3fs.)

In essence, most of the integration would indeed involve URL handling (to identify which file system should be used) and then calling `open` on the fsspec file system. We could also use `fsspec.open` directly, which lets fsspec figure out the file system and also lets you specify options for reading in text mode and handling things like compression.
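
For example, a sketch of the `fsspec.open` route (the URL is made up, and the matching backend, here gcsfs, must be installed):

```python
import fsspec

# fsspec.open picks the filesystem from the protocol prefix, and can
# decompress and decode to text in one step.
with fsspec.open(
    "gcs://bucket/data.csv.gz", mode="rt", compression="gzip", encoding="utf-8"
) as f:
    first_line = f.readline()
```
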
Those are optional dependencies. We couldn't, for example, call `fsspec.open` unconditionally.

The inclusion of any protocol specifier ("://") could be taken to mean "use fsspec", although this would mean that "path" and "file://path" don't use the same code.
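
A sketch of what that dispatch could look like (hypothetical helper, not actual pandas or fsspec code):

```python
def open_maybe_remote(url: str, mode: str = "rb"):
    """Route anything with a protocol prefix through fsspec (sketch only)."""
    if "://" in url:
        import fsspec  # lazy import, since fsspec stays an optional dependency

        # OpenFile.open() hands back a live file-like object.
        return fsspec.open(url, mode=mode).open()
    # Bare paths keep the existing local code; note that "file://..." URLs
    # would go through fsspec instead, which is the wrinkle mentioned above.
    return open(url, mode)
```
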
I don't think it would be hard to integrate; this could start by using fsspec instead of gcsfs and s3fs and seeing what needs changing.

I'd hope that we get back on the same code path pretty quickly though, right? We'd take a detour through fsspec to get a file-like object, and from there the parsing goes through the usual code.

Right, hope so :)

I'll see if I can start drafting a PR :)

Just added an initial PR that changes `get_filepath_or_buffer` to use fsspec for non-file URLs supported by (known) fsspec implementations. Unfortunately we couldn't use `fsspec.open` directly, since pandas expects an already-open file object back rather than a context manager.

One thing that's still missing is how to handle the `storage_options` arguments to fsspec. For some filesystems we need to be able to pass `storage_options` to provide required arguments (such as account name/key for Azure Blob). Can we add this argument to the `read_*` functions and `to_*` methods and pass it through to `get_filepath_or_buffer`, as sketched below? Or would another approach be preferable? (Setting the options globally for a given protocol could be possible, but maybe also not very elegant.)

Some tests in `pandas/tests/io/test_gcs.py` seem to be failing due to the monkey patching done in the tests; running the tests individually seems to work fine. Any other tests I should focus on? (The `test_s3.py` tests seem very superficial, btw; we could consider adding some 'real' tests using moto.)
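
A rough sketch of that pass-through (hypothetical and heavily simplified; the real `get_filepath_or_buffer` returns more than just the buffer, and `storage_options` is the proposed new keyword):

```python
import fsspec


def get_filepath_or_buffer(filepath_or_buffer, storage_options=None):
    """Simplified sketch: route fsspec URLs, pass anything else through."""
    if isinstance(filepath_or_buffer, str) and "://" in filepath_or_buffer:
        # storage_options (e.g. account_name/account_key for abfs) are
        # forwarded to the filesystem implementation behind the URL.
        return fsspec.open(
            filepath_or_buffer, mode="rb", **(storage_options or {})
        ).open()
    return filepath_or_buffer
```
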
In our examples, you typically see something like `with fsspec.open(url) as f: df = pd.read_csv(f)` (i.e., the context manager produces the open file and closes it when done). How is this done now for s3fs?

Yeah, but pandas just expects a file object to be returned by `get_filepath_or_buffer`, which the calling code will close for us. This means we can't use the context manager approach. Currently no storage options are passed to s3fs, I guess because boto3 allows you to specify things like access keys using environment variables and AWS configs.

So you would need to keep hold of the `OpenFile` object, or have the generated file-like object keep a (circular) reference to it? That's a bit annoying. True, the defaults for S3 will often work... GCS also has ways to define the defaults, but I'm not sure about Azure. Some backends (like ssh) will always need parameters, although some of these can be encoded into the URL.

Yeah, I was also a bit disappointed. Adding a circular reference might work around the issue, but it's not really elegant :( Still, it would be nice to be able to explicitly pass these options somewhere. I now also have some code that interfaces with multiple Azure Blob stores or multiple S3 buckets with different credentials; setting these things globally is not really sufficient in this case.

In dask we have a single optional `storage_options` argument for the various IO functions, and fsspec itself refers to this convention in places.
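
For reference, the dask version of this (dask's readers really do take `storage_options`; the credentials here are placeholders):

```python
import dask.dataframe as dd

# storage_options is forwarded to the fsspec filesystem for the URL's
# protocol -- here, s3fs.
df = dd.read_csv(
    "s3://bucket/*.csv",
    storage_options={"key": "<access-key>", "secret": "<secret-key>"},
)
```
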
Well... the simplest thing you could do is have each reader open and close the `OpenFile` itself as a context, but unfortunately there are many read functions. So I am thinking that we could change `OpenFile.open` so that the file-like it returns holds a reference back to the `OpenFile`, which would keep things alive. This does make a circular reference, but it is broken again when the file is closed.

cc @gfyoung if you have thoughts on the overall flow here.

Couple of points: one option is, could we make `OpenFile` itself behave like the file it wraps, passing method calls through to the underlying file object?

Previous experience of magically/silently passing method calls through to another object has proven error-prone for me; I would much rather put all the logic into `OpenFile.open`, if possible.

That is exclusive to local files (makes no sense otherwise).

... and all the other functions/methods that deal with file IO?

Not sure I follow you here.

Hence why I said "sort of".

Yes. Not sure how we would pass them otherwise.

I meant like this, which does seem to successfully clean up the file-like and the `OpenFile`, so long as the file-like actually gets closed:

```diff
--- a/fsspec/core.py
+++ b/fsspec/core.py
@@ -72,6 +73,7 @@ class OpenFile(object):
         self.errors = errors
         self.newline = newline
         self.fobjects = []
+        self.ref_close = None
 
     def __reduce__(self):
         return (
@@ -126,10 +128,16 @@ class OpenFile(object):
         The file should be explicitly closed to avoid enclosed open file
         instances persisting
         """
-        return self.__enter__()
+        f = self.__enter__()
+        self.ref_close = f.close
+        f.close = self.close
+        return f
 
     def close(self):
         """Close all encapsulated file objects"""
+        if self.ref_close:
+            self.fobjects[-1].close = self.ref_close
+            self.ref_close = None
```
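
Under that patch, a caller that only ever sees the file-like would still tear everything down (a sketch assuming the patched fsspec above; the path is made up):

```python
import fsspec

of = fsspec.open("s3://bucket/data.csv", mode="rb")
f = of.open()   # with the patch, f.close is rebound to OpenFile.close
try:
    data = f.read()
finally:
    f.close()   # closes f and everything the OpenFile wrapped around it
```
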
**Is your feature request related to a problem?**
Currently pandas has some support for S3 and GCS via the `pandas.io.{gcs,s3}` modules, which are based on s3fs and gcsfs.
It seems like we could easily broaden the support for different filesystems by leveraging the fsspec library (https://pypi.org/project/fsspec/) and its interface implementations (see https://github.com/intake/filesystem_spec/blob/master/fsspec/registry.py for some examples) to read/write files in pandas.
This way, we would also be able to use filesystems such as Azure-based storage systems directly from pandas.
**Describe the solution you'd like**
I'd like to be able to use the different file systems supported by fsspec in pandas with something like:
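```python
import pandas as pd

# (Sketch with made-up paths; each protocol would be handled by the
# matching fsspec implementation, e.g. s3fs, gcsfs, adlfs.)
df = pd.read_csv("s3://my-bucket/data.csv")
df = pd.read_parquet("gcs://my-bucket/data.parquet")
df.to_csv("abfs://my-container/output.csv")
```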
**API breaking implications**
In principle, it looks as if we could cover most of the work by adapting `get_filepath_or_buffer` in `pandas/io/common.py` to use fsspec. We would of course have to verify that fsspec doesn't break anything compared to the current implementations.
One challenge is that some storage systems require extra arguments (called `storage_options` in fsspec). For example, Azure Blob requires the user to pass two storage options (`account_name` and `account_key`) to be able to access the storage. We would need to consider how to pass these options to the correct methods, either by (a) setting these options globally for a given type of storage or (b) passing the options through the `pd.read_*` functions and `pd.DataFrame.to_*` methods, as sketched below.
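
A sketch of option (b), with a hypothetical `storage_options` keyword on the readers (credentials are placeholders):

```python
import pandas as pd

# Hypothetical keyword, forwarded to the fsspec filesystem for "abfs://".
df = pd.read_csv(
    "abfs://my-container/data.csv",
    storage_options={"account_name": "<account>", "account_key": "<key>"},
)
```
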
**Describe alternatives you've considered**
This seems like a change that would have a small impact, as pandas already uses s3fs and gcsfs, which are both implementations of the broader fsspec interface. It should also provide support for a great number of different filesystems with minimal changes, compared to adding support for each filesystem ourselves. As such, I would think it is the preferred approach.
Another approach would be to add support for each additional filesystem as it comes along. This, however, would require adding and maintaining code for each filesystem, which seems less preferable.
**Additional context**
I recently implemented similar wrappers for pandas code at one of our clients, and am therefore somewhat familiar with fsspec etc. I would be happy to see if we can contribute these ideas + code to the pandas project.