DOC: s3fs is required when using read_csv with an S3 URI #35206

Open
Tracked by #6 ...
MartinThoma opened this issue Jul 10, 2020 · 6 comments
Labels: Docs; IO CSV (read_csv, to_csv); IO Network (Local or Cloud (AWS, GCS, etc.) IO Issues)

Comments

@MartinThoma

Location of the documentation

pandas.read_csv

Documentation problem

I've just noticed that s3fs is required when you read a URL from S3. While it is documented that you can read from S3, the fact that you need to install an extra package for this is not documented.

Also, it would be nice if this was a pandas extra in setup.py (e.g. s3).
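
To make the problem concrete, here is a minimal sketch of a pre-flight check, not something taken from the pandas docs. It assumes only what this issue states: reading an `s3://` URL with `pandas.read_csv` needs the optional `s3fs` package. The helper names and the bucket in the usage note are hypothetical.

```python
import importlib.util


def s3fs_available() -> bool:
    """Return True if the optional s3fs package can be found on this system."""
    return importlib.util.find_spec("s3fs") is not None


def read_csv_from_s3(url: str):
    """Read a CSV from an s3:// URL, failing early with a clear message."""
    if not s3fs_available():
        raise ImportError(
            "Reading s3:// URLs with pandas requires the optional "
            "'s3fs' package (for example: pip install s3fs)."
        )
    # Imported lazily so the dependency check above runs first.
    import pandas as pd

    return pd.read_csv(url)
```

Usage would look like `read_csv_from_s3("s3://example-bucket/data.csv")`, where `example-bucket` is a placeholder; valid AWS credentials are still needed for private buckets.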

@MartinThoma added the Docs and Needs Triage labels Jul 10, 2020
@jorisvandenbossche
Member

The user guide mentions it: https://pandas.pydata.org/docs/user_guide/io.html#reading-remote-files, and the install guide as well: https://pandas.pydata.org/docs/getting_started/install.html#optional-dependencies

I think it would probably be too much to list all optional dependencies in the read_csv docstring as well (S3 is one, but e.g. Azure or Google Cloud need other optional deps), but we should maybe mention in general that additional deps might be needed and link to one of the other places where this is explained?
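
Until the docstring carries such a pointer, user code can add one itself. The following is only a sketch: it assumes a missing optional dependency surfaces as an `ImportError`, and `read_with_hint` is a hypothetical helper, not a pandas function. The URL is the install guide linked above.

```python
INSTALL_GUIDE = (
    "https://pandas.pydata.org/docs/getting_started/install.html"
    "#optional-dependencies"
)


def read_with_hint(reader, path, **kwargs):
    """Call reader(path, **kwargs); on ImportError, point at the install guide."""
    try:
        return reader(path, **kwargs)
    except ImportError as exc:
        # Keep the original message and chain the original exception.
        raise ImportError(
            f"{exc} (reading {path!r} may need an optional dependency; "
            f"see {INSTALL_GUIDE})"
        ) from exc
```

For example, `read_with_hint(pd.read_csv, "s3://example-bucket/data.csv")` would behave like `pd.read_csv` when everything is installed; the bucket name is a placeholder.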

@simonjayhawkins removed the Needs Triage label Jul 10, 2020
@MartinThoma
Author

I wasn't aware that there are even more 😱

> we should maybe mention it in general that additional deps might be needed

Sounds good! Should I make a PR?

@jorisvandenbossche
Member

Yes, PR very welcome!

@jorisvandenbossche added this to the Contributions Welcome milestone Jul 11, 2020
@abdoulayegk

Hello, can I make a PR, since nobody has made one yet?

@MartinThoma
Author

@abdoulayegk Oops, sorry, I forgot. Please go ahead if you want to take care of that :-)

@alecglassford

  1. Perhaps this has already been noted, but it looks like fsspec also needs to be installed in addition to s3fs or gcsfs (related PR: ENH: add fsspec support #34266). This is reflected in the optional dependencies list, but it's not necessarily obvious at first glance. It might be nice if the relevant rows noted this requirement, for example (my addition in bold):

    | Dependency | Minimum Version | Notes |
    | --- | --- | --- |
    | gcsfs | 0.6.0 | Google Cloud Storage access **(must be used with fsspec)** |
    | s3fs | 0.4.0 | Amazon S3 access **(must be used with fsspec)** |
  2. If you're adding a link in the read_csv docstring to the optional dependencies list, it likely makes sense to add an identical link to the docstrings of other pandas.read_{format} methods. I'm not sure it applies to all of them, but at least pandas.read_json and pandas.read_excel.

  3. I couldn't find a list of all the supported filesystems anywhere; the most comprehensive listing I found is this release note. Given that fsspec supports many filesystems, maybe it's not feasible to list them all (and keep up with a potentially growing list); however, the reading remote files section of the IO doc could be updated to link to the fsspec documentation for users to learn about additional compatible filesystems. (Unfortunately, I couldn't find a more concise list of supported filesystems in the fsspec documentation than the source code that I just linked to.)
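
Points 1 and 3 can be sketched in a few lines. This is an illustration under stated assumptions, not pandas or fsspec documentation: the package pairings come from the table in point 1, the URL schemes (`s3`, `gs`, `gcs`) are an assumption about how such URLs are usually written, and both helper names are hypothetical. `known_implementations` is the registry dictionary in `fsspec.registry`; it is only consulted when fsspec is installed.

```python
import importlib.util
from urllib.parse import urlsplit

# URL scheme -> packages that must be importable (pairings from point 1).
# The scheme names themselves are an assumption for this sketch.
REQUIRED_PACKAGES = {
    "s3": ("fsspec", "s3fs"),
    "gs": ("fsspec", "gcsfs"),
    "gcs": ("fsspec", "gcsfs"),
}


def missing_packages(url: str) -> list[str]:
    """Return the packages needed for this URL's scheme that are not installed."""
    scheme = urlsplit(url).scheme.lower()
    return [
        name
        for name in REQUIRED_PACKAGES.get(scheme, ())
        if importlib.util.find_spec(name) is None
    ]


def known_protocols() -> list[str]:
    """List protocol names from fsspec's registry, or [] if fsspec is absent."""
    if importlib.util.find_spec("fsspec") is None:
        return []
    from fsspec.registry import known_implementations

    return sorted(known_implementations)
```

Calling `missing_packages("s3://example-bucket/data.csv")` (placeholder bucket) returns an empty list when both fsspec and s3fs are installed. A protocol appearing in `known_protocols()` does not mean its backend package is installed, only that fsspec knows which package would provide it.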

Sorry if these are beyond the scope of this issue! They seemed closely related, so I thought that I would note these gaps here rather than create a new issue.

@mroeschke added the IO CSV and IO Network labels Aug 8, 2021
@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
6 participants