
AWS_S3_HOST environment variable no longer available #26195


Closed
jakobkogler opened this issue Apr 23, 2019 · 7 comments · Fixed by #54279
Assignees
Labels
Docs, good first issue, IO Network (Local or Cloud (AWS, GCS, etc.) IO issues)

Comments

@jakobkogler

In the What's New - Other enhancements section of the 0.18.0 release notes, it says that you can define the environment variable AWS_S3_HOST.

Therefore, when setting this environment variable together with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, I expected the following code to work.

import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))
df.to_csv('s3://bucket/key')

The script crashes with:

Exception ignored in: <bound method S3File.__del__ of <S3File bucket/key>>
Traceback (most recent call last):
  File "/home/xxx/.local/share/virtualenvs/xxx/lib/python3.6/site-packages/s3fs/core.py", line 1518, in __del__
    self.close()
  File "/home/xxx/.local/share/virtualenvs/xxx/lib/python3.6/site-packages/s3fs/core.py", line 1496, in close
    raise_from(IOError('Write failed: %s' % self.path), e)
  File "<string>", line 3, in raise_from
OSError: Write failed: bucket/key

This is the exact same error as when I don't specify the variable AWS_S3_HOST at all.

I managed to get a working solution by interacting directly with s3fs:

import os
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': os.environ["AWS_S3_HOST"]})
df = pd.DataFrame(dict(a=[1, 2, 3], b=[4, 5, 6]))
df.to_csv(fs.open('s3://bucket/key', mode='w'))

Looks like this feature got (accidentally?) removed in https://github.com/pandas-dev/pandas/commit/dc4b0708f36b971f71890bfdf830d9a5dc019c7b#diff-a37b395bed03f0404dec864a4529c97dL94, when pandas switched from boto to s3fs.
Is it planned to support this environment variable again?
I'm sure that many companies, like the one I work for, host their own S3 server (e.g. with MinIO) rather than using Amazon.

@WillAyd
Member

WillAyd commented Apr 23, 2019

This is pretty similar to #16662 - I don't think this is something we'd generally offer support for so using the file system directly as you have in your solution is probably the way to go.

@TomAugspurger in case he has differing thoughts

@gfyoung added the IO Data label Apr 25, 2019
@xieqihui

xieqihui commented Oct 17, 2019

+1 for this feature. It would be good to support a customised S3 endpoint, as S3-compatible storage is widely used.

I have created a pull request to support this.

@jbrockmendel added the IO Network label and removed the IO Data label Dec 12, 2019
@MarcoGorelli
Member

Closing because, as I gather from the discussion in #29050, this would be out-of-scope for pandas (please ping me if I've misunderstood and I'll reopen)

@TomAugspurger
Contributor

TomAugspurger commented Aug 31, 2020

You can do this now with storage_options. See how we do it in our tests:

@pytest.fixture
def s3so(worker_id):
    worker_id = "5" if worker_id == "master" else worker_id.lstrip("gw")
    return dict(client_kwargs={"endpoint_url": f"http://127.0.0.1:555{worker_id}/"})


@pytest.fixture(scope="session")
def s3_base(worker_id):
    """
    Fixture for mocking S3 interaction.

    Sets up moto server in separate process
    """
    pytest.importorskip("s3fs")
    pytest.importorskip("boto3")
    requests = pytest.importorskip("requests")

    with tm.ensure_safe_environment_variables():
        # temporary workaround as moto fails for botocore >= 1.11 otherwise,
        # see https://github.com/spulec/moto/issues/1924 & 1952
        os.environ.setdefault("AWS_ACCESS_KEY_ID", "foobar_key")
        os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "foobar_secret")
        pytest.importorskip("moto", minversion="1.3.14")
        pytest.importorskip("flask")  # server mode needs flask too

        # Launching moto in server mode, i.e., as a separate process
        # with an S3 endpoint on localhost
        worker_id = "5" if worker_id == "master" else worker_id.lstrip("gw")
        endpoint_port = f"555{worker_id}"
        endpoint_uri = f"http://127.0.0.1:{endpoint_port}/"

If anyone wants to turn that into a doc example we'd welcome it.
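To round out the pointer above, here is a minimal, self-contained sketch of building a `storage_options` dict for a custom endpoint. The helper name `s3_storage_options` and the reuse of the AWS_S3_HOST variable are my own convention for illustration, not pandas API; `storage_options` itself is the pandas keyword (available since pandas 1.2), which pandas simply forwards to s3fs.

```python
import os

def s3_storage_options():
    """Build a pandas ``storage_options`` dict from the AWS_S3_HOST env var.

    The env-var convention and helper name are illustrative, not pandas API.
    If AWS_S3_HOST is unset, an empty dict is returned and pandas falls back
    to the default (Amazon) endpoint.
    """
    opts = {}
    host = os.environ.get("AWS_S3_HOST")
    if host:
        opts["client_kwargs"] = {"endpoint_url": host}
    return opts

# Usage (requires pandas >= 1.2, s3fs, and a reachable endpoint):
# import pandas as pd
# df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
# df.to_csv("s3://bucket/key", storage_options=s3_storage_options())
```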

@simonjayhawkins added the Docs and good first issue labels and removed the Enhancement and IO Network labels Aug 31, 2020
@simonjayhawkins added this to the Contributions Welcome milestone Aug 31, 2020
@mroeschke added the IO Network label Jul 2, 2021
@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@kvn4
Contributor

kvn4 commented Jul 19, 2023

take

@kvn4
Contributor

kvn4 commented Jul 19, 2023

Hi @TomAugspurger, I just wanted to clarify a few things as I am taking the issue. I was thinking of updating this in the pandas user_guide section; does this look like the right place to you? Also, is the fix that we are setting the endpoint URL in the s3so fixture, or that we are explicitly defining the variables (endpoint_port and endpoint_uri) in the s3_base fixture? In addition, it is not quite clear to me why the worker_id code is duplicated.

@TomAugspurger
Contributor

I think that's the right spot to document this.

If I understand things, this is documented in s3fs at https://s3fs.readthedocs.io/en/latest/index.html?highlight=host#s3-compatible-storage. So we're likely just summarizing that section and linking to it.

kvn4 added a commit to kvn4/pandas that referenced this issue Jul 28, 2023
mroeschke pushed a commit that referenced this issue Jul 28, 2023 (#54279)

* Added AWS_S3_HOST environment variable workaround to user_guide for issue #26195
* Updated wording for issue #26195