Support customised S3 servers endpoint URL #29050


Closed
wants to merge 5 commits into from

Conversation

xieqihui

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Add support for customised S3 servers by checking the environment variable S3_ENDPOINT.

If set, the value of S3_ENDPOINT will be passed as the endpoint_url parameter to boto3.Session. It should be the complete URL to the S3 service, following the format http(s)://{host}:{port}.

This feature is useful for companies that run their own S3-compatible servers, such as MinIO or Ceph, as mentioned in issue #26195.
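The behaviour proposed above can be sketched as a standalone snippet (the helper name `build_client_kwargs` is illustrative only and not part of the actual patch):

```python
import os
from typing import Dict, Optional


def build_client_kwargs() -> Optional[Dict[str, str]]:
    # If S3_ENDPOINT is set, forward it to boto3 as endpoint_url;
    # otherwise return None so boto3 falls back to the AWS S3 default.
    s3_endpoint = os.environ.get("S3_ENDPOINT")
    if s3_endpoint:
        return {"endpoint_url": s3_endpoint}
    return None


# e.g. pointing at a self-hosted server such as a local MinIO instance:
os.environ["S3_ENDPOINT"] = "http://localhost:9000"
print(build_client_kwargs())  # {'endpoint_url': 'http://localhost:9000'}
```

With the variable unset, the function returns None and boto3 behaves exactly as before.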

pandas/io/s3.py Outdated

```python
# the format: http(s)://{host}:{port}
s3_endpoint = os.environ.get('S3_ENDPOINT')
if s3_endpoint:
    client_kwargs = {'endpoint_url': s3_endpoint}
```
Member

@MarcoGorelli MarcoGorelli Oct 17, 2019


Hi xieqihui :) I believe there needs to be a type annotation, such as

```python
client_kwargs: Optional[Dict[str, str]] = {"endpoint_url": s3_endpoint}
```

Author


Thanks for the reminder. I have added the type annotation and formatted the code.

pandas/io/s3.py Outdated

```python
# Support customised S3 servers endpoint URL via environment variable
# The S3_ENDPOINT should be the complete URL to S3 service following
# the format: http(s)://{host}:{port}
s3_endpoint = os.environ.get('S3_ENDPOINT')
```
Member

@MarcoGorelli MarcoGorelli Oct 17, 2019


Part of the CI process checks that the code would be left unchanged by the black autoformatter. Try running

```shell
black pandas/io/s3.py --diff
```

to see the changes (and leave out the --diff flag to apply them).

@TomAugspurger
Contributor

Is S3_ENDPOINT a common variable in the S3 ecosystem? Should s3fs just be using it as the default value for endpoint_url if it's not otherwise specified?

Ideally, pandas has as little s3-specific code as possible. We'd like to delegate it to other libraries.

@xieqihui
Author

Hi @TomAugspurger. In this commit, if S3_ENDPOINT is not specified, s3fs will use the default value for endpoint_url, i.e. None, as documented in the Boto3 client API. In that case, s3fs connects to AWS S3 by default. Therefore, people need minimal effort to use S3 with pandas: those who use AWS S3 simply use it as before, while those who run their own S3 servers define the S3_ENDPOINT environment variable.

S3_ENDPOINT is a common variable to set when people use their own S3-compatible object storage servers instead of AWS S3. This scenario is becoming more and more common as solutions like MinIO and Ceph gain popularity. One example is that TensorFlow supports S3 via the S3_ENDPOINT environment variable. Pandas used to support this option via AWS_S3_HOST before switching from boto to s3fs in this commit, as discussed in #26195.

```python
# the format: http(s)://{host}:{port}. If S3_ENDPOINT is undefined, it will
# fall back to the default AWS S3 endpoint as determined by boto3.
s3_endpoint = os.environ.get("S3_ENDPOINT")
client_kwargs: Optional[Dict[str, str]] = (
    {"endpoint_url": s3_endpoint} if s3_endpoint else None
)
```
Contributor


Wouldn't this be better as a feature directly in s3fs?

Author


Well, here comes an awkward situation: from the point of view of s3fs and boto3, they already support a customised endpoint by accepting the endpoint_url parameter via client_kwargs, so there is no point in adding environment variable support for this parameter there, as discussed in this boto3 issue.

The problem with pandas now is that it doesn't provide a way to pass this parameter through to s3fs. I think adding support via this environment variable will benefit many people using their own S3 servers.
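For comparison, a minimal sketch of what s3fs users can already do today by passing the parameter directly (the localhost endpoint is a placeholder; the s3fs calls are shown commented out since they require the library and a running server):

```python
# client_kwargs is forwarded by s3fs to boto3's client constructor.
client_kwargs = {"endpoint_url": "http://localhost:9000"}  # placeholder endpoint

# With s3fs installed and an S3-compatible server running, this is the
# path that pandas currently has no way to trigger:
#   import s3fs
#   fs = s3fs.S3FileSystem(client_kwargs=client_kwargs)
#   with fs.open("bucket/key.csv") as f:
#       ...
print(client_kwargs["endpoint_url"])
```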

Contributor


Why wouldn't using the environment variable as the default endpoint_url go in s3fs, to benefit even more people?

@jreback jreback added the IO Data IO issues that don't fit into a more specific label label Oct 18, 2019
@TomAugspurger
Contributor

Agreed with Jeff. I would prefer to see this in s3fs.

@jreback
Contributor

jreback commented Oct 22, 2019

Closing as out of scope for pandas; encouraged to submit to s3fs.

@martindurant
Contributor

Would appreciate it if people on this thread would comment in fsspec/s3fs#273 as to how they would expect users to codify such arguments. I suppose environment variables are certainly an option, but passing arguments would be preferable. The linked issue talks about embedding these arguments in the URL itself.
