Support customised S3 servers endpoint URL #29050


Closed
wants to merge 5 commits into from

Conversation

xieqihui

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Add support for customised S3 servers by checking the environment variable S3_ENDPOINT.

If set, the value of S3_ENDPOINT will be passed as the endpoint_url parameter to boto3.Session. It should be the complete URL to the S3 service, following the format http(s)://{host}:{port}.

This feature is useful for companies that run their own S3-compatible servers, such as MinIO or Ceph, as mentioned in issue #26195.
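The behaviour proposed above can be sketched as a standalone snippet (the helper name `build_client_kwargs` is illustrative only and not part of the actual patch):

```python
import os
from typing import Dict, Optional


def build_client_kwargs() -> Optional[Dict[str, str]]:
    # If S3_ENDPOINT is set, forward it to boto3 as endpoint_url;
    # otherwise return None so boto3 falls back to the AWS S3 default.
    s3_endpoint = os.environ.get("S3_ENDPOINT")
    if s3_endpoint:
        return {"endpoint_url": s3_endpoint}
    return None


# e.g. pointing at a self-hosted server such as a local MinIO instance:
os.environ["S3_ENDPOINT"] = "http://localhost:9000"
print(build_client_kwargs())  # {'endpoint_url': 'http://localhost:9000'}
```

With the variable unset, the function returns None and boto3 behaves exactly as before.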

pandas/io/s3.py Outdated

```python
# the format: http(s)://{host}:{port}
s3_endpoint = os.environ.get('S3_ENDPOINT')
if s3_endpoint:
    client_kwargs = {'endpoint_url': s3_endpoint}
```
Member

@MarcoGorelli MarcoGorelli Oct 17, 2019


Hi xieqihui :) I believe there needs to be a type annotation, such as

```python
client_kwargs: Optional[Dict[str, str]] = {"endpoint_url": s3_endpoint}
```

Author


Thanks for the reminder. I have added the type annotation and formatted the code.

pandas/io/s3.py Outdated

```python
# Support customised S3 servers endpoint URL via environment variable
# The S3_ENDPOINT should be the complete URL to S3 service following
# the format: http(s)://{host}:{port}
s3_endpoint = os.environ.get('S3_ENDPOINT')
```
Member

@MarcoGorelli MarcoGorelli Oct 17, 2019


Part of the CI process checks that the code would be left unchanged by the black autoformatter. Try running

```shell
black pandas/io/s3.py --diff
```

to see the changes (and leave out the --diff flag to apply them).

@TomAugspurger
Contributor

Is S3_ENDPOINT a common variable in the S3 ecosystem? Should s3fs just be using it as the default value for endpoint_url if it's not otherwise specified?

Ideally, pandas has as little s3-specific code as possible. We'd like to delegate it to other libraries.

@xieqihui
Author

Hi @TomAugspurger. In this commit, if S3_ENDPOINT is not specified, s3fs will use the default value for endpoint_url, i.e. None, as documented in the Boto3 client API. In that case, s3fs connects to AWS S3 by default. Therefore, people need minimal effort to use S3 with pandas: those who use AWS S3 simply use it as before, while those who run their own S3 servers define the S3_ENDPOINT environment variable.

S3_ENDPOINT is a common variable to set when people use their own S3-compatible object storage servers instead of AWS S3. This scenario is becoming more and more common as solutions like MinIO and Ceph gain popularity. One example is that TensorFlow supports S3 via the S3_ENDPOINT environment variable. Pandas used to support this option via AWS_S3_HOST before switching from boto to s3fs in this commit, as discussed in #26195.

```python
# the format: http(s)://{host}:{port}. If S3_ENDPOINT is undefined, it will
# fall back to the default AWS S3 endpoint as determined by boto3.
s3_endpoint = os.environ.get("S3_ENDPOINT")
client_kwargs: Optional[Dict[str, str]] = (
    {"endpoint_url": s3_endpoint} if s3_endpoint else None
)
```
Contributor


Wouldn't this be better as a feature directly in s3fs?

Author


Well, here comes an awkward situation: from the point of view of s3fs and boto3, they already support a customised endpoint by accepting the endpoint_url parameter via client_kwargs, so there is no point in adding environment variable support for this parameter there, as discussed in this boto3 issue.

The problem with pandas now is that it doesn't provide a way to pass this parameter through to s3fs. I think adding support via this environment variable will benefit many people using their own S3 servers.
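For comparison, a minimal sketch of what s3fs users can already do today by passing the parameter directly (the localhost endpoint is a placeholder; the s3fs calls are shown commented out since they require the library and a running server):

```python
# client_kwargs is forwarded by s3fs to boto3's client constructor.
client_kwargs = {"endpoint_url": "http://localhost:9000"}  # placeholder endpoint

# With s3fs installed and an S3-compatible server running, this is the
# path that pandas currently has no way to trigger:
#   import s3fs
#   fs = s3fs.S3FileSystem(client_kwargs=client_kwargs)
#   with fs.open("bucket/key.csv") as f:
#       ...
print(client_kwargs["endpoint_url"])
```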

Contributor


Why wouldn't using the environment variable as the default endpoint_url go in s3fs, to benefit even more people?

@jreback jreback added the IO Data IO issues that don't fit into a more specific label label Oct 18, 2019
@TomAugspurger
Contributor

Agreed with Jeff. I would prefer to see this in s3fs.

@jreback
Contributor

jreback commented Oct 22, 2019

Closing as out of scope for pandas; encouraged to submit to s3fs.

@martindurant
Contributor

Would appreciate it if people on this thread would comment in fsspec/s3fs#273 as to how they would expect users to codify such arguments. I suppose environment variables are certainly an option, but passing arguments would be preferable. The linked issue talks about embedding these arguments in the URL itself.
