Skip to content

Unable to open an S3 object with # in the URL #25945

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
swt2c opened this issue Apr 1, 2019 · 2 comments · Fixed by #25992
Closed

Unable to open an S3 object with # in the URL #25945

swt2c opened this issue Apr 1, 2019 · 2 comments · Fixed by #25992
Labels
Bug IO Data IO issues that don't fit into a more specific label
Milestone

Comments

@swt2c
Copy link
Contributor

swt2c commented Apr 1, 2019

import pandas as pd
df = pd.read_csv('s3://bucket/key#1.csv')
df = pd.read_csv('s3://bucket/key%231.csv')

Problem description

Pandas can't open an object from S3 if it has a # sign in the URL, both in the case where the URL path is percent encoded and not. The reason is that urllib.parse.urlparse(), which is used in io/s3.py to parse the URL, treats the # sign as the beginning of the URL fragment, and thus it is lost (in the case of not percent encoded).

I see two possible solutions to the problem, but I'm not sure which one is best, since there does not seem to be a 'specification' for the S3 URL scheme (at least that I can find):

  1. Use allow_fragments=False when calling urllib.parse.urlparse(). This would allow the non-percent-encoded case to work, but seems slightly wrong.
  2. Call urllib.parse.unquote() on S3 paths before passing to s3fs. s3fs seems to want just a bucket/key as input, so pandas would have to remove the percent encoding. This would allow the percent-encoded case to work. It seems a bit more correct, but it might change some existing behavior where users could be loading URLs with % characters in them.
@WillAyd WillAyd added the IO Data IO issues that don't fit into a more specific label label Apr 1, 2019
@WillAyd
Copy link
Member

WillAyd commented Apr 1, 2019

Not certain myself - @TomAugspurger may have thoughts but otherwise if you have a proposal in mind feel free to submit a PR

@yackoa
Copy link

yackoa commented Apr 3, 2019

Based on the analysis by @swt2c , I think the option#1 is the way to go (feel free to correct me if I am wrong).
I can start working on this issue & submit a PR with the changed code soon if that's ok.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants