Problem description

Pandas can't open an object from S3 if its URL contains a `#` character, whether or not the URL path is percent encoded. The reason is that `urllib.parse.urlparse()`, which io/s3.py uses to parse the URL, treats the `#` as the start of the URL fragment, so everything after it is dropped (in the non-percent-encoded case).
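For illustration, this is what `urlparse()` does with a made-up key containing a `#` (the bucket and key names below are hypothetical):

```python
from urllib.parse import urlparse

# Hypothetical S3 object whose key contains a '#' character
url = "s3://mybucket/data/file#1.csv"

parsed = urlparse(url)
print(parsed.path)      # '/data/file'  -- everything after '#' is lost
print(parsed.fragment)  # '1.csv'       -- '#' starts the URL fragment
```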
I see two possible solutions, but I'm not sure which is best, since there doesn't seem to be a specification for the S3 URL scheme (at least none that I can find). Both are sketched after this list:

1. Pass `allow_fragments=False` when calling `urllib.parse.urlparse()`. This makes the non-percent-encoded case work, but feels slightly wrong.
2. Call `urllib.parse.unquote()` on S3 paths before passing them to s3fs. s3fs seems to want just a bucket/key as input, so pandas would have to remove the percent encoding. This makes the percent-encoded case work and seems a bit more correct, but it might change existing behavior for users loading URLs that contain literal `%` characters.
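As a rough sketch of how the two options would behave, with the same made-up URL and only the parsing step shown in isolation (this is not pandas code):

```python
from urllib.parse import urlparse, unquote

raw_url     = "s3://mybucket/data/file#1.csv"    # not percent encoded
encoded_url = "s3://mybucket/data/file%231.csv"  # percent encoded

# Option 1: keep the '#' in the path by disabling fragment parsing
parsed = urlparse(raw_url, allow_fragments=False)
print(parsed.netloc + parsed.path)            # 'mybucket/data/file#1.csv'

# Option 2: decode percent escapes before handing bucket/key to s3fs
parsed = urlparse(encoded_url)
print(unquote(parsed.netloc + parsed.path))   # 'mybucket/data/file#1.csv'
```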
Based on the analysis by @swt2c, I think option #1 is the way to go (feel free to correct me if I'm wrong).
I can start working on this issue and submit a PR with the change soon, if that's OK.