BUG: Fix loading files from S3 with # characters in URL (GH25945) #25992
Conversation
I believe the CI failures are due to my added test, which is causing s3fs to be imported (via the module the test imports).
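In pandas' test suite the usual guard for an optional dependency is to skip at collection time; a minimal sketch of the pattern (this is illustrative, not the exact change in this PR):

    import pytest

    # Skip this test module entirely when s3fs is not installed,
    # rather than failing with an ImportError during collection.
    s3fs = pytest.importorskip("s3fs")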
doc/source/whatsnew/v0.25.0.rst (Outdated)

@@ -356,6 +356,7 @@ I/O
- Bug in :func:`read_hdf` not properly closing store after a ``KeyError`` is raised (:issue:`25766`)
- Bug in ``read_csv`` which would not raise ``ValueError`` if a column index in ``usecols`` was out of bounds (:issue:`25623`)
- Improved :meth:`pandas.read_stata` and :class:`pandas.io.stata.StataReader` to read incorrectly formatted 118 format files saved by Stata (:issue:`25960`)
- Fixed bug in loading objects from S3 that contain # characters in the URL (:issue:`25945`)
use double-backticks on the ``#``. You can say this is in ``read_*`` routines such as :func:`read_csv`.
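Applied to the entry above, that suggestion would presumably read something like this (illustrative wording, not necessarily the final entry):

    - Fixed bug in ``read_*`` routines such as :func:`read_csv` when loading objects from S3 URLs containing ``#`` characters (:issue:`25945`)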
Fixed.
pandas/tests/io/test_s3.py (Outdated)

@@ -27,3 +28,8 @@ def test_streaming_s3_objects():
    for el in data:
        body = StreamingBody(BytesIO(el), content_length=len(el))
        read_csv(body)


def test_parse_s3_url_with_pound_sign():
this is not a useful test, you need to mock the routines as above
Thanks for the review. I have to disagree with you about this not being a useful test. It tests the buggy function, fails without the fix and passes with the fix. Can you provide any more hints about how I would mock this as you suggested? I can't see how to do it.
look at any tests in pandas/tests/io/parser/test_network.py; you need to use the s3_resource fixture.
Thanks, I believe I figured it out now. :) moto seems very nice; I hadn't heard of it before.
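For reference, the moto pattern looks roughly like this. A minimal sketch, assuming moto's classic mock_s3 decorator (newer moto releases expose mock_aws instead) and that s3fs is installed so pandas can read s3:// URLs; the bucket, key, and data are illustrative, and pandas' own suite wires this up through the s3_resource fixture instead:

    import boto3
    import pandas as pd
    from moto import mock_s3  # moto < 5.0; later versions use mock_aws

    @mock_s3
    def test_read_csv_from_mocked_s3():
        # moto intercepts boto3 calls, so this bucket exists only in memory.
        conn = boto3.resource("s3", region_name="us-east-1")
        conn.create_bucket(Bucket="pandas-test")
        conn.Object("pandas-test", "tips#1.csv").put(Body=b"a,b\n1,2\n")

        result = pd.read_csv("s3://pandas-test/tips#1.csv")
        assert result.shape == (1, 2)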
Codecov Report

@@            Coverage Diff             @@
##           master   #25992      +/-   ##
==========================================
- Coverage   91.82%   91.82%   -0.01%
==========================================
  Files         175      175
  Lines       52539    52539
==========================================
- Hits        48246    48242       -4
- Misses       4293     4297       +4

Continue to review full report at Codecov.
Codecov Report

@@            Coverage Diff             @@
##           master   #25992      +/-   ##
==========================================
- Coverage   91.84%   91.83%   -0.01%
==========================================
  Files         175      175
  Lines       52517    52517
==========================================
- Hits        48232    48228       -4
- Misses       4285     4289       +4

Continue to review full report at Codecov.
lgtm. small comments, ping on green.
@@ -198,3 +198,9 @@ def test_read_csv_chunked_download(self, s3_resource, caplog):
        read_csv("s3://pandas-test/large-file.csv", nrows=5)
        # log of fetch_range (start, stop)
        assert ((0, 5505024) in {x.args[-2:] for x in caplog.records})

    def test_read_s3_with_hash_in_key(self, tips_df):
        df = read_csv('s3://pandas-test/tips#1.csv')
can you call this ``result =``
    def test_read_s3_with_hash_in_key(self, tips_df):
        df = read_csv('s3://pandas-test/tips#1.csv')
        assert isinstance(df, DataFrame)
you don't need either of these asserts, the assert_frame_equal covers this
@@ -198,3 +198,9 @@ def test_read_csv_chunked_download(self, s3_resource, caplog):
        read_csv("s3://pandas-test/large-file.csv", nrows=5)
        # log of fetch_range (start, stop)
        assert ((0, 5505024) in {x.args[-2:] for x in caplog.records})

    def test_read_s3_with_hash_in_key(self, tips_df):
        df = read_csv('s3://pandas-test/tips#1.csv')
can you add the issue number as a comment here
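Putting the three review comments together (name the variable result, drop the redundant asserts in favour of assert_frame_equal, and reference the issue), the test would presumably end up roughly like this sketch; the class name is hypothetical, and tips_df and s3_resource are the existing fixtures in test_network.py:

    from pandas import read_csv
    import pandas.util.testing as tm  # renamed to pandas._testing in later versions

    class TestReadCSVS3:
        def test_read_s3_with_hash_in_key(self, tips_df, s3_resource):
            # GH 25945
            result = read_csv("s3://pandas-test/tips#1.csv")
            tm.assert_frame_equal(result, tips_df)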
This fixes loading files with URLs such as s3://bucket/key#1.csv. The part from the # on was being lost because it was considered to be a URL fragment. The fix disables URL fragment parsing as it doesn't make sense for S3 URLs.
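The underlying behaviour is easy to reproduce with the standard library, since urlparse treats everything after a # as a fragment unless fragment parsing is disabled. A minimal sketch of the idea (the helper name is illustrative, not the exact pandas code):

    from urllib.parse import urlparse

    # Default parsing loses the key suffix: everything after "#" is a fragment.
    urlparse("s3://pandas-test/tips#1.csv")
    # ParseResult(scheme='s3', netloc='pandas-test', path='/tips',
    #             params='', query='', fragment='1.csv')

    def strip_s3_schema(url):
        # allow_fragments=False keeps "#" as a literal part of the object key.
        parsed = urlparse(url, allow_fragments=False)
        return parsed.netloc + parsed.path

    strip_s3_schema("s3://pandas-test/tips#1.csv")  # 'pandas-test/tips#1.csv'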
Comments addressed.
great. ping on green.
Thanks @swt2c |