BUG: Fix loading files from S3 with # characters in URL (GH25945) #25992


Merged: 1 commit merged into pandas-dev:master on Apr 9, 2019

Conversation

@swt2c (Contributor) commented Apr 4, 2019

This fixes loading files from URLs such as s3://bucket/key#1.csv. Everything
from the # onward was being lost because it was treated as a URL fragment.
The fix disables URL fragment parsing, since fragments don't make sense in S3 URLs.
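
For context, the approach is essentially to parse the URL with fragment
parsing turned off. A minimal sketch of the idea (the actual change lives in
pandas/io/s3.py; the exact merged code may differ):

from urllib.parse import urlparse

def _strip_schema(url):
    # Return the S3 URL without the s3:// prefix.
    # allow_fragments=False keeps everything after '#' as part of the
    # key instead of splitting it off as a URL fragment.
    result = urlparse(url, allow_fragments=False)
    return result.netloc + result.path

assert _strip_schema("s3://bucket/key#1.csv") == "bucket/key#1.csv"
# With the default allow_fragments=True, the path would be truncated
# to "bucket/key" and "#1.csv" would be lost.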

@gfyoung added the IO (Data IO issues that don't fit into a more specific label) and Bug labels on Apr 4, 2019
@swt2c (Contributor, Author) commented Apr 4, 2019

I believe the CI failures are caused by my added test: importing the _strip_schema function from pandas.io.s3 pulls in s3fs. This was really the only way I could see to test this fix, short of loading something from an actual S3 bucket. Is it okay to add s3fs as a test dependency, and if so, how do I do that?
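
For reference, the test in question boils down to calling the helper directly, roughly (a sketch):

from pandas.io.s3 import _strip_schema  # importing pandas.io.s3 requires s3fs

def test_parse_s3_url_with_pound_sign():
    # everything after '#' must survive as part of the key
    assert _strip_schema("s3://bucket/key#1.csv") == "bucket/key#1.csv"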

@@ -356,6 +356,7 @@ I/O
- Bug in :func:`read_hdf` not properly closing store after a ``KeyError`` is raised (:issue:`25766`)
- Bug in ``read_csv`` which would not raise ``ValueError`` if a column index in ``usecols`` was out of bounds (:issue:`25623`)
- Improved :meth:`pandas.read_stata` and :class:`pandas.io.stata.StataReader` to read incorrectly formatted 118 format files saved by Stata (:issue:`25960`)
- Fixed bug in loading objects from S3 that contain # characters in the URL (:issue:`25945`)
Review comment (Contributor):

use double backticks on the #. You can say this is in read_* routines such as :func:`read_csv`.

Reply (Contributor, Author):

Fixed.
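
The entry presumably ended up along these lines (a reconstruction, not the verbatim merged text):

- Fixed bug in ``read_*`` routines such as :func:`read_csv` loading objects from S3 that contain ``#`` characters in the URL (:issue:`25945`)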

@@ -27,3 +28,8 @@ def test_streaming_s3_objects():
    for el in data:
        body = StreamingBody(BytesIO(el), content_length=len(el))
        read_csv(body)


def test_parse_s3_url_with_pound_sign():
Review comment (Contributor):

this is not a useful test, you need to mock the routines as above

Reply (Contributor, Author):

Thanks for the review. I have to disagree with you about this not being a useful test: it exercises the buggy function, fails without the fix, and passes with it. Can you provide any more hints about how I would mock this as you suggested? I can't see how to do it.

Review comment (Contributor):

look at any tests in pandas/tests/io/parser/test_network.py, you need to use the s3_resource fixture.

Reply (Contributor, Author):

Thanks, I believe I figured it out now. :) moto seems very nice, I hadn't heard of it before.
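
For anyone else unfamiliar with moto: it fakes AWS services in-process, so the test never touches a real bucket. A minimal self-contained sketch (hypothetical bucket and key names, independent of the pandas fixture setup; uses moto's mock_s3 decorator as it existed at the time, and s3fs must be installed for read_csv to handle s3:// URLs):

import boto3
import moto
import pandas as pd

@moto.mock_s3  # boto3/s3fs traffic inside is served by an in-process fake S3
def test_read_csv_with_hash_in_key():
    conn = boto3.resource("s3", region_name="us-east-1")
    conn.create_bucket(Bucket="pandas-test")
    conn.Bucket("pandas-test").put_object(Key="tips#1.csv",
                                          Body=b"a,b\n1,2\n3,4\n")
    result = pd.read_csv("s3://pandas-test/tips#1.csv")
    assert result.shape == (2, 2)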


codecov bot commented Apr 8, 2019

Codecov Report

Merging #25992 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #25992      +/-   ##
==========================================
- Coverage   91.84%   91.83%   -0.01%     
==========================================
  Files         175      175              
  Lines       52517    52517              
==========================================
- Hits        48232    48228       -4     
- Misses       4285     4289       +4
Flag        Coverage   Δ
#multiple   90.39%     <100%>   (ø) ⬆️
#single     40.72%     <0%>     (-0.14%) ⬇️

Impacted Files         Coverage   Δ
pandas/io/s3.py        89.47%     <100%>   (ø) ⬆️
pandas/io/gbq.py       75%        <0%>     (-12.5%) ⬇️
pandas/core/frame.py   96.79%     <0%>     (-0.12%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update af0ecbe...0007503.

@jreback (Contributor) left a review:

lgtm. small comments, ping on green.

@@ -198,3 +198,9 @@ def test_read_csv_chunked_download(self, s3_resource, caplog):
        read_csv("s3://pandas-test/large-file.csv", nrows=5)
        # log of fetch_range (start, stop)
        assert ((0, 5505024) in {x.args[-2:] for x in caplog.records})

    def test_read_s3_with_hash_in_key(self, tips_df):
        df = read_csv('s3://pandas-test/tips#1.csv')
Review comment (Contributor):

can you call this result =


def test_read_s3_with_hash_in_key(self, tips_df):
    df = read_csv('s3://pandas-test/tips#1.csv')
    assert isinstance(df, DataFrame)
Review comment (Contributor):

you don't need either of these asserts, the assert_frame_equal covers this

Review comment (Contributor), on the same diff lines as above:

can you add the issue number as a comment here

@jreback added this to the 0.25.0 milestone on Apr 9, 2019
@swt2c (Contributor, Author) commented Apr 9, 2019

Comments addressed.
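
Putting the three review comments together, the test presumably ended up roughly like this (a reconstruction, not the verbatim merged code; tm is pandas.util.testing):

def test_read_s3_with_hash_in_key(self, tips_df):
    # GH 25945
    result = read_csv("s3://pandas-test/tips#1.csv")
    tm.assert_frame_equal(tips_df, result)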

@jreback (Contributor) commented Apr 9, 2019

great. ping on green.

@WillAyd merged commit 2f6b90a into pandas-dev:master on Apr 9, 2019
@WillAyd (Member) commented Apr 9, 2019

Thanks @swt2c

Labels: Bug, IO (Data IO issues that don't fit into a more specific label)

Successfully merging this pull request may close these issues:
- Unable to open an S3 object with # in the URL

4 participants