BUG: Fix loading files from S3 with # characters in URL (GH25945) #25992


Merged: 1 commit merged into pandas-dev:master on Apr 9, 2019

Conversation

@swt2c (Contributor) commented Apr 4, 2019

This fixes loading files from URLs such as s3://bucket/key#1.csv. Everything
from the # onward was being lost because it was treated as a URL fragment.
The fix disables URL fragment parsing, since fragments don't make sense in S3 URLs.
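
For context, the approach is essentially to parse the URL with fragment
parsing turned off. A minimal sketch of the idea (the actual change lives in
pandas/io/s3.py; the exact merged code may differ):

from urllib.parse import urlparse

def _strip_schema(url):
    # Return the S3 URL without the s3:// prefix.
    # allow_fragments=False keeps everything after '#' as part of the
    # key instead of splitting it off as a URL fragment.
    result = urlparse(url, allow_fragments=False)
    return result.netloc + result.path

assert _strip_schema("s3://bucket/key#1.csv") == "bucket/key#1.csv"
# With the default allow_fragments=True, the path would be truncated
# to "bucket/key" and "#1.csv" would be lost.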

@gfyoung added the IO (Data IO issues that don't fit into a more specific label) and Bug labels on Apr 4, 2019
@swt2c (Contributor, Author) commented Apr 4, 2019

I believe the CI failures are caused by my added test: importing the _strip_schema function from pandas.io.s3 pulls in s3fs. This was really the only way I could see to test this fix, short of loading something from an actual S3 bucket. Is it okay to add s3fs as a test dependency, and if so, how do I do that?
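
For reference, the test in question boils down to calling the helper directly, roughly (a sketch):

from pandas.io.s3 import _strip_schema  # importing pandas.io.s3 requires s3fs

def test_parse_s3_url_with_pound_sign():
    # everything after '#' must survive as part of the key
    assert _strip_schema("s3://bucket/key#1.csv") == "bucket/key#1.csv"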

@@ -356,6 +356,7 @@ I/O
- Bug in :func:`read_hdf` not properly closing store after a ``KeyError`` is raised (:issue:`25766`)
- Bug in ``read_csv`` which would not raise ``ValueError`` if a column index in ``usecols`` was out of bounds (:issue:`25623`)
- Improved :meth:`pandas.read_stata` and :class:`pandas.io.stata.StataReader` to read incorrectly formatted 118 format files saved by Stata (:issue:`25960`)
- Fixed bug in loading objects from S3 that contain # characters in the URL (:issue:`25945`)
Review comment (Contributor):

use double backticks on the #. You can say this is in read_* routines such as :func:`read_csv`.

Reply (Contributor, Author):

Fixed.
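
The entry presumably ended up along these lines (a reconstruction, not the verbatim merged text):

- Fixed bug in ``read_*`` routines such as :func:`read_csv` loading objects from S3 that contain ``#`` characters in the URL (:issue:`25945`)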

@@ -27,3 +28,8 @@ def test_streaming_s3_objects():
    for el in data:
        body = StreamingBody(BytesIO(el), content_length=len(el))
        read_csv(body)


def test_parse_s3_url_with_pound_sign():
Review comment (Contributor):

this is not a useful test, you need to mock the routines as above

Reply (Contributor, Author):

Thanks for the review. I have to disagree with you about this not being a useful test: it exercises the buggy function, fails without the fix, and passes with it. Can you provide any more hints about how I would mock this as you suggested? I can't see how to do it.

Review comment (Contributor):

look at any tests in pandas/tests/io/parser/test_network.py, you need to use the s3_resource fixture.

Reply (Contributor, Author):

Thanks, I believe I figured it out now. :) moto seems very nice, I hadn't heard of it before.
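
For anyone else unfamiliar with moto: it fakes AWS services in-process, so the test never touches a real bucket. A minimal self-contained sketch (hypothetical bucket and key names, independent of the pandas fixture setup; uses moto's mock_s3 decorator as it existed at the time, and s3fs must be installed for read_csv to handle s3:// URLs):

import boto3
import moto
import pandas as pd

@moto.mock_s3  # boto3/s3fs traffic inside is served by an in-process fake S3
def test_read_csv_with_hash_in_key():
    conn = boto3.resource("s3", region_name="us-east-1")
    conn.create_bucket(Bucket="pandas-test")
    conn.Bucket("pandas-test").put_object(Key="tips#1.csv",
                                          Body=b"a,b\n1,2\n3,4\n")
    result = pd.read_csv("s3://pandas-test/tips#1.csv")
    assert result.shape == (2, 2)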


codecov bot commented Apr 8, 2019

Codecov Report

Merging #25992 into master will decrease coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #25992      +/-   ##
==========================================
- Coverage   91.84%   91.83%   -0.01%     
==========================================
  Files         175      175              
  Lines       52517    52517              
==========================================
- Hits        48232    48228       -4     
- Misses       4285     4289       +4
Flag        Coverage   Δ
#multiple   90.39%     <100%>   (ø) ⬆️
#single     40.72%     <0%>     (-0.14%) ⬇️

Impacted Files         Coverage   Δ
pandas/io/s3.py        89.47%     <100%>   (ø) ⬆️
pandas/io/gbq.py       75%        <0%>     (-12.5%) ⬇️
pandas/core/frame.py   96.79%     <0%>     (-0.12%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update af0ecbe...0007503.

@jreback (Contributor) left a review:

lgtm. small comments, ping on green.

@@ -198,3 +198,9 @@ def test_read_csv_chunked_download(self, s3_resource, caplog):
        read_csv("s3://pandas-test/large-file.csv", nrows=5)
        # log of fetch_range (start, stop)
        assert ((0, 5505024) in {x.args[-2:] for x in caplog.records})

    def test_read_s3_with_hash_in_key(self, tips_df):
        df = read_csv('s3://pandas-test/tips#1.csv')
Review comment (Contributor):

can you call this result =


def test_read_s3_with_hash_in_key(self, tips_df):
    df = read_csv('s3://pandas-test/tips#1.csv')
    assert isinstance(df, DataFrame)
Review comment (Contributor):

you don't need either of these asserts, the assert_frame_equal covers this

Review comment (Contributor), on the same diff lines as above:

can you add the issue number as a comment here

@jreback added this to the 0.25.0 milestone on Apr 9, 2019
@swt2c (Contributor, Author) commented Apr 9, 2019

Comments addressed.
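
Putting the three review comments together, the test presumably ended up roughly like this (a reconstruction, not the verbatim merged code; tm is pandas.util.testing):

def test_read_s3_with_hash_in_key(self, tips_df):
    # GH 25945
    result = read_csv("s3://pandas-test/tips#1.csv")
    tm.assert_frame_equal(tips_df, result)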

@jreback (Contributor) commented Apr 9, 2019

great. ping on green.

@WillAyd merged commit 2f6b90a into pandas-dev:master on Apr 9, 2019
@WillAyd (Member) commented Apr 9, 2019

Thanks @swt2c

Labels: Bug, IO (Data IO issues that don't fit into a more specific label)

Successfully merging this pull request may close these issues:
- Unable to open an S3 object with # in the URL

4 participants