
ENH Enable streaming from S3 #11073

Merged
1 commit merged into pandas-dev:master on Sep 14, 2015

Conversation

stephen-hoover
Contributor

File reading from AWS S3: Modify the get_filepath_or_buffer function such that it only opens the connection to S3, rather than reading the entire file at once. This allows partial reads (e.g. through the nrows argument) or chunked reading (e.g. through the chunksize argument) without needing to download the entire file first.

I wasn't sure what the best place was to put the OnceThroughKey. (Suggestions for better names welcome.) I don't like putting an entire class inside a function like that, but this keeps the boto dependency contained.

Adding a readline function, and modifying next so that it returns lines, was necessary to allow the Python engine to read uncompressed CSVs.

The Python 2 standard library's gzip module needs seek and tell methods on its inputs, so I reverted to the old behavior there.

Partially addresses #11070 .
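
For illustration only, here is a minimal sketch of the idea -- not the actual OnceThroughKey in this PR, which subclasses boto's Key. It wraps any object exposing a read(n) method (such as an open S3 key) and buffers bytes so that readline and next can hand lines to the Python parser. The class name, buffer size, and buffering details are assumptions.

class StreamingKeyWrapper(object):
    """Sketch only: buffer a streaming S3 key so it can be read line by line."""

    def __init__(self, raw, chunksize=64 * 1024):
        self.raw = raw              # underlying object with a .read(n) method
        self.chunksize = chunksize
        self.buffer = b""
        self.finished_read = False  # flag marking the end of the stream

    def read(self, size=-1):
        # Read ``size`` bytes, or everything remaining if size < 0.
        if size is None or size < 0:
            data = self.buffer + self.raw.read()
            self.buffer = b""
            self.finished_read = True
            return data
        while len(self.buffer) < size and not self.finished_read:
            chunk = self.raw.read(self.chunksize)
            if not chunk:
                self.finished_read = True
                break
            self.buffer += chunk
        data, self.buffer = self.buffer[:size], self.buffer[size:]
        return data

    def readline(self):
        # Pull chunks until the buffer contains a newline or the stream ends.
        while b"\n" not in self.buffer and not self.finished_read:
            chunk = self.raw.read(self.chunksize)
            if not chunk:
                self.finished_read = True
                break
            self.buffer += chunk
        line, sep, self.buffer = self.buffer.partition(b"\n")
        return line + sep

    def __iter__(self):
        return self

    def __next__(self):             # Python 3 iteration protocol
        line = self.readline()
        if not line:
            raise StopIteration
        return line

    next = __next__                 # Python 2 iteration protocol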

import nose.tools as nt
df = pd.read_csv('s3://nyqpug/tips.csv', nrows=10)
self.assertTrue(isinstance(df, pd.DataFrame))
self.assertFalse(df.empty)
Contributor

don't need the import of nosetools

@stephen-hoover
Contributor Author

New commit addresses your comments. I'll squash once the review is complete.

def __init__(self, *args, **kwargs):
encoding = kwargs.pop("encoding", None) # Python 2 compat
super(OnceThroughKey, self).__init__(*args, **kwargs)
self.finished_read = False # Add a flag to mark the end of the read.
Contributor

wrong class name in 'init'

Contributor Author

Oops. Fixed.

@jreback
Contributor

jreback commented Sep 12, 2015

need to change class name references

add a test using chunk size

@stephen-hoover
Contributor Author

Will do. BTW -- reading with nrows or chunksize worked before, too. The difference now is that we can do it without having to download the entire file. I don't know how to write a unit test for that -- I tested it myself by looking at whether it's 100 ms or 20 s to read part of a large file. In this case, I think it's okay just to check for functionality. The way the code is written, the only way we'd lose partial reads without breaking tests is for downstream code to suck in the entire file before feeding it to the parser. (This happens right now for bz2 files without the changes in #11072 .)
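
For concreteness, the kinds of reads being discussed look like this. This is just a usage sketch: it assumes boto is installed, network access is available, and the tips.csv file in the nyqpug bucket (already used by the existing test) is still publicly readable.

import pandas as pd

# Partial read: only the first 10 rows are parsed; with this change only a
# small part of the object needs to come over the network.
df = pd.read_csv('s3://nyqpug/tips.csv', nrows=10)

# Chunked read: iterating yields DataFrames of up to 25 rows each.
for chunk in pd.read_csv('s3://nyqpug/tips.csv', chunksize=25):
    print(chunk.shape)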

@jreback
Contributor

jreback commented Sep 12, 2015

In that case, why don't you make an asv benchmark and read, say, 100 lines or something? Then we would catch a perf regression if this was changed. (Don't use a really long file -- it should take maybe 1 s to download the entire thing.)

@stephen-hoover
Contributor Author

I think it would need about 1 million rows, maybe more, to reliably separate a performance regression from variance in network performance. I'd also need a public S3 bucket which could host such a thing, which I don't have. Do you know what the "s3://nyqpug" bucket is? For the unit tests, it would be helpful to be able to host gzip and bzip2 files there as well.

@jreback
Contributor

jreback commented Sep 12, 2015

I have a pandas bucket where I can put stuff.
Push the files up to your branch and give me a pointer, and I will put them up.

@stephen-hoover
Contributor Author

Thanks! I'll ping you when that's done.

@stephen-hoover
Contributor Author

Test using chunksize added.

@stephen-hoover
Contributor Author

@jreback , I created compressed versions of the "tips.csv" table for unit tests and large tables of random data for performance regression tests. All are in the commit at stephen-hoover@36b5d3a .

@jreback
Contributor

jreback commented Sep 12, 2015

All uploaded here (it's a public bucket): https://s3.amazonaws.com/pandas-test/

Though let's use the big ones only in the perf tests and/or slow tests.

@stephen-hoover
Contributor Author

Thanks! Could you put the file from "s3://nyqpug/tips.csv" in that bucket as well? The tests will be simpler if everything's in the same place.

@jreback
Contributor

jreback commented Sep 12, 2015

done

@jreback added the IO Data and IO CSV labels on Sep 12, 2015
@stephen-hoover
Contributor Author

Are the files set publicly readable? I can see the contents of the bucket, but when I try to read a file, I get a "Forbidden" error. That happens with the aws CLI as well as with read_csv.

@jreback
Contributor

jreback commented Sep 12, 2015

Should be fixed now. Annoying that I had to do that individually for each file :<
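
As an aside, a small boto loop like the one below could flip every key in the bucket to public-read in one pass instead of per file in the console. This is just a sketch under the assumption that the credentials in use have write access to the pandas-test bucket; it is not code from this PR.

import boto

# Sketch: mark every object in the pandas-test bucket publicly readable.
conn = boto.connect_s3()
bucket = conn.get_bucket('pandas-test')
for key in bucket.list():
    key.make_public()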

@stephen-hoover
Contributor Author

Thanks. The S3 tests are more extensive now. I'll start working on performance tests, but I won't be able to have those until tomorrow afternoon.

I'll want to either rebase this PR on a merged #11072 or vice-versa. After that PR merges, the (Python 3) C parser will be able to read bz2 files from S3, and I can change the new tests to reflect that.

@stephen-hoover force-pushed the stream-csv-from-s3 branch 2 times, most recently from af4f764 to d05c118, on September 13, 2015 at 18:11
@stephen-hoover
Contributor Author

With #11072 merged in, I've updated the tests to reflect the fact that the Python 3 C parser can now read bz2 files from S3. I've also added a set of benchmarks for reading from S3 in the asv_bench/benchmarks/io_bench.py file. The benchmarks read 10 rows of the 100,000 row x 50 column table of random noise in the pandas-test bucket. On the current (9ef8534) master branch, benchmarks are

Python 2:
               ============= ======== ========
               --                  engine     
               ------------- -----------------
                compression   python     c    
               ============= ======== ========
                    None      27.90s   27.46s 
                    gzip      13.20s   13.13s 
                    bz2       20.43s    n/a   
               ============= ======== ========

Python 3:
               ============= ======== ========
               --                  engine     
               ------------- -----------------
                compression   python     c    
               ============= ======== ========
                    None      27.16s   27.34s 
                    gzip      14.42s   13.82s 
                    bz2       11.69s   11.77s 
               ============= ======== ========

and on this PR, the benchmarks are

Python 2:
               ============= ========== ==========
               --                    engine       
               ------------- ---------------------
                compression    python       c     
               ============= ========== ==========
                    None      208.45ms   486.83ms 
                    gzip       12.79s     13.10s  
                    bz2        20.39s      n/a    
               ============= ========== ==========

Python 3:
               ============= ========== ==========
               --                    engine       
               ------------- ---------------------
                compression    python       c     
               ============= ========== ==========
                    None      207.24ms   484.90ms 
                    gzip      249.83ms   363.71ms 
                    bz2       608.92ms   592.51ms 
               ============= ========== ==========

Note that in the > 1 s benchmarks, all of the time is taken in downloading the entire file from S3. Times will vary with network speed. It should be obvious if a future change forces read_csv to ingest the entire file from S3 again.

ASV is a really keen tool. I'm happy I found it.
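
The benchmark code itself isn't quoted in this thread. A rough sketch of what one of these asv benchmarks could look like is below; the class, method, and file names are illustrative assumptions, not the actual contents of asv_bench/benchmarks/io_bench.py.

import pandas as pd


class S3ReadCSV(object):
    # asv times any method whose name starts with ``time_``. Reading only 10
    # rows of a large CSV on S3 should take well under a second with streaming
    # reads; if a future change forces the whole file to download again, the
    # benchmark will show a large regression.
    # 'large_random.csv' is a placeholder for the random-noise table's name.

    def time_read_nrows_python_engine(self):
        pd.read_csv('s3://pandas-test/large_random.csv', nrows=10,
                    engine='python')

    def time_read_nrows_c_engine(self):
        pd.read_csv('s3://pandas-test/large_random.csv', nrows=10,
                    engine='c')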

@jreback
Contributor

jreback commented Sep 13, 2015

ok, this seems reasonable. pls add a note in whatsnew. ping when all green.

@jreback jreback added this to the 0.17.0 milestone Sep 13, 2015
@stephen-hoover
Contributor Author

@jreback , green!

@jreback
Contributor

jreback commented Sep 14, 2015

can you rebase?

File reading from AWS S3: Modify the `get_filepath_or_buffer` function such that it only opens the connection to S3, rather than reading the entire file at once. This allows partial reads (e.g. through the `nrows` argument) or chunked reading (e.g. through the `chunksize` argument) without needing to download the entire file first.

Include 6 asv benchmarks for reading CSVs from S3: one for each combination of compression type and parser type.
jreback added a commit that referenced this pull request Sep 14, 2015
@jreback jreback merged commit bf0a15d into pandas-dev:master Sep 14, 2015
@jreback
Contributor

jreback commented Sep 14, 2015

thank you sir!

@stephen-hoover stephen-hoover deleted the stream-csv-from-s3 branch September 14, 2015 22:32