
ENH Enable streaming from S3 #11073

Merged
1 commit merged into pandas-dev:master on Sep 14, 2015

Conversation

stephen-hoover
Contributor

File reading from AWS S3: Modify the get_filepath_or_buffer function such that it only opens the connection to S3, rather than reading the entire file at once. This allows partial reads (e.g. through the nrows argument) or chunked reading (e.g. through the chunksize argument) without needing to download the entire file first.

I wasn't sure what the best place was to put the OnceThroughKey. (Suggestions for better names welcome.) I don't like putting an entire class inside a function like that, but this keeps the boto dependency contained.

Adding a readline function, and modifying next so that it returns lines, was necessary to allow the Python engine to read uncompressed CSVs.

The Python 2 standard library's gzip module needs seek and tell methods on its inputs, so I reverted to the old behavior there.

Partially addresses #11070 .
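
For illustration only, here is a minimal sketch of the idea -- not the actual OnceThroughKey in this PR, which subclasses boto's Key. It wraps any object exposing a read(n) method (such as an open S3 key) and buffers bytes so that readline and next can hand lines to the Python parser. The class name, buffer size, and buffering details are assumptions.

class StreamingKeyWrapper(object):
    """Sketch only: buffer a streaming S3 key so it can be read line by line."""

    def __init__(self, raw, chunksize=64 * 1024):
        self.raw = raw              # underlying object with a .read(n) method
        self.chunksize = chunksize
        self.buffer = b""
        self.finished_read = False  # flag marking the end of the stream

    def read(self, size=-1):
        # Read ``size`` bytes, or everything remaining if size < 0.
        if size is None or size < 0:
            data = self.buffer + self.raw.read()
            self.buffer = b""
            self.finished_read = True
            return data
        while len(self.buffer) < size and not self.finished_read:
            chunk = self.raw.read(self.chunksize)
            if not chunk:
                self.finished_read = True
                break
            self.buffer += chunk
        data, self.buffer = self.buffer[:size], self.buffer[size:]
        return data

    def readline(self):
        # Pull chunks until the buffer contains a newline or the stream ends.
        while b"\n" not in self.buffer and not self.finished_read:
            chunk = self.raw.read(self.chunksize)
            if not chunk:
                self.finished_read = True
                break
            self.buffer += chunk
        line, sep, self.buffer = self.buffer.partition(b"\n")
        return line + sep

    def __iter__(self):
        return self

    def __next__(self):             # Python 3 iteration protocol
        line = self.readline()
        if not line:
            raise StopIteration
        return line

    next = __next__                 # Python 2 iteration protocol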

import nose.tools as nt
df = pd.read_csv('s3://nyqpug/tips.csv', nrows=10)
self.assertTrue(isinstance(df, pd.DataFrame))
self.assertFalse(df.empty)
Contributor

don't need the import of nosetools

@stephen-hoover
Contributor Author

New commit addresses your comments. I'll squash once the review is complete.

def __init__(self, *args, **kwargs):
encoding = kwargs.pop("encoding", None) # Python 2 compat
super(OnceThroughKey, self).__init__(*args, **kwargs)
self.finished_read = False # Add a flag to mark the end of the read.
Contributor

wrong class name in 'init'

Contributor Author

Oops. Fixed.

@jreback
Contributor

jreback commented Sep 12, 2015

need to change class name references

add a test using chunk size

@stephen-hoover
Contributor Author

Will do. BTW -- reading with nrows or chunksize worked before, too. The difference now is that we can do it without having to download the entire file. I don't know how to write a unit test for that -- I tested it myself by looking at whether it's 100 ms or 20 s to read part of a large file. In this case, I think it's okay just to check for functionality. The way the code is written, the only way we'd lose partial reads without breaking tests is for downstream code to suck in the entire file before feeding it to the parser. (This happens right now for bz2 files without the changes in #11072 .)
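
For concreteness, the kinds of reads being discussed look like this. This is just a usage sketch: it assumes boto is installed, network access is available, and the tips.csv file in the nyqpug bucket (already used by the existing test) is still publicly readable.

import pandas as pd

# Partial read: only the first 10 rows are parsed; with this change only a
# small part of the object needs to come over the network.
df = pd.read_csv('s3://nyqpug/tips.csv', nrows=10)

# Chunked read: iterating yields DataFrames of up to 25 rows each.
for chunk in pd.read_csv('s3://nyqpug/tips.csv', chunksize=25):
    print(chunk.shape)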

@jreback
Contributor

jreback commented Sep 12, 2015

In that case, why don't you make an asv benchmark and read, say, 100 lines or something? Then we would catch a perf regression if this was changed. (Don't use a really long file -- it should take maybe 1 s to download the entire thing.)

@stephen-hoover
Contributor Author

I think it would need about 1 million rows, maybe more, to reliably separate a performance regression from variance in network performance. I'd also need a public S3 bucket which could host such a thing, which I don't have. Do you know what the "s3://nyqpug" bucket is? For the unit tests, it would be helpful to be able to host gzip and bzip2 files there as well.

@jreback
Contributor

jreback commented Sep 12, 2015

I have a pandas bucket where I can put stuff.
Push the files up to your branch and give me a pointer, and I will put them up.

@stephen-hoover
Contributor Author

Thanks! I'll ping you when that's done.

@stephen-hoover
Contributor Author

Test using chunksize added.

@stephen-hoover
Contributor Author

@jreback , I created compressed versions of the "tips.csv" table for unit tests and large tables of random data for performance regression tests. All are in the commit at stephen-hoover@36b5d3a .

@jreback
Contributor

jreback commented Sep 12, 2015

All uploaded here (it's a public bucket): https://s3.amazonaws.com/pandas-test/

Though let's use the big ones only in the perf tests and/or slow tests.

@stephen-hoover
Contributor Author

Thanks! Could you put the file from "s3://nyqpug/tips.csv" in that bucket as well? The tests will be simpler if everything's in the same place.

@jreback
Contributor

jreback commented Sep 12, 2015

done

@jreback added the IO Data and IO CSV labels on Sep 12, 2015
@stephen-hoover
Contributor Author

Are the files set publicly readable? I can see the contents of the bucket, but when I try to read a file, I get a "Forbidden" error. That happens with the aws CLI as well as with read_csv.

@jreback
Contributor

jreback commented Sep 12, 2015

Should be fixed now. Annoying that I had to do that individually for each file :<
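
As an aside, a small boto loop like the one below could flip every key in the bucket to public-read in one pass instead of per file in the console. This is just a sketch under the assumption that the credentials in use have write access to the pandas-test bucket; it is not code from this PR.

import boto

# Sketch: mark every object in the pandas-test bucket publicly readable.
conn = boto.connect_s3()
bucket = conn.get_bucket('pandas-test')
for key in bucket.list():
    key.make_public()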

@stephen-hoover
Contributor Author

Thanks. The S3 tests are more extensive now. I'll start working on performance tests, but I won't be able to have those until tomorrow afternoon.

I'll want to either rebase this PR on a merged #11072 or vice-versa. After that PR merges, the (Python 3) C parser will be able to read bz2 files from S3, and I can change the new tests to reflect that.

@stephen-hoover force-pushed the stream-csv-from-s3 branch 2 times, most recently from af4f764 to d05c118, on September 13, 2015 at 18:11
@stephen-hoover
Contributor Author

With #11072 merged in, I've updated the tests to reflect the fact that the Python 3 C parser can now read bz2 files from S3. I've also added a set of benchmarks for reading from S3 in the asv_bench/benchmarks/io_bench.py file. The benchmarks read 10 rows of the 100,000 row x 50 column table of random noise in the pandas-test bucket. On the current (9ef8534) master branch, benchmarks are

Python 2:
               ============= ======== ========
               --                  engine     
               ------------- -----------------
                compression   python     c    
               ============= ======== ========
                    None      27.90s   27.46s 
                    gzip      13.20s   13.13s 
                    bz2       20.43s    n/a   
               ============= ======== ========

Python 3:
               ============= ======== ========
               --                  engine     
               ------------- -----------------
                compression   python     c    
               ============= ======== ========
                    None      27.16s   27.34s 
                    gzip      14.42s   13.82s 
                    bz2       11.69s   11.77s 
               ============= ======== ========

and on this PR, the benchmarks are

Python 2:
               ============= ========== ==========
               --                    engine       
               ------------- ---------------------
                compression    python       c     
               ============= ========== ==========
                    None      208.45ms   486.83ms 
                    gzip       12.79s     13.10s  
                    bz2        20.39s      n/a    
               ============= ========== ==========

Python 3:
               ============= ========== ==========
               --                    engine       
               ------------- ---------------------
                compression    python       c     
               ============= ========== ==========
                    None      207.24ms   484.90ms 
                    gzip      249.83ms   363.71ms 
                    bz2       608.92ms   592.51ms 
               ============= ========== ==========

Note that in the > 1 s benchmarks, all of the time is taken in downloading the entire file from S3. Times will vary with network speed. It should be obvious if a future change forces read_csv to ingest the entire file from S3 again.

ASV is a really keen tool. I'm happy I found it.
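
The benchmark code itself isn't quoted in this thread. A rough sketch of what one of these asv benchmarks could look like is below; the class, method, and file names are illustrative assumptions, not the actual contents of asv_bench/benchmarks/io_bench.py.

import pandas as pd


class S3ReadCSV(object):
    # asv times any method whose name starts with ``time_``. Reading only 10
    # rows of a large CSV on S3 should take well under a second with streaming
    # reads; if a future change forces the whole file to download again, the
    # benchmark will show a large regression.
    # 'large_random.csv' is a placeholder for the random-noise table's name.

    def time_read_nrows_python_engine(self):
        pd.read_csv('s3://pandas-test/large_random.csv', nrows=10,
                    engine='python')

    def time_read_nrows_c_engine(self):
        pd.read_csv('s3://pandas-test/large_random.csv', nrows=10,
                    engine='c')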

@jreback
Contributor

jreback commented Sep 13, 2015

ok, this seems reasonable. pls add a note in whatsnew. ping when all green.

@jreback jreback added this to the 0.17.0 milestone Sep 13, 2015
@stephen-hoover
Contributor Author

@jreback , green!

@jreback
Contributor

jreback commented Sep 14, 2015

can you rebase?

File reading from AWS S3: Modify the `get_filepath_or_buffer` function such that it only opens the connection to S3, rather than reading the entire file at once. This allows partial reads (e.g. through the `nrows` argument) or chunked reading (e.g. through the `chunksize` argument) without needing to download the entire file first.

Include 6 asv benchmarks for reading CSVs from S3: one for each combination of compression type and parser type.
jreback added a commit that referenced this pull request Sep 14, 2015
@jreback jreback merged commit bf0a15d into pandas-dev:master Sep 14, 2015
@jreback
Contributor

jreback commented Sep 14, 2015

thank you sir!

@stephen-hoover stephen-hoover deleted the stream-csv-from-s3 branch September 14, 2015 22:32