ENH Enable streaming from S3 #11073
Conversation
import nose.tools as nt
df = pd.read_csv('s3://nyqpug/tips.csv', nrows=10)
self.assertTrue(isinstance(df, pd.DataFrame))
self.assertFalse(df.empty)
don't need the import of nosetools
New commit addresses your comments. I'll squash once the review is complete.
def __init__(self, *args, **kwargs):
    encoding = kwargs.pop("encoding", None)  # Python 2 compat
    super(OnceThroughKey, self).__init__(*args, **kwargs)
    self.finished_read = False  # Add a flag to mark the end of the read.
wrong class name in `__init__`
Oops. Fixed.
need to change class name references; add a test using chunk size
Force-pushed from 8393876 to 293757d.
Will do. BTW -- reading with
in that case why don't you make an asv benchmark and read say 100 lines or something
I think it would need about 1 million rows, maybe more, to be able to reliably separate a performance regression from variance in network performance. I'd also need a public S3 bucket which could host such a thing, which I don't have. Do you know what the "s3://nyqpug" bucket is? For the unit tests, it would be helpful to be able to host gzip and bzip2 files there as well.
I have a pandas bucket where I can put stuff
Thanks! I'll ping you when that's done.
Force-pushed from f41af0d to 5e824f2.
Test using chunksize added.
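The test itself isn't quoted in this thread. As a rough sketch only of the shape such a chunked-read test could take -- the bucket, file name, chunk size, and assertions below are illustrative, not the actual test added:

```python
import pandas as pd


def test_read_csv_s3_chunksize():
    # Hypothetical path: the real test reads from the public test bucket
    # discussed above.  Parse the file 5 rows at a time.
    reader = pd.read_csv('s3://pandas-test/tips.csv', chunksize=5)
    n_rows = 0
    for chunk in reader:
        assert isinstance(chunk, pd.DataFrame)
        assert not chunk.empty
        n_rows += len(chunk)
    assert n_rows > 0
```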
@jreback, I created compressed versions of the "tips.csv" table for unit tests and large tables of random data for performance regression tests. All are in the commit at stephen-hoover@36b5d3a .
all uploaded here (it's a public bucket), though let's make the big ones only in the perf tests and/or slow
Thanks! Could you put the file from "s3://nyqpug/tips.csv" in that bucket as well? The tests will be simpler if everything's in the same place.
done
Are the files set publicly readable? I can see the contents of the bucket, but when I try to read a file, I get a "Forbidden" error. That happens with the aws CLI as well as with
should be fixed now. annoying that I had to do that for each file individually :<
Force-pushed from 5e824f2 to e61f2d2.
Thanks. The S3 tests are more extensive now. I'll start working on performance tests, but I won't be able to have those until tomorrow afternoon. I'll want to either rebase this PR on a merged #11072 or vice-versa. After that PR merges, the (Python 3) C parser will be able to read bz2 files from S3, and I can change the new tests to reflect that.
Force-pushed from af4f764 to d05c118.
With #11072 merged in, I've updated the tests to reflect the fact that the Python 3 C parser can now read bz2 files from S3. I've also added a set of asv benchmarks for reading from S3 on this PR.
Note that in the > 1 s benchmarks, all of the time is taken in downloading the entire file from S3. Times will vary with network speed. It should be obvious if a future change forces the entire file to be downloaded again. ASV is a really keen tool. I'm happy I found it.
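The benchmark files themselves aren't quoted here. As a minimal sketch of an asv-style benchmark for this kind of read -- the class name, bucket, file name, and row count below are placeholders, not the actual benchmarks, which per the commit message cover each combination of compression type and parser:

```python
import pandas as pd


class S3ReadCSVUncompressed(object):
    # asv times any method whose name starts with "time_".

    def time_read_csv_python_engine(self):
        # Read a fixed number of rows from a large public object; if a
        # change forces the whole file to be downloaded again, this
        # benchmark should slow down noticeably.
        pd.read_csv('s3://pandas-test/large_random.csv',
                    nrows=100, engine='python')

    def time_read_csv_c_engine(self):
        pd.read_csv('s3://pandas-test/large_random.csv',
                    nrows=100, engine='c')
```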
ok, this seems reasonable. pls add a note in whatsnew. ping when all green.
Force-pushed from d05c118 to 81f9bbb.
@jreback, green!
can you rebase?
File reading from AWS S3: Modify the `get_filepath_or_buffer` function such that it only opens the connection to S3, rather than reading the entire file at once. This allows partial reads (e.g. through the `nrows` argument) or chunked reading (e.g. through the `chunksize` argument) without needing to download the entire file first. Include 6 asv benchmarks for reading CSVs from S3: one for each combination of compression type and parser type.
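For illustration only (this snippet is not part of the diff), these are the kinds of calls the change targets; the `s3://nyqpug/tips.csv` path is the file used in the tests above, and the chunk size is arbitrary:

```python
import pandas as pd

# Partial read: only the first 10 rows are parsed, so the whole object no
# longer has to be downloaded up front.
df = pd.read_csv('s3://nyqpug/tips.csv', nrows=10)

# Chunked read: the file is streamed and parsed 5 rows at a time.
for chunk in pd.read_csv('s3://nyqpug/tips.csv', chunksize=5):
    print(len(chunk))
```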
Force-pushed from 81f9bbb to 67879f5.
thank you sir!
File reading from AWS S3: Modify the `get_filepath_or_buffer` function such that it only opens the connection to S3, rather than reading the entire file at once. This allows partial reads (e.g. through the `nrows` argument) or chunked reading (e.g. through the `chunksize` argument) without needing to download the entire file first.

I wasn't sure what the best place was to put the `OnceThroughKey`. (Suggestions for better names welcome.) I don't like putting an entire class inside a function like that, but this keeps the `boto` dependency contained.

The `readline` function, and modifying `next` such that it returns lines, was necessary to allow the Python engine to read uncompressed CSVs.

The Python 2 standard library's `gzip` module needs a `seek` and `tell` function on its inputs, so I reverted to the old behavior there.

Partially addresses #11070 .
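For readers who don't open the diff, here is a rough sketch of the idea behind `OnceThroughKey`, assuming it subclasses `boto`'s `Key`; the buffering details, chunk size, and `_buffer` attribute are illustrative, not the actual implementation:

```python
from boto.s3.key import Key


class OnceThroughKey(Key):
    """Sketch: make a boto Key usable as a line-oriented, read-once stream."""

    def __init__(self, *args, **kwargs):
        encoding = kwargs.pop("encoding", None)  # Python 2 compat
        super(OnceThroughKey, self).__init__(*args, **kwargs)
        self.finished_read = False  # set once the S3 object is exhausted
        self._buffer = b""

    def readline(self):
        # Accumulate bytes from S3 until a full line (or EOF) is available,
        # so the Python parsing engine can consume the key line by line.
        while b"\n" not in self._buffer and not self.finished_read:
            chunk = self.read(8192)
            if not chunk:
                self.finished_read = True
            else:
                self._buffer += chunk
        line, newline, self._buffer = self._buffer.partition(b"\n")
        return line + newline

    def next(self):
        # Iterating over the key yields lines instead of raw byte chunks.
        line = self.readline()
        if not line:
            raise StopIteration
        return line

    __next__ = next  # Python 3 iteration protocol
```

Defining the class inside `get_filepath_or_buffer`, as the description notes, means `boto` is only imported when an S3 path is actually requested.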