I frequently find myself interacting with CSV files stored in Amazon's S3 service, and have run into a few areas where I think small improvements in `read_csv` could be a big help.

This is the most important improvement for me. The current pandas code downloads the entire file from S3 before passing it into the parser. If I have a 6 GB file in S3, it's much better to not need to download the entire thing just to check the first few rows with the "nrows" keyword to `read_csv`. Or perhaps I want to process the file one chunk at a time using "chunksize". We can iterate through a file on disk in these ways, but not currently with a file in S3. (The S3 download currently happens inside `get_filepath_or_buffer`.)
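As a rough illustration of what this would enable, here is a sketch that streams an S3 object straight into `read_csv` via boto3. The bucket and key names are placeholders, and whether `read_csv` accepts a raw `StreamingBody` directly can depend on the pandas version:

```python
import boto3
import pandas as pd

s3 = boto3.client("s3")

# Peek at the first few rows: read_csv pulls bytes from the file-like
# StreamingBody on demand, so nrows=5 shouldn't require the whole 6 GB.
body = s3.get_object(Bucket="my-bucket", Key="big-file.csv")["Body"]
head = pd.read_csv(body, nrows=5)

# Same idea for chunked processing: each chunk is parsed as it arrives.
body = s3.get_object(Bucket="my-bucket", Key="big-file.csv")["Body"]
total_rows = 0
for chunk in pd.read_csv(body, chunksize=100_000):
    total_rows += len(chunk)
```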
Infer compression type from S3 filenames (#11074)

If an S3 filename ends with ".gz" or ".bz2", the parser should be able to infer the compression type, just as with a file on disk.
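The inference itself is just an extension check; something along these lines would do (the helper name below is made up, not pandas API, and newer pandas versions expose this behaviour through the `compression='infer'` option):

```python
import os

# Map trailing extensions to the values read_csv's `compression` keyword accepts.
_COMPRESSION_BY_EXT = {".gz": "gzip", ".bz2": "bz2"}

def infer_s3_compression(path):
    """Return 'gzip'/'bz2' for matching suffixes, or None for plain files."""
    return _COMPRESSION_BY_EXT.get(os.path.splitext(path)[1])

infer_s3_compression("s3://my-bucket/table.csv.bz2")  # -> 'bz2'
infer_s3_compression("s3://my-bucket/table.csv")      # -> None
```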
Relatedly, the C parser currently refuses to accept open bz2-compressed file objects at all, and the Python parser decompresses the entire file before continuing, which runs into the same problem as above: needing to read a potentially large file before doing any work.
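A sketch of the streaming alternative, reusing the boto3 setup from the earlier example (placeholder names again): Python 3's `bz2.BZ2File` can wrap an already-open binary stream and decompress lazily, so the parser only pulls what it needs:

```python
import bz2

import boto3
import pandas as pd

s3 = boto3.client("s3")
raw = s3.get_object(Bucket="my-bucket", Key="big-file.csv.bz2")["Body"]

# BZ2File decompresses incrementally as read_csv asks for more bytes,
# so nrows / chunksize still avoid downloading the entire object.
with bz2.BZ2File(raw) as decompressed:
    head = pd.read_csv(decompressed, nrows=5)
```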
Finally, it seems that S3 files can be accessed via either "s3://" or "s3n://" ("S3 native") URLs. I've only run into "s3n://" when using Spark, and I admit I don't fully understand the difference, but it would be useful for pandas to recognize both. Some notes I found:
https://wiki.apache.org/hadoop/AmazonS3
http://notes.mindprince.in/2014/08/01/difference-between-s3-block-and-s3-native-filesystem-on-hadoop.html
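In the meantime, a small wrapper can normalize the scheme before handing the URL to pandas (the function is hypothetical, and it assumes pandas' existing "s3://" support handles the rest):

```python
import pandas as pd

def read_s3_csv(url, **kwargs):
    """Treat Hadoop-style 's3n://' URLs like plain 's3://' before calling read_csv."""
    if url.startswith("s3n://"):
        url = "s3://" + url[len("s3n://"):]
    return pd.read_csv(url, **kwargs)

# df = read_s3_csv("s3n://my-bucket/table.csv", nrows=10)
```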
I will open PRs to address each of these.