I frequently find myself interacting with CSV files stored in Amazon's S3 service, and have run into a few areas where I think small improvements in `read_csv` could be a big help.

This is the most important improvement for me. The current pandas code downloads the entire file from S3 before passing it into the parser. If I have a 6 GB file in S3, it's much better to not need to download the entire thing just to check the first few rows with the "nrows" keyword to `read_csv`. Or perhaps I want to process the file one chunk at a time using "chunksize". We can iterate through a file on disk in these ways, but not currently with a file in S3. (The S3 download currently happens inside `get_filepath_or_buffer`.)
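As a rough illustration of what this would enable, here is a sketch that streams an S3 object straight into `read_csv` via boto3. The bucket and key names are placeholders, and whether `read_csv` accepts a raw `StreamingBody` directly can depend on the pandas version:

```python
import boto3
import pandas as pd

s3 = boto3.client("s3")

# Peek at the first few rows: read_csv pulls bytes from the file-like
# StreamingBody on demand, so nrows=5 shouldn't require the whole 6 GB.
body = s3.get_object(Bucket="my-bucket", Key="big-file.csv")["Body"]
head = pd.read_csv(body, nrows=5)

# Same idea for chunked processing: each chunk is parsed as it arrives.
body = s3.get_object(Bucket="my-bucket", Key="big-file.csv")["Body"]
total_rows = 0
for chunk in pd.read_csv(body, chunksize=100_000):
    total_rows += len(chunk)
```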
Infer compression type from S3 filenames (#11074)

If an S3 filename ends with ".gz" or ".bz2", the parser should be able to infer the compression type, just as with a file on disk.
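The inference itself is just an extension check; something along these lines would do (the helper name below is made up, not pandas API, and newer pandas versions expose this behaviour through the `compression='infer'` option):

```python
import os

# Map trailing extensions to the values read_csv's `compression` keyword accepts.
_COMPRESSION_BY_EXT = {".gz": "gzip", ".bz2": "bz2"}

def infer_s3_compression(path):
    """Return 'gzip'/'bz2' for matching suffixes, or None for plain files."""
    return _COMPRESSION_BY_EXT.get(os.path.splitext(path)[1])

infer_s3_compression("s3://my-bucket/table.csv.bz2")  # -> 'bz2'
infer_s3_compression("s3://my-bucket/table.csv")      # -> None
```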
Relatedly, the C parser currently refuses to accept open bz2-compressed file objects at all, and the Python parser decompresses the entire file before continuing, which runs into the same problem as above: needing to read a potentially large file before doing any work.
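A sketch of the streaming alternative, reusing the boto3 setup from the earlier example (placeholder names again): Python 3's `bz2.BZ2File` can wrap an already-open binary stream and decompress lazily, so the parser only pulls what it needs:

```python
import bz2

import boto3
import pandas as pd

s3 = boto3.client("s3")
raw = s3.get_object(Bucket="my-bucket", Key="big-file.csv.bz2")["Body"]

# BZ2File decompresses incrementally as read_csv asks for more bytes,
# so nrows / chunksize still avoid downloading the entire object.
with bz2.BZ2File(raw) as decompressed:
    head = pd.read_csv(decompressed, nrows=5)
```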
Finally, it seems that S3 files can be accessed via either "s3://" or "s3n://" ("S3 native") URLs. I've only run into "s3n://" when using Spark, and I admit I don't fully understand the difference, but it would be useful for pandas to recognize both. Some notes I found:
https://wiki.apache.org/hadoop/AmazonS3
http://notes.mindprince.in/2014/08/01/difference-between-s3-block-and-s3-native-filesystem-on-hadoop.html
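In the meantime, a small wrapper can normalize the scheme before handing the URL to pandas (the function is hypothetical, and it assumes pandas' existing "s3://" support handles the rest):

```python
import pandas as pd

def read_s3_csv(url, **kwargs):
    """Treat Hadoop-style 's3n://' URLs like plain 's3://' before calling read_csv."""
    if url.startswith("s3n://"):
        url = "s3://" + url[len("s3n://"):]
    return pd.read_csv(url, **kwargs)

# df = read_s3_csv("s3n://my-bucket/table.csv", nrows=10)
```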
I will open PRs to address each of these.