pd.read_parquet can't handle an S3 directory #28490
Comments
Can you look into our IO code and write a proposal for how we could support this? Generally, we just deal with a single file / buffer. It's possible that this is doable as long as we hand things off to the parquet engine relatively quickly, but I'm not sure. |
Is this specific to S3? Or do we see the same behavior for directories on the local file system? |
@TomAugspurger this problem is specific to S3, it works fine locally. I’ll look into it, thanks. |
I seem to recall, from an earlier similar discussion, the idea to just pass through to pyarrow in the case of S3 and let them handle it (since they support it). Currently we call the code at lines 233 to 238 in 2b28454,
which then, in our S3 code, tries to open the file (lines 20 to 50 in 2b28454).
But since you can pass a directory or a wildcard, it is not always a valid file. So one solution would be, for this case, to just pass the S3 string path directly to pyarrow (and also to fastparquet, I suppose)? |
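A minimal sketch of that pass-through idea (a hypothetical helper, not pandas' actual code; it assumes pyarrow's legacy ParquetDataset API and an s3fs filesystem):

import pyarrow.parquet as pq
import s3fs

def read_s3_parquet_passthrough(path, columns=None, **kwargs):
    # Skip the file-opening step entirely: ParquetDataset accepts a
    # single file, a directory, or a list of paths when it is handed
    # an s3fs filesystem, so the raw s3:// string can go straight through.
    fs = s3fs.S3FileSystem()
    dataset = pq.ParquetDataset(path, filesystem=fs, **kwargs)
    return dataset.read(columns=columns).to_pandas()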
@jorisvandenbossche @TomAugspurger This is exactly what I was thinking. I'll be happy to make the changes. Btw, see #26551, and also apache/arrow@d235f69, which went out in the pyarrow release in July. The ticket says pandas would add this when pyarrow shipped, and it has shipped :) I would be happy to add this as well. Do you have any thoughts? The way this works in pyarrow is:

import s3fs
from pyarrow.parquet import ParquetDataset

# BUCKET and PATH are placeholders
s3_path = f's3://{BUCKET}/{PATH}/Questions.Stratified.Final.2000.parquet'
s3 = s3fs.S3FileSystem()

columns = ['field_1', 'field_2']
filters = [('field_1', '=', 0)]  # pyarrow expects filters as a list of tuples

# Have to use pyarrow to load a parquet directory into pandas;
# columns go to .read(), not the ParquetDataset constructor
posts_df = ParquetDataset(
    s3_path,
    filesystem=s3,
    filters=filters,
).read(columns=columns).to_pandas() |
It is "added" automatically, because additional arguments are passed through to pyarrow (so it should work already). But, a PR to add this to the documentation so it gets more visibility is certainly welcome! |
Fastparquet should handle the string just fine as well. |
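A quick hedged sketch of the fastparquet route, calling the library directly (it assumes s3fs is installed and the directory is readable as a fastparquet dataset, e.g. it contains a _metadata file; bucket and path names are placeholders):

import fastparquet
import s3fs

# fastparquet takes an opener callable rather than a filesystem object
fs = s3fs.S3FileSystem()
pf = fastparquet.ParquetFile('bucket/path/file_dir', open_with=fs.open)
df = pf.to_pandas(columns=['field_1', 'field_2'])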
Ok, I'll do it this weekend! |
I've also experienced many issues with pandas reading S3-based parquet files ever since [...]. Many of the most recent errors appear to be resolved by forcing [...]. |
@jorisvandenbossche @TomAugspurger Hmmm... could this be a solution? |
@rjurney that's a different issue. There have been some recent failures with s3fs, but even if that is fully working, you still have the problem that we should not try to open a file if the argument is a directory. |
On our team we would love to see this addressed; are there any plans to do so? Is this something for which you would consider accepting a pull request? |
@dlindelof agree - let's continue the discussion over in #26388 since that's addressing the same problem |
Is this still an issue for anyone else? :( |
Summary
pyarrow can load parquet files directly from S3. So can Dask. pandas seems not to be able to. This would be really cool, and since you use pyarrow underneath, it should be easy.
For example in pyarrow, even with push-down filters:
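The snippet itself was dropped from this extract; it was presumably the same ParquetDataset call shown in the comments above. A hedged reconstruction (BUCKET and PATH are placeholders):

import s3fs
from pyarrow.parquet import ParquetDataset

BUCKET = 'my-bucket'  # placeholder
PATH = 'my/prefix'    # placeholder

s3 = s3fs.S3FileSystem()
df = ParquetDataset(
    f's3://{BUCKET}/{PATH}/file_dir',
    filesystem=s3,
    filters=[('field_1', '=', 0)],  # push-down filter, as a list of tuples
).read(columns=['field_1', 'field_2']).to_pandas()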
Code
This code loads field names from an S3 file, then loads an S3 directory containing lots of parquet files. Note that I have repeatedly verified that the files are there and are publicly accessible. boto3 can directly access them individually.
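The actual code is also missing from this extract; a minimal reproduction of the failing call might look like this (bucket and path names are placeholders):

import pandas as pd

# file_dir is a directory of parquet part files, not a single file
df = pd.read_parquet('s3://bucket/path/file_dir')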
It throws this exception if I use s3://bucket/path/file_dir:
[exception traceback not captured in this extract]
It throws this exception if I use s3://bucket/path/file_dir/* or *.parquet:
[exception traceback not captured in this extract]
Problem description
It seems that pd.read_parquet can't read a directory-structured Parquet file from Amazon S3. I've tried a wildcard and it also throws an error. This is a bummer :(
Expected Output
Load the data, return a DataFrame.
Output of pd.show_versions()