read_parquet S3 dir support #26388
Comments
Please file your issue on the Arrow tracker; these are just passed through. |
Both pyarrow and fastparquet support reading from a directory of files. So it can also be an issue in how we call those libraries. @Seanspt what is the error that you get? |
Pandas works fine if I download the spark-saved dir and read it by passing a local dir path. Reading the s3 path directly fails:
>>> pd.read_parquet("s3://xxx/xxx/df")
Traceback (most recent call last):
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 524, in info
Key=key, **self.req_kw)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 195, in _call_s3
return method(**additional_kwargs)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/botocore/client.py", line 320, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/botocore/client.py", line 624, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/io/s3.py", line 29, in get_filepath_or_buffer
filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 350, in open
s3_additional_kwargs=kw)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 1206, in __init__
info = self.info()
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 1224, in info
refresh=refresh, **kwargs)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 535, in info
raise_from(FileNotFoundError(path), e)
File "<string>", line 3, in raise_from
FileNotFoundError: xxx/xxx/df
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 524, in info
Key=key, **self.req_kw)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 195, in _call_s3
return method(**additional_kwargs)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/botocore/client.py", line 320, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/botocore/client.py", line 624, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/io/parquet.py", line 288, in read_parquet
return impl.read(path, columns=columns, **kwargs)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/io/parquet.py", line 124, in read
path, _, _, should_close = get_filepath_or_buffer(path)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/io/common.py", line 209, in get_filepath_or_buffer
mode=mode)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/io/s3.py", line 38, in get_filepath_or_buffer
filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 350, in open
s3_additional_kwargs=kw)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 1206, in __init__
info = self.info()
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 1224, in info
refresh=refresh, **kwargs)
File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 535, in info
raise_from(FileNotFoundError(path), e)
File "<string>", line 3, in raise_from
FileNotFoundError: xxx/xxx/df |
As a reference: pyarrow and fastparquet can do this in several lines of code. |
Thanks for the follow-up! If you read the directory on s3 directly with pyarrow as done in the answer of the StackOverflow question, then it also works for you? Can you show that? |
Yes, the following code works for me.
import s3fs
import pyarrow.parquet as pq
fs = s3fs.S3FileSystem(region_name='cn-north-1')
bucket_uri = 's3://xxx/xxx/df'
dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
table = dataset.read()
df = table.to_pandas() |
Thanks! @TomAugspurger I suppose the problem is that our s3 compat code assumes it is a single file and tries to open it? (read_parquet also accepts a directory name to read.) We use S3FileSystem.open(..), which is for single files. I suppose one option might be to, in case of an s3 path, not use our get_filepath_or_buffer but let pyarrow handle the s3? For reference, the PR where this was added: #19135 |
Letting the engine handle the user-provided path seems reasonable. |
@jorisvandenbossche i'm happy to take a look at this. Your ideas make sense to me. Think we can get rid of our |
closed by #33632 |
Can we reopen this issue? As mentioned in the 1.0.5 release notes (#34785), this is not yet fixed, and it took me a while to find out why I was still getting the error in 1.0.5 despite the fact it was fixed in 1.0.4
It’s being released in pandas 1.1 -> happy to reopen if you have issues. Perhaps try building master and check it works as expected?
But what if the parquet file size is huge? In such cases, can I make python generators over the parquet file? If yes, how? |
I'm getting the same exception here
With the source directory it doesn't:
In both cases:
|
Hi, I'm seeing the same issue; just wondering if you have a way to resolve it? |
When using spark to process data and save to s3, the files are like
It would be nice to load these files in one line.