
read_parquet S3 dir support #26388


Closed
Seanspt opened this issue May 14, 2019 · 15 comments

@Seanspt

Seanspt commented May 14, 2019

When using Spark to process data and save it to S3, the output files look like:

s3://bucket/path/_SUCCESS
s3://bucket/path/part-00000-uuid.snappy.parquet
s3://bucket/path/part-00002-uuid.snappy.parquet
s3://bucket/path/part-00001-uuid.snappy.parquet
s3://bucket/path/part-00003-uuid.snappy.parquet

It would be nice to load these files in one line.

df = pd.read_parquet("s3://bucket/path")
@jreback
Contributor

jreback commented May 14, 2019

Please file your issue on the Arrow tracker.

These paths are just passed through.

@jreback jreback closed this as completed May 14, 2019
@jreback jreback added the IO Parquet parquet, feather label May 14, 2019
@jreback jreback added this to the No action milestone May 14, 2019
@jorisvandenbossche
Member

Both pyarrow and fastparquet support reading from a directory of files, so it could also be an issue in how we call those libraries.

@Seanspt what is the error that you get?

@jorisvandenbossche jorisvandenbossche removed this from the No action milestone May 14, 2019
@Seanspt
Author

Seanspt commented May 16, 2019

Pandas works fine if I download the Spark-saved directory and read it by passing a local directory path. However, when using an s3:// directory path, the error I get is as follows.

>>> pd.read_parquet("s3://xxx/xxx/df")
Traceback (most recent call last):
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 524, in info
    Key=key, **self.req_kw)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 195, in _call_s3
    return method(**additional_kwargs)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/botocore/client.py", line 320, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/botocore/client.py", line 624, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/io/s3.py", line 29, in get_filepath_or_buffer
    filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 350, in open
    s3_additional_kwargs=kw)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 1206, in __init__
    info = self.info()
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 1224, in info
    refresh=refresh, **kwargs)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 535, in info
    raise_from(FileNotFoundError(path), e)
  File "<string>", line 3, in raise_from
FileNotFoundError: xxx/xxx/df

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 524, in info
    Key=key, **self.req_kw)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 195, in _call_s3
    return method(**additional_kwargs)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/botocore/client.py", line 320, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/botocore/client.py", line 624, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/io/parquet.py", line 288, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/io/parquet.py", line 124, in read
    path, _, _, should_close = get_filepath_or_buffer(path)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/io/common.py", line 209, in get_filepath_or_buffer
    mode=mode)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/pandas/io/s3.py", line 38, in get_filepath_or_buffer
    filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 350, in open
    s3_additional_kwargs=kw)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 1206, in __init__
    info = self.info()
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 1224, in info
    refresh=refresh, **kwargs)
  File "/Users/sean/miniconda3/envs/py3/lib/python3.6/site-packages/s3fs/core.py", line 535, in info
    raise_from(FileNotFoundError(path), e)
  File "<string>", line 3, in raise_from
FileNotFoundError: xxx/xxx/df

@Seanspt
Author

Seanspt commented May 16, 2019

For reference: pyarrow and fastparquet can do this in several lines of code.

@jorisvandenbossche
Member

Thanks for the follow-up!

If you read the directory on S3 directly with pyarrow, as done in the answer to the StackOverflow question, does that work for you? Can you show that?

@Seanspt
Author

Seanspt commented May 17, 2019

Yes, the following code works for me.

import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem(region_name='cn-north-1')

bucket_uri = 's3://xxx/xxx/df'

dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
table = dataset.read()
df = table.to_pandas()
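
(A side note on the region argument, as a hedged sketch rather than a confirmed fix: in s3fs, region settings are normally forwarded to the underlying boto3 client via client_kwargs rather than as a top-level keyword, so depending on your s3fs version the construction may need to be:)

import s3fs

# Region settings go to the boto3 client; whether a top-level region_name
# keyword is accepted depends on the s3fs version.
fs = s3fs.S3FileSystem(client_kwargs={"region_name": "cn-north-1"})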

@jorisvandenbossche
Member

jorisvandenbossche commented May 17, 2019

Thanks!

@TomAugspurger I suppose the problem is that our S3 compat code assumes the path is a single file and tries to open it? (read_parquet also accepts a directory name to read.)
We use S3FileSystem.open(..), which is for single files.

I suppose one option might be, in the case of an s3 path, not to use our get_filepath_or_buffer but to let pyarrow handle the S3 access?

For reference, the PR where this was added: #19135
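
A minimal sketch of that option, with a hypothetical helper name (this is not the actual pandas code): for s3:// paths, skip get_filepath_or_buffer, which opens a single file handle, and hand the path to pyarrow, which can treat it as a dataset directory.

import pyarrow.parquet as pq

def _read_with_pyarrow(path, columns=None, **kwargs):
    # Hypothetical dispatch: S3 paths (single files or directories of
    # part files) go through ParquetDataset instead of an opened handle.
    if isinstance(path, str) and path.startswith("s3://"):
        import s3fs

        fs = s3fs.S3FileSystem()
        dataset = pq.ParquetDataset(path, filesystem=fs)
        return dataset.read(columns=columns).to_pandas()
    # Local paths and buffers keep the existing single-file route.
    return pq.read_table(path, columns=columns, **kwargs).to_pandas()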

@TomAugspurger
Contributor

TomAugspurger commented May 17, 2019 via email

@alimcmaster1
Member

> Thanks!
>
> @TomAugspurger I suppose the problem is that our S3 compat code assumes the path is a single file and tries to open it? (read_parquet also accepts a directory name to read.)
> We use S3FileSystem.open(..), which is for single files.
>
> I suppose one option might be, in the case of an s3 path, not to use our get_filepath_or_buffer but to let pyarrow handle the S3 access?
>
> For reference, the PR where this was added: #19135

@jorisvandenbossche I'm happy to take a look at this.

Your ideas make sense to me. I think we can get rid of our get_filepath_or_buffer and pyarrow.parquet.read_table here and generalise to ParquetDataset. Any thoughts/concerns?

@alimcmaster1 alimcmaster1 self-assigned this Apr 17, 2020
@jreback jreback added this to the 1.1 milestone Apr 22, 2020
@jreback
Contributor

jreback commented Apr 26, 2020

Closed by #33632.

@ghost

ghost commented Jun 30, 2020

Can we reopen this issue? As mentioned in the 1.0.5 release notes (#34785), this is not yet fixed; it took me a while to figure out why I was still hitting the error in 1.0.5 despite it having been fixed in 1.0.4.

@alimcmaster1
Member

> Can we reopen this issue? As mentioned in the 1.0.5 release notes (#34785), this is not yet fixed; it took me a while to figure out why I was still hitting the error in 1.0.5 despite it having been fixed in 1.0.4.

The fix is being released in pandas 1.1; happy to reopen if you still have issues.

Perhaps try building master and check that it works as expected?

@animesh-wynk

> Yes, the following code works for me. […]

But what if the Parquet data is huge? In that case, can I read it through a Python generator instead of loading it all at once? If so, how?
Thanks in advance :)
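
One possible approach, as a minimal sketch (it assumes pyarrow >= 3.0 for Dataset.to_batches, and the bucket path is a placeholder): scan the dataset in record batches and yield one small DataFrame per batch instead of materialising the whole table.

import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem()
dataset = ds.dataset("bucket/path", filesystem=fs, format="parquet")

def dataframe_batches(dataset, batch_size=65_536):
    # Yield one pandas DataFrame per Arrow record batch.
    for batch in dataset.to_batches(batch_size=batch_size):
        yield batch.to_pandas()

for chunk in dataframe_batches(dataset):
    ...  # process each chunk, then let it be garbage-collected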

@ilivans

ilivans commented Jun 11, 2021

I'm getting the same exception here, although if I specify one partition it works 🤔:

df = pd.read_parquet("bucket/key/country=X", filesystem=s3, filters=[('day', '=', 10)])

With the source directory it doesn't:

df = pd.read_parquet("bucket/key", filesystem=s3, filters=[('day', '=', 10)])
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-6-6ebb8c990af1> in <module>
----> 1 df = pd.read_parquet("bucket/key", filesystem=s3, filters=[('day', '=', 10)])

~/.virtualenvs/ltr/lib/python3.8/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, use_nullable_dtypes, **kwargs)
    457     """
    458     impl = get_engine(engine)
--> 459     return impl.read(
    460         path, columns=columns, use_nullable_dtypes=use_nullable_dtypes, **kwargs
    461     )

~/.virtualenvs/ltr/lib/python3.8/site-packages/pandas/io/parquet.py in read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
    219         )
    220         try:
--> 221             return self.api.parquet.read_table(
    222                 path_or_handle, columns=columns, **kwargs
    223             ).to_pandas(**to_pandas_kwargs)

~/.virtualenvs/ltr/lib/python3.8/site-packages/pyarrow/parquet.py in read_table(source, columns, use_threads, metadata, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes)
   1698             )
   1699         try:
-> 1700             dataset = _ParquetDatasetV2(
   1701                 source,
   1702                 filesystem=filesystem,

~/.virtualenvs/ltr/lib/python3.8/site-packages/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, **kwargs)
   1556                 infer_dictionary=True)
   1557 
-> 1558         self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
   1559                                    format=parquet_format,
   1560                                    partitioning=partitioning,

~/.virtualenvs/ltr/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    654 
    655     if _is_path_like(source):
--> 656         return _filesystem_dataset(source, **kwargs)
    657     elif isinstance(source, (tuple, list)):
    658         if all(_is_path_like(elem) for elem in source):

~/.virtualenvs/ltr/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    409     factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
    410 
--> 411     return factory.finish(schema)
    412 
    413 

~/.virtualenvs/ltr/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.DatasetFactory.finish()

~/.virtualenvs/ltr/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/.virtualenvs/ltr/lib/python3.8/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_open_input_file()

~/.virtualenvs/ltr/lib/python3.8/site-packages/pyarrow/fs.py in open_input_file(self, path)
    312 
    313         if not self.fs.isfile(path):
--> 314             raise FileNotFoundError(path)
    315 
    316         return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: bucket/key/

In both cases: import s3fs; s3 = s3fs.core.S3FileSystem()
Environment:

pandas==1.2.4
pyarrow==4.0.1
s3fs==0.4.2
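
For what it's worth, a hedged workaround sketch for the directory case (reusing the placeholder path and filter from the report above): build the dataset with pyarrow's ParquetDataset directly instead of going through pd.read_parquet.

import pyarrow.parquet as pq
import s3fs

s3 = s3fs.core.S3FileSystem()
# ParquetDataset resolves the directory listing itself, so the bare
# "bucket/key" prefix is not opened as if it were a single object.
dataset = pq.ParquetDataset("bucket/key", filesystem=s3,
                            filters=[("day", "=", 10)])
df = dataset.read().to_pandas()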

@lu-liu-rft

> I'm getting the same exception here, although if I specify one partition it works 🤔 […]

Hi, I'm seeing the same issue. Just wondering if you have found a way to resolve it?
