
to_parquet fails when S3 is the destination #19134

Closed

maximveksler opened this issue Jan 8, 2018 · 4 comments · Fixed by #19135
Labels: IO Data, IO Parquet
Milestone: 0.23.0

Comments

maximveksler (Contributor) commented Jan 8, 2018

import pandas as pd

df = pd.DataFrame({'field': [1,2,3]})
df.to_parquet("s3://pandas-test/test.parquet", engine='pyarrow')

Problem description

pandas uses s3fs for writing files to S3, but the S3File object is opened in 'rb' (read) mode.

There are several possible failure cases, surfacing as exceptions in the call chain, across three different components:

  • the s3fs S3FileSystem,
  • the pyarrow writer,
  • the fastparquet reader & writer.

pyarrow - write attempt

FileNotFoundError or ValueError (depending on whether the file already exists in S3).

fastparquet - read attempt

Exception when attempting to concatenate a str and an S3File.

fastparquet - write attempt

Exception when attempting to open the path using default_open.
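
A minimal sketch of the mechanism behind the pyarrow write failure (bucket name taken from the reproduction above; behaviour as observed with s3fs 0.1.2):

import s3fs

fs = s3fs.S3FileSystem()

# Read mode stats the object first (the traceback below shows
# `self.size = self.info()['Size']`), so a key that does not exist
# yet raises FileNotFoundError before pyarrow ever gets a handle.
f = fs.open("pandas-test/test.parquet", mode="rb")

# Write mode starts an upload instead and requires no existing object.
with fs.open("pandas-test/test.parquet", mode="wb") as f:
    f.write(b"...")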

Running the reproduction at the top produces:

C:\Users\maxim.veksler\source\DTank\venv\Scripts\python.exe C:/Users/maxim.veksler/source/DTank/DTank/par.py
Traceback (most recent call last):
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 396, in info
    kwargs, Bucket=bucket, Key=key, **self.req_kw)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 170, in _call_s3
    return method(**additional_kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\botocore\client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\botocore\client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\users\maxim.veksler\source\pandas\pandas\io\s3.py", line 25, in get_filepath_or_buffer
    filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer))
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 293, in open
    fill_cache=fill_cache, s3_additional_kwargs=kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 931, in __init__
    self.size = self.info()['Size']
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 940, in info
    return self.s3.info(self.path, **kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 402, in info
    raise FileNotFoundError(path)
FileNotFoundError: pandas-test/test.parquet

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 396, in info
    kwargs, Bucket=bucket, Key=key, **self.req_kw)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 170, in _call_s3
    return method(**additional_kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\botocore\client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\botocore\client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/maxim.veksler/source/DTank/DTank/par.py", line 4, in <module>
    df.to_parquet("s3://pandas-test/test.parquet", engine='pyarrow')
  File "c:\users\maxim.veksler\source\pandas\pandas\core\frame.py", line 1649, in to_parquet
    compression=compression, **kwargs)
  File "c:\users\maxim.veksler\source\pandas\pandas\io\parquet.py", line 227, in to_parquet
    return impl.write(df, path, compression=compression, **kwargs)
  File "c:\users\maxim.veksler\source\pandas\pandas\io\parquet.py", line 110, in write
    path, _, _ = get_filepath_or_buffer(path)
  File "c:\users\maxim.veksler\source\pandas\pandas\io\common.py", line 202, in get_filepath_or_buffer
    compression=compression)
  File "c:\users\maxim.veksler\source\pandas\pandas\io\s3.py", line 34, in get_filepath_or_buffer
    filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer))
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 293, in open
    fill_cache=fill_cache, s3_additional_kwargs=kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 931, in __init__
    self.size = self.info()['Size']
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 940, in info
    return self.s3.info(self.path, **kwargs)
  File "C:\Users\maxim.veksler\source\DTank\venv\lib\site-packages\s3fs\core.py", line 402, in info
    raise FileNotFoundError(path)
FileNotFoundError: pandas-test/test.parquet

Process finished with exit code 1

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 2.8.0
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.13.3
scipy: None
pyarrow: 0.7.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger TomAugspurger added this to the Next Major Release milestone Jan 8, 2018
@TomAugspurger TomAugspurger added the IO Data and IO Parquet labels Jan 8, 2018
jreback (Contributor) commented Jan 9, 2018

cc @martindurant

jreback (Contributor) commented Jan 9, 2018

Have you tried this with the default parquet write (i.e. with pyarrow installed)?

martindurant (Contributor) commented

You mean the bucket does not exist, or isn't writable?
In S3 it is not a simple matter to ascertain whether a user has write access to some location; often the easiest thing to do is just to try, as sketched below.
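
For illustration, a minimal sketch of that "just try" approach (the bucket/key are taken from the reproduction above; the exception types are an assumption based on the traceback, not a documented contract):

import botocore.exceptions
import s3fs

fs = s3fs.S3FileSystem()
try:
    # Attempt the write directly; S3 offers no cheap, reliable
    # up-front permission check for an arbitrary key.
    with fs.open("pandas-test/test.parquet", mode="wb") as f:
        f.write(b"\x00")
except (OSError, botocore.exceptions.ClientError) as exc:
    # A 403 or 404 from HeadObject/PutObject surfaces here.
    print(f"no write access: {exc}")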

maximveksler (Contributor, Author) commented

@jreback, @martindurant yes, pyarrow is installed and found.

The issue is that the S3File is being opened in 'rb' mode; please see the PR for the details of the fix.
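
For context, a minimal sketch of the general shape of such a fix in pandas/io/s3.py (the signature, the anon=False default, and the _strip_schema body are assumptions for illustration; see PR #19135 for the actual change): thread the caller's intended mode through to s3fs instead of hard-coding 'rb'.

import s3fs

def _strip_schema(url):
    # Stand-in for the helper visible in the traceback: drop "s3://".
    return url[len("s3://"):] if url.startswith("s3://") else url

def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
                           compression=None, mode=None):
    # Default to read mode for backwards compatibility, but let writers
    # such as to_parquet pass mode='wb' so s3fs never stats a key that
    # does not exist yet.
    if mode is None:
        mode = 'rb'
    fs = s3fs.S3FileSystem(anon=False)
    filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
    return filepath_or_buffer, None, compression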

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Jan 10, 2018
@maximveksler maximveksler changed the title from "to_parquet fails when S3 path is does not exist" to "to_parquet fails when S3 is the destination" Jan 10, 2018