Cannot write partitioned parquet file to S3 #27596


Closed
jkleint opened this issue Jul 25, 2019 · 9 comments

jkleint commented Jul 25, 2019

Apologies if this is a pyarrow issue.

Code Sample, a copy-pastable example if possible

pd.DataFrame({'a': range(5), 'b': range(5)}).to_parquet('s3://mybucket', partition_cols=['b'])

Problem description

Fails with AttributeError: 'NoneType' object has no attribute '_isfilestore'

Traceback (most recent call last):
  File "/python/partparqs3.py", line 8, in <module>
    pd.DataFrame({'a': range(5), 'b': range(5)}).to_parquet('s3://mybucket', partition_cols=['b'])
  File "/python/lib/python3.7/site-packages/pandas/core/frame.py", line 2203, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 252, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 118, in write
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1342, in write_to_dataset
    _mkdir_if_not_exists(fs, root_path)
  File "/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1292, in _mkdir_if_not_exists
    if fs._isfilestore() and not fs.exists(path):
AttributeError: 'NoneType' object has no attribute '_isfilestore'
Exception ignored in: <function AbstractBufferedFile.__del__ at 0x7f529985ca60>
Traceback (most recent call last):
  File "/python/lib/python3.7/site-packages/fsspec/spec.py", line 1146, in __del__
    self.close()
  File "/python/lib/python3.7/site-packages/fsspec/spec.py", line 1124, in close
    self.flush(force=True)
  File "/python/lib/python3.7/site-packages/fsspec/spec.py", line 996, in flush
    self._initiate_upload()
  File "/python/lib/python3.7/site-packages/s3fs/core.py", line 941, in _initiate_upload
    Bucket=bucket, Key=key, ACL=self.acl)
  File "/python/lib/python3.7/site-packages/s3fs/core.py", line 928, in _call_s3
    **kwargs)
  File "/python/lib/python3.7/site-packages/s3fs/core.py", line 182, in _call_s3
    return method(**additional_kwargs)
  File "/python/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/python/lib/python3.7/site-packages/botocore/client.py", line 648, in _make_api_call
    operation_model, request_dict, request_context)
  File "/python/lib/python3.7/site-packages/botocore/client.py", line 667, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 137, in _send_request
    success_response, exception):
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 231, in _needs_retry
    caught_exception=caught_exception, request_dict=request_dict)
  File "/python/lib/python3.7/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/python/lib/python3.7/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/python/lib/python3.7/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 251, in __call__
    caught_exception)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 269, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 317, in __call__
    caught_exception)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 223, in __call__
    attempt_number, caught_exception)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
    raise caught_exception
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 200, in _do_get_response
    http_response = self._send(request)
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 244, in _send
    return self.http_session.send(request)
  File "/python/lib/python3.7/site-packages/botocore/httpsession.py", line 294, in send
    raise HTTPClientError(error=e)
botocore.exceptions.HTTPClientError: An HTTP Client raised and unhandled exception: 'NoneType' object is not iterable

Expected Output

Expected to see partitioned data show up in S3.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.21.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 41.0.0
Cython: 0.29.7
numpy: 1.16.2
scipy: 1.3.0
pyarrow: 0.14.0
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.3
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: 0.3.0
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jbrockmendel added the "IO Parquet (parquet, feather)" label Jul 26, 2019
TomAugspurger (Contributor) commented

Can you verify that the path we pass to write_to_dataset in

  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 252, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 118, in write
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1342, in write_to_dataset
    _mkdir_if_not_exists(fs, root_path)

is correct? pyarrow may want a FileSystem-type thing.
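
A quick way to verify this is to wrap pyarrow.parquet.write_to_dataset and print what pandas hands it. This is only a debug sketch, assuming pandas resolves the function through the pyarrow module at call time (which the traceback above suggests it does):

import pyarrow.parquet as pq

_orig_write_to_dataset = pq.write_to_dataset

def _spy_write_to_dataset(table, root_path, *args, **kwargs):
    # Show what pandas resolved before pyarrow tries to use it.
    print('root_path:', repr(root_path))
    print('filesystem kwarg:', kwargs.get('filesystem'))
    return _orig_write_to_dataset(table, root_path, *args, **kwargs)

pq.write_to_dataset = _spy_write_to_dataset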

cottrell (Contributor) commented

It sounds like usage of s3fs should largely be replaced with fsspec. Can somebody confirm that this is true? I think the fix here is probably some cleanup in io/parquet.py related to that, but are there already plans in progress?
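
For reference, a minimal sketch of what the fsspec route looks like (the bucket name is a placeholder): fsspec resolves a URL into a concrete filesystem object plus a bare path, which is the pair pyarrow ultimately needs.

import fsspec

# With s3fs installed, fs comes back as an s3fs.S3FileSystem and
# root_path as the bare 'bucket/key' form that pyarrow expects.
fs, root_path = fsspec.core.url_to_fs('s3://mybucket/dataset')
print(type(fs).__name__, root_path)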

TomAugspurger (Contributor) commented Sep 12, 2019 via email

getsanjeevdubey commented

@TomAugspurger @cottrell Is this fixed? What's the workaround? Please help.

daviddelucca commented

@getsanjeevdubey I think it's still open. As a workaround, you can write to local disk and upload the files to S3 manually.
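
A minimal sketch of that workaround, assuming s3fs credentials are configured (bucket and local path are placeholders):

import pandas as pd
import s3fs

df = pd.DataFrame({'a': range(5), 'b': range(5)})
df.to_parquet('/tmp/dataset', partition_cols=['b'])  # partitioned write to local disk

fs = s3fs.S3FileSystem()
fs.put('/tmp/dataset', 'mybucket/dataset', recursive=True)  # upload the whole tree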

jkleint (Author) commented Feb 25, 2020

Writing partitioned parquet to S3 is still an issue with Pandas 1.0.1, pyarrow 0.16, and s3fs 0.4.

@TomAugspurger the root_path passed to write_to_dataset looks like <File-like object S3FileSystem, mybucket>.

@getsanjeevdubey you can work around this by giving PyArrow an S3FileSystem directly:

import pandas as pd
import pyarrow
import pyarrow.parquet as pq
import s3fs

dataframe = pd.DataFrame({'a': range(5), 'b': range(5)})
s3bucket = 'mybucket'  # target bucket (placeholder)

pq.write_to_dataset(pyarrow.Table.from_pandas(dataframe), s3bucket,
                    filesystem=s3fs.S3FileSystem(), partition_cols=['b'])

Of course, you'll have to special-case this for S3 paths vs. other destinations in .to_parquet().
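
A sketch of what that special-casing could look like; the helper name and the bare s3:// prefix check are illustrative, not pandas internals:

import pyarrow
import pyarrow.parquet as pq
import s3fs

def to_parquet_partitioned(df, path, partition_cols):
    # Route S3 URLs through an explicit S3FileSystem; everything else
    # goes to pyarrow unchanged. Hypothetical helper, not pandas API.
    table = pyarrow.Table.from_pandas(df)
    if path.startswith('s3://'):
        pq.write_to_dataset(table, path[len('s3://'):],
                            filesystem=s3fs.S3FileSystem(),
                            partition_cols=partition_cols)
    else:
        pq.write_to_dataset(table, path, partition_cols=partition_cols)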

alimcmaster1 (Member) commented

Can you verify that the path we pass to write_to_dataset in

  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 252, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 118, in write
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1342, in write_to_dataset
    _mkdir_if_not_exists(fs, root_path)

is correct? pyarrow may want a FileSystem-type thing.

Yep, looks like this is exactly the problem. Should be fixed by #33632, provided the filesystem kwarg is passed.
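
With that fix in place, the original one-liner should work again, assuming pandas >= 1.0.4 with pyarrow and s3fs installed and credentials configured:

import pandas as pd

pd.DataFrame({'a': range(5), 'b': range(5)}).to_parquet(
    's3://mybucket', partition_cols=['b'])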

@mroeschke added the Bug label Apr 19, 2020
@jreback added this to the 1.1 milestone Apr 22, 2020
jreback (Contributor) commented Apr 26, 2020

closed by #33632

@jreback closed this as completed Apr 26, 2020
@simonjayhawkins modified the milestones: 1.1 → 1.0.4 May 26, 2020