Cannot write partitioned parquet file to S3 #27596


Closed
jkleint opened this issue Jul 25, 2019 · 9 comments

jkleint commented Jul 25, 2019

Apologies if this is a pyarrow issue.

Code Sample, a copy-pastable example if possible

pd.DataFrame({'a': range(5), 'b': range(5)}).to_parquet('s3://mybucket', partition_cols=['b'])

Problem description

Fails with AttributeError: 'NoneType' object has no attribute '_isfilestore'

Traceback (most recent call last):
  File "/python/partparqs3.py", line 8, in <module>
    pd.DataFrame({'a': range(5), 'b': range(5)}).to_parquet('s3://mybucket', partition_cols=['b'])
  File "/python/lib/python3.7/site-packages/pandas/core/frame.py", line 2203, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 252, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 118, in write
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1342, in write_to_dataset
    _mkdir_if_not_exists(fs, root_path)
  File "/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1292, in _mkdir_if_not_exists
    if fs._isfilestore() and not fs.exists(path):
AttributeError: 'NoneType' object has no attribute '_isfilestore'
Exception ignored in: <function AbstractBufferedFile.__del__ at 0x7f529985ca60>
Traceback (most recent call last):
  File "/python/lib/python3.7/site-packages/fsspec/spec.py", line 1146, in __del__
    self.close()
  File "/python/lib/python3.7/site-packages/fsspec/spec.py", line 1124, in close
    self.flush(force=True)
  File "/python/lib/python3.7/site-packages/fsspec/spec.py", line 996, in flush
    self._initiate_upload()
  File "/python/lib/python3.7/site-packages/s3fs/core.py", line 941, in _initiate_upload
    Bucket=bucket, Key=key, ACL=self.acl)
  File "/python/lib/python3.7/site-packages/s3fs/core.py", line 928, in _call_s3
    **kwargs)
  File "/python/lib/python3.7/site-packages/s3fs/core.py", line 182, in _call_s3
    return method(**additional_kwargs)
  File "/python/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/python/lib/python3.7/site-packages/botocore/client.py", line 648, in _make_api_call
    operation_model, request_dict, request_context)
  File "/python/lib/python3.7/site-packages/botocore/client.py", line 667, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 137, in _send_request
    success_response, exception):
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 231, in _needs_retry
    caught_exception=caught_exception, request_dict=request_dict)
  File "/python/lib/python3.7/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/python/lib/python3.7/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/python/lib/python3.7/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 251, in __call__
    caught_exception)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 269, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 317, in __call__
    caught_exception)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 223, in __call__
    attempt_number, caught_exception)
  File "/python/lib/python3.7/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
    raise caught_exception
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 200, in _do_get_response
    http_response = self._send(request)
  File "/python/lib/python3.7/site-packages/botocore/endpoint.py", line 244, in _send
    return self.http_session.send(request)
  File "/python/lib/python3.7/site-packages/botocore/httpsession.py", line 294, in send
    raise HTTPClientError(error=e)
botocore.exceptions.HTTPClientError: An HTTP Client raised and unhandled exception: 'NoneType' object is not iterable

Expected Output

Expected to see partitioned data show up in S3.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.21.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 41.0.0
Cython: 0.29.7
numpy: 1.16.2
scipy: 1.3.0
pyarrow: 0.14.0
xarray: None
IPython: 7.5.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.3
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: 0.3.0
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jbrockmendel added the "IO Parquet (parquet, feather)" label Jul 26, 2019
TomAugspurger (Contributor) commented

Can you verify that the path we pass to write_to_dataset in

  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 252, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 118, in write
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1342, in write_to_dataset
    _mkdir_if_not_exists(fs, root_path)

is correct? pyarrow may want a FileSystem-type thing.
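
A quick way to verify this is to wrap pyarrow.parquet.write_to_dataset and print what pandas hands it. This is only a debug sketch, assuming pandas resolves the function through the pyarrow module at call time (which the traceback above suggests it does):

import pyarrow.parquet as pq

_orig_write_to_dataset = pq.write_to_dataset

def _spy_write_to_dataset(table, root_path, *args, **kwargs):
    # Show what pandas resolved before pyarrow tries to use it.
    print('root_path:', repr(root_path))
    print('filesystem kwarg:', kwargs.get('filesystem'))
    return _orig_write_to_dataset(table, root_path, *args, **kwargs)

pq.write_to_dataset = _spy_write_to_dataset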

cottrell (Contributor) commented

It sounds like usage of s3fs should largely be replaced with fsspec. Can somebody confirm that this is true? I think the fix here is probably some cleanup in io/parquet.py related to that, but are there already plans in progress?
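
For reference, a minimal sketch of what the fsspec route looks like (the bucket name is a placeholder): fsspec resolves a URL into a concrete filesystem object plus a bare path, which is the pair pyarrow ultimately needs.

import fsspec

# With s3fs installed, fs comes back as an s3fs.S3FileSystem and
# root_path as the bare 'bucket/key' form that pyarrow expects.
fs, root_path = fsspec.core.url_to_fs('s3://mybucket/dataset')
print(type(fs).__name__, root_path)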

TomAugspurger (Contributor) commented Sep 12, 2019 via email

getsanjeevdubey commented

@TomAugspurger @cottrell Is this fixed? What's the workaround? Please help.

daviddelucca commented

@getsanjeevdubey I think it's still open. As a workaround, you can write to local disk and upload the files to S3 manually.
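
A minimal sketch of that workaround, assuming s3fs credentials are configured (bucket and local path are placeholders):

import pandas as pd
import s3fs

df = pd.DataFrame({'a': range(5), 'b': range(5)})
df.to_parquet('/tmp/dataset', partition_cols=['b'])  # partitioned write to local disk

fs = s3fs.S3FileSystem()
fs.put('/tmp/dataset', 'mybucket/dataset', recursive=True)  # upload the whole tree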

jkleint (Author) commented Feb 25, 2020

Writing partitioned parquet to S3 is still an issue with Pandas 1.0.1, pyarrow 0.16, and s3fs 0.4.

@TomAugspurger the root_path passed to write_to_dataset looks like <File-like object S3FileSystem, mybucket>.

@getsanjeevdubey you can work around this by giving PyArrow an S3FileSystem directly:

import pandas as pd
import pyarrow
import pyarrow.parquet as pq
import s3fs

dataframe = pd.DataFrame({'a': range(5), 'b': range(5)})
s3bucket = 'mybucket'  # target bucket (placeholder)

pq.write_to_dataset(pyarrow.Table.from_pandas(dataframe), s3bucket,
                    filesystem=s3fs.S3FileSystem(), partition_cols=['b'])

Of course, you'll have to special-case this for S3 paths vs. other destinations in .to_parquet().
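
A sketch of what that special-casing could look like; the helper name and the bare s3:// prefix check are illustrative, not pandas internals:

import pyarrow
import pyarrow.parquet as pq
import s3fs

def to_parquet_partitioned(df, path, partition_cols):
    # Route S3 URLs through an explicit S3FileSystem; everything else
    # goes to pyarrow unchanged. Hypothetical helper, not pandas API.
    table = pyarrow.Table.from_pandas(df)
    if path.startswith('s3://'):
        pq.write_to_dataset(table, path[len('s3://'):],
                            filesystem=s3fs.S3FileSystem(),
                            partition_cols=partition_cols)
    else:
        pq.write_to_dataset(table, path, partition_cols=partition_cols)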

alimcmaster1 (Member) commented

Can you verify that the path we pass to write_to_dataset in

  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 252, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pandas/io/parquet.py", line 118, in write
    partition_cols=partition_cols, **kwargs)
  File "/python/lib/python3.7/site-packages/pyarrow/parquet.py", line 1342, in write_to_dataset
    _mkdir_if_not_exists(fs, root_path)

is correct? pyarrow may want a FileSystem-type thing.

Yep, looks like this is exactly the problem. Should be fixed by #33632, provided the filesystem kwarg is passed.
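
With that fix in place, the original one-liner should work again, assuming pandas >= 1.0.4 with pyarrow and s3fs installed and credentials configured:

import pandas as pd

pd.DataFrame({'a': range(5), 'b': range(5)}).to_parquet(
    's3://mybucket', partition_cols=['b'])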

@mroeschke added the Bug label Apr 19, 2020
@jreback added this to the 1.1 milestone Apr 22, 2020
jreback (Contributor) commented Apr 26, 2020

closed by #33632

@jreback closed this as completed Apr 26, 2020
@simonjayhawkins modified the milestones: 1.1 → 1.0.4 May 26, 2020