unable to write JSON to S3 use to_json #28375


Closed
juls858 opened this issue Sep 10, 2019 · 13 comments · Fixed by #31552
Labels
IO JSON (read_json, to_json, json_normalize) · IO Network (Local or Cloud (AWS, GCS, etc.) IO Issues)
Milestone
1.1

Comments

juls858 commented Sep 10, 2019

Code Sample, a copy-pastable example if possible

S3 paths work for reading and writing CSV. However, I am not able to write JSON files using the to_json method. Reading JSON from an S3 path works just fine.

df.to_json(s3uri, orient='values')
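
For comparison, a minimal sketch of the asymmetry (bucket and key names are placeholders):

import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 3]})
df.to_csv('s3://bucket/folder/foo.csv')    # works
df.to_json('s3://bucket/folder/foo.json',  # raises FileNotFoundError
           orient='values')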

Problem description

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-...> in <module>
      1 final_users['user_id'].to_json(s3_path + 'users.json',
----> 2                                orient='values')

~/anaconda3/envs/qa-tool/lib/python3.7/site-packages/pandas/core/generic.py in to_json(self, path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index)
   2271                             default_handler=default_handler,
   2272                             lines=lines, compression=compression,
-> 2273                             index=index)
   2274 
   2275     def to_hdf(self, path_or_buf, key, **kwargs):

~/anaconda3/envs/qa-tool/lib/python3.7/site-packages/pandas/io/json/json.py in to_json(path_or_buf, obj, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index)
     66 
     67     if isinstance(path_or_buf, compat.string_types):
---> 68         fh, handles = _get_handle(path_or_buf, 'w', compression=compression)
     69         try:
     70             fh.write(s)

~/anaconda3/envs/qa-tool/lib/python3.7/site-packages/pandas/io/common.py in _get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
    425         elif is_text:
    426             # Python 3 and no explicit encoding
--> 427             f = open(path_or_buf, mode, errors='replace', newline="")
    428         else:
    429             # Python 3 and binary mode

FileNotFoundError: [Errno 2] No such file or directory: 's3://bucket/folder/foo.json'

Expected Output

No output expected; the file should simply be written to the S3 bucket without an error.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.24.2
pytest: None
pip: 19.2.2
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: 1.3.1
pyarrow: None
xarray: None
IPython: 7.7.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: 0.2.1
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


WillAyd commented Sep 10, 2019

Thanks for the report. If you'd like to investigate and push a PR to fix it, that would certainly be welcome.

WillAyd added IO Data (IO issues that don't fit into a more specific label) and IO JSON (read_json, to_json, json_normalize) labels Sep 10, 2019

lanhoter commented Sep 14, 2019

I think you can try using StringIO if you are on Python 3; on Python 2 it should be BytesIO.

import io

import boto3
import pandas as pd

jsonBuffer = io.StringIO()

Cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}

df = pd.DataFrame(Cars, columns=['Brand', 'Price'])

# aws_access_key_id, aws_secret_access_key and Bucket are placeholders
# for your own credentials and bucket name
session = boto3.Session(aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key)
s3 = session.resource('s3')
df.to_json(jsonBuffer, orient='values')  # serialize into the in-memory buffer
s3.Object(Bucket, 'PATH/TO/test.json').put(Body=jsonBuffer.getvalue())
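
An alternative sketch that skips boto3, assuming s3fs is installed and credentials are available in the environment (bucket and key are placeholders):

import s3fs

fs = s3fs.S3FileSystem()
# to_json() with no path returns the JSON document as a string;
# encode it and write the bytes to S3 directly
with fs.open('bucket/PATH/TO/test.json', 'wb') as f:
    f.write(df.to_json(orient='values').encode('utf-8'))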

jbrockmendel added IO Network (Local or Cloud (AWS, GCS, etc.) IO Issues) and removed IO Data (IO issues that don't fit into a more specific label) labels Dec 12, 2019
YagoGG added a commit to YagoGG/pandas that referenced this issue Feb 1, 2020
jreback added this to the 1.1 milestone Feb 1, 2020
YagoGG added a commit to YagoGG/pandas that referenced this issue Feb 1, 2020
@rohan-gt

Why was this issue closed? The problem still persists.


jreback commented Jun 26, 2020

@rohitkg98 issues are closed when the corresponding PR is merged.
If you have a reproducible example that doesn't work on master, please open a new issue.


rjurney commented Jul 1, 2020

@jreback This is still broken using the latest pandas 1.0.5 on Anaconda Python 3.6.10 on the Amazon Deep Learning Ubuntu image. It is clearly trying to do a simple open() on the filesystem with no S3 handling whatsoever.

import numpy as np
import pandas as pd
import pyarrow
import s3fs
import torch

...

    df.to_json(
        output_path,
        orient='records',
        lines=True,
    )
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/data/matcha.doodle.backend/ingestion/stackoverflow/encode_records.py", line 95, in encode_records
    lines=True,
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/generic.py", line 2364, in to_json
    indent=indent,
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/json/_json.py", line 92, in to_json
    fh, handles = get_handle(path_or_buf, "w", compression=compression)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/common.py", line 431, in get_handle
    f = open(path_or_buf, mode, errors="replace", newline="")
FileNotFoundError: [Errno 2] No such file or directory: 's3://mypath/0.json'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "ingestion/stackoverflow/encode_bert_docs.py", line 53, in <module>
    main()
  File "ingestion/stackoverflow/encode_bert_docs.py", line 41, in main
    lengths = pool.map(encode_records, device_numbers)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/data/matcha.doodle.backend/ingestion/stackoverflow/encode_records.py", line 95, in encode_records
    lines=True,
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/generic.py", line 2364, in to_json
    indent=indent,
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/json/_json.py", line 92, in to_json
    fh, handles = get_handle(path_or_buf, "w", compression=compression)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/common.py", line 431, in get_handle
    f = open(path_or_buf, mode, errors="replace", newline="")
FileNotFoundError: [Errno 2] No such file or directory: 's3://mypath/0.json'


jreback commented Jul 1, 2020

@rjurney this was merged for 1.1, which has not been released yet
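
Once 1.1 is out, the original call should work directly, assuming s3fs is installed (the path is a placeholder):

import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 3]})
# pandas >= 1.1 routes s3:// URLs through s3fs for writing as well
df.to_json('s3://bucket/folder/foo.json', orient='values')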


rjurney commented Jul 1, 2020

@jreback thanks, any idea when it will be released?


jreback commented Jul 1, 2020

Likely in July.

@manugarri

@jreback I think this functionality is in fact back with 1.1.0! However, there is an issue when specifying compression.

For example, writing a DataFrame to S3 like this:
df.to_json("s3://bucket_key_path.gz", compression="gzip") writes the file as plain text, not compressed.

Should I create a new issue to tackle this?
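
In the meantime, a manual gzip upload along these lines should produce a genuinely compressed object, assuming a boto3 session as earlier in the thread (bucket and key are placeholders):

import gzip
import io

import boto3

buf = io.BytesIO()
# compress the JSON payload ourselves instead of relying on compression=
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(df.to_json(orient='records').encode('utf-8'))

s3 = boto3.Session().resource('s3')
s3.Object('bucket', 'key_path.json.gz').put(Body=buf.getvalue())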


jreback commented Aug 14, 2020

you can create a new issue

@imanebosch

@manugarri, have you already created the issue? Can you link it? Thanks

@manugarri

@imanebosch I did not; I ended up using dask directly.

@abin-tiger

Still facing this issue
