unable to write JSON to S3 use to_json #28375


Closed
juls858 opened this issue Sep 10, 2019 · 13 comments · Fixed by #31552
Labels
IO JSON (read_json, to_json, json_normalize) · IO Network (Local or Cloud (AWS, GCS, etc.) IO Issues)
Milestone
1.1

Comments

juls858 commented Sep 10, 2019

Code Sample, a copy-pastable example if possible

S3 paths work for reading and writing CSV. However, I am not able to write JSON files using the to_json method. Reading JSON from an S3 path works just fine.

df.to_json(s3uri, orient='values')
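
For comparison, a minimal sketch of the asymmetry (bucket and key names are placeholders):

import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 3]})
df.to_csv('s3://bucket/folder/foo.csv')    # works
df.to_json('s3://bucket/folder/foo.json',  # raises FileNotFoundError
           orient='values')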

Problem description

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-...> in <module>
      1 final_users['user_id'].to_json(s3_path + 'users.json',
----> 2                                orient='values')

~/anaconda3/envs/qa-tool/lib/python3.7/site-packages/pandas/core/generic.py in to_json(self, path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index)
   2271                             default_handler=default_handler,
   2272                             lines=lines, compression=compression,
-> 2273                             index=index)
   2274 
   2275     def to_hdf(self, path_or_buf, key, **kwargs):

~/anaconda3/envs/qa-tool/lib/python3.7/site-packages/pandas/io/json/json.py in to_json(path_or_buf, obj, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index)
     66 
     67     if isinstance(path_or_buf, compat.string_types):
---> 68         fh, handles = _get_handle(path_or_buf, 'w', compression=compression)
     69         try:
     70             fh.write(s)

~/anaconda3/envs/qa-tool/lib/python3.7/site-packages/pandas/io/common.py in _get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
    425         elif is_text:
    426             # Python 3 and no explicit encoding
--> 427             f = open(path_or_buf, mode, errors='replace', newline="")
    428         else:
    429             # Python 3 and binary mode

FileNotFoundError: [Errno 2] No such file or directory: 's3://bucket/folder/foo.json'

Expected Output

No output expected; the file should simply be written to the S3 bucket without an error.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.24.2
pytest: None
pip: 19.2.2
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: 1.3.1
pyarrow: None
xarray: None
IPython: 7.7.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: 0.2.1
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


WillAyd commented Sep 10, 2019

Thanks for the report. If you'd like to investigate and push a PR to fix it, that would certainly be welcome.

WillAyd added IO Data (IO issues that don't fit into a more specific label) and IO JSON (read_json, to_json, json_normalize) labels Sep 10, 2019

lanhoter commented Sep 14, 2019

I think you can try using StringIO if you are on Python 3; on Python 2 it should be BytesIO.

import io

import boto3
import pandas as pd

jsonBuffer = io.StringIO()

Cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}

df = pd.DataFrame(Cars, columns=['Brand', 'Price'])

# aws_access_key_id, aws_secret_access_key and Bucket are placeholders
# for your own credentials and bucket name
session = boto3.Session(aws_access_key_id=aws_access_key_id,
                        aws_secret_access_key=aws_secret_access_key)
s3 = session.resource('s3')
df.to_json(jsonBuffer, orient='values')  # serialize into the in-memory buffer
s3.Object(Bucket, 'PATH/TO/test.json').put(Body=jsonBuffer.getvalue())
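
An alternative sketch that skips boto3, assuming s3fs is installed and credentials are available in the environment (bucket and key are placeholders):

import s3fs

fs = s3fs.S3FileSystem()
# to_json() with no path returns the JSON document as a string;
# encode it and write the bytes to S3 directly
with fs.open('bucket/PATH/TO/test.json', 'wb') as f:
    f.write(df.to_json(orient='values').encode('utf-8'))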

jbrockmendel added IO Network (Local or Cloud (AWS, GCS, etc.) IO Issues) and removed IO Data (IO issues that don't fit into a more specific label) labels Dec 12, 2019
YagoGG added a commit to YagoGG/pandas that referenced this issue Feb 1, 2020
jreback added this to the 1.1 milestone Feb 1, 2020
YagoGG added a commit to YagoGG/pandas that referenced this issue Feb 1, 2020
@rohan-gt

Why was this issue closed? The problem still persists.


jreback commented Jun 26, 2020

@rohitkg98 issues are closed when the corresponding PR is merged.
If you have a reproducible example that doesn't work on master, please open a new issue.


rjurney commented Jul 1, 2020

@jreback This is still broken using the latest pandas 1.0.5 on Anaconda Python 3.6.10 on the Amazon Deep Learning Ubuntu image. It is clearly trying to do a simple open() on the filesystem with no S3 handling whatsoever.

import numpy as np
import pandas as pd
import pyarrow
import s3fs
import torch

...

    df.to_json(
        output_path,
        orient='records',
        lines=True,
    )
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/data/matcha.doodle.backend/ingestion/stackoverflow/encode_records.py", line 95, in encode_records
    lines=True,
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/generic.py", line 2364, in to_json
    indent=indent,
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/json/_json.py", line 92, in to_json
    fh, handles = get_handle(path_or_buf, "w", compression=compression)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/common.py", line 431, in get_handle
    f = open(path_or_buf, mode, errors="replace", newline="")
FileNotFoundError: [Errno 2] No such file or directory: 's3://mypath/0.json'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "ingestion/stackoverflow/encode_bert_docs.py", line 53, in <module>
    main()
  File "ingestion/stackoverflow/encode_bert_docs.py", line 41, in main
    lengths = pool.map(encode_records, device_numbers)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/data/matcha.doodle.backend/ingestion/stackoverflow/encode_records.py", line 95, in encode_records
    lines=True,
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/core/generic.py", line 2364, in to_json
    indent=indent,
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/json/_json.py", line 92, in to_json
    fh, handles = get_handle(path_or_buf, "w", compression=compression)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pandas/io/common.py", line 431, in get_handle
    f = open(path_or_buf, mode, errors="replace", newline="")
FileNotFoundError: [Errno 2] No such file or directory: 's3://mypath/0.json'


jreback commented Jul 1, 2020

@rjurney this was merged for 1.1, which has not been released yet
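
Once 1.1 is out, the original call should work directly, assuming s3fs is installed (the path is a placeholder):

import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 3]})
# pandas >= 1.1 routes s3:// URLs through s3fs for writing as well
df.to_json('s3://bucket/folder/foo.json', orient='values')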


rjurney commented Jul 1, 2020

@jreback thanks, any idea when it will be released?


jreback commented Jul 1, 2020

Likely in July.

@manugarri

@jreback I think this functionality is in fact back with 1.1.0! However, there is an issue when specifying compression.

For example, writing a DataFrame to S3 like this:
df.to_json("s3://bucket_key_path.gz", compression="gzip") writes the file as plain text, not compressed.

Should I create a new issue to tackle this?
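
In the meantime, a manual gzip upload along these lines should produce a genuinely compressed object, assuming a boto3 session as earlier in the thread (bucket and key are placeholders):

import gzip
import io

import boto3

buf = io.BytesIO()
# compress the JSON payload ourselves instead of relying on compression=
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(df.to_json(orient='records').encode('utf-8'))

s3 = boto3.Session().resource('s3')
s3.Object('bucket', 'key_path.json.gz').put(Body=buf.getvalue())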


jreback commented Aug 14, 2020

you can create a new issue

@imanebosch

@manugarri, have you already created the issue? Can you link it? Thanks

@manugarri

@imanebosch I did not; I ended up using dask directly.

@abin-tiger

Still facing this issue
