Skip to content

COMPAT: reading json with lines=True from s3, xref #17200 #17201

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Nov 27, 2017
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions ci/requirements-2.7.pip
Original file line number Diff line number Diff line change
Expand Up @@ -8,3 +8,4 @@ py
PyCrypto
mock
ipython
moto
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed this is not always installed

2 changes: 1 addition & 1 deletion ci/requirements-2.7.run
Original file line number Diff line number Diff line change
Expand Up @@ -17,4 +17,4 @@ psycopg2
patsy
pymysql=0.6.3
jinja2=2.8
xarray=0.8.0
xarray=0.8.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert this file

2 changes: 1 addition & 1 deletion ci/requirements-2.7_SLOW.run
Original file line number Diff line number Diff line change
Expand Up @@ -16,4 +16,4 @@ s3fs
psycopg2
pymysql
html5lib
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

beautiful-soup
beautiful-soup
1 change: 1 addition & 0 deletions ci/requirements-3.5.pip
Original file line number Diff line number Diff line change
@@ -1,2 +1,3 @@
xarray==0.9.1
pandas-gbq
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needds to be reverted

moto
2 changes: 1 addition & 1 deletion ci/requirements-3.5.run
Original file line number Diff line number Diff line change
Expand Up @@ -17,4 +17,4 @@ pymysql
psycopg2
s3fs
beautifulsoup4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert

ipython
ipython
2 changes: 1 addition & 1 deletion ci/requirements-3.5_OSX.run
Original file line number Diff line number Diff line change
Expand Up @@ -13,4 +13,4 @@ jinja2
bottleneck
xarray
s3fs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

beautifulsoup4
beautifulsoup4
1 change: 1 addition & 0 deletions ci/requirements-3.6.pip
Original file line number Diff line number Diff line change
@@ -1 +1,2 @@
brotlipy
moto
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

2 changes: 1 addition & 1 deletion ci/requirements-3.6.run
Original file line number Diff line number Diff line change
Expand Up @@ -22,4 +22,4 @@ fastparquet
beautifulsoup4
s3fs
xarray
ipython
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

ipython
2 changes: 1 addition & 1 deletion ci/requirements-3.6_LOCALE.run
Original file line number Diff line number Diff line change
Expand Up @@ -19,4 +19,4 @@ psycopg2
beautifulsoup4
s3fs
xarray
ipython
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

ipython
2 changes: 1 addition & 1 deletion ci/requirements-3.6_LOCALE_SLOW.run
Original file line number Diff line number Diff line change
Expand Up @@ -19,4 +19,4 @@ psycopg2
beautifulsoup4
s3fs
xarray
ipython
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

ipython
1 change: 1 addition & 0 deletions ci/requirements_all.txt
Original file line number Diff line number Diff line change
Expand Up @@ -26,3 +26,4 @@ sqlalchemy
bottleneck
pymysql
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is ok

Jinja2
s3fs
1 change: 1 addition & 0 deletions ci/requirements_dev.txt
Original file line number Diff line number Diff line change
Expand Up @@ -5,4 +5,5 @@ cython
pytest>=3.1.0
pytest-cov
flake8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert this, s3fs is NOT a requirement for dev; we should be robust to not having this installed

s3fs
moto
2 changes: 1 addition & 1 deletion pandas/io/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -206,7 +206,7 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
if _is_s3_url(filepath_or_buffer):
from pandas.io import s3
return s3.get_filepath_or_buffer(filepath_or_buffer,
encoding=encoding,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't add spaces

encoding=encoding,
compression=compression)

if isinstance(filepath_or_buffer, (compat.string_types,
Expand Down
9 changes: 9 additions & 0 deletions pandas/io/json/json.py
Original file line number Diff line number Diff line change
Expand Up @@ -340,12 +340,21 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
json = filepath_or_buffer
elif hasattr(filepath_or_buffer, 'read'):
json = filepath_or_buffer.read()

else:
json = filepath_or_buffer

if lines:
# If given a json lines file, we break the string into lines, add
# commas and put it in a json list to make a valid json object.

"""
Handle encoded bytes arrays in PY3 and bytes objects from certain
readables before using StringIO.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a 1-line comment is fine;

if PY3 and isinstance(json, bytes):
    ...

if isinstance(json, bytes):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 line comment

json = json.decode('utf-8')

lines = list(StringIO(json.strip()))
json = '[' + ','.join(lines) + ']'

Expand Down
38 changes: 36 additions & 2 deletions pandas/tests/io/json/test_pandas.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,11 @@
from pandas.compat import (range, lrange, StringIO,
OrderedDict, is_platform_32bit)
import os

import numpy as np
from pandas import (Series, DataFrame, DatetimeIndex, Timestamp,
read_json, compat)
from datetime import timedelta
moto = pytest.importorskip("moto")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will skip the entire module if moto isn't found. I think some of these tests can run without moto, so do this skip inside each test that needs moto.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be removed, moto is now always installed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added as a regular import

import pandas as pd

from pandas.util.testing import (assert_almost_equal, assert_frame_equal,
Expand Down Expand Up @@ -985,12 +985,46 @@ def test_tz_range_is_utc(self):
df = DataFrame({'DT': dti})
assert dumps(df, iso_dates=True) == dfexp

def test_read_jsonl(self):
def test_read_inline_jsonl(self):
# GH9180
result = read_json('{"a": 1, "b": 2}\n{"b":2, "a" :1}\n', lines=True)
expected = DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
assert_frame_equal(result, expected)

@pytest.yield_fixture(scope="function")
def testbucket_conn(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked, but can you use the fixture defined in pandas.tests.io.parser.test_network? I think if you use s3_resource. You may still nee the s3_resource.create_bucket, but you shouldn't have to worry about moto.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger I was working on trying to reuse the code here. In order to reuse across moduled I'd probably need to move the s3_resource and its dependent fixtures to conftest.py so they'd be available to my test modules.

This may also entail changing how pointing to the data files that are mock-uploaded works since that code would no longer live in test_network. Do you agree with this? Or should I keep my implementation?

""" Fixture for test_read_s3_jsonl"""
boto3 = pytest.importorskip('boto3')
moto.mock_s3().start() # Start and stop mocking only once, surrounding the test run
# to ensure single context is kept.

conn = boto3.client('s3')
Copy link
Contributor

@TomAugspurger TomAugspurger Aug 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This setup stuff can be moved to a pytest fixture. Something like

@mock_s3
@pytest.fixture
def testbucket():
    """Fixture for ..."""
    conn = ...

Then your test_read_s3_jsonl will take a testbucket parameter. I don't know if it will have to be decorated by mock_s3, but I think it will.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also move the boto3 import here and use boto3 = pytest.importorskip('boto3') so that tests trying to use the fixture will skip if boto3 isn't installed.

conn.create_bucket(Bucket='testbucket')
yield conn

moto.mock_s3().stop()

def test_read_s3_jsonl(self, testbucket_conn):
# GH17200
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a pytest.importorskip() here I think.

testbucket_conn.put_object(Body=b'{"a": 1, "b": 2}\n{"b":2, "a" :1}\n', Key='items.jsonl', Bucket='testbucket')

result = read_json('s3://testbucket/items.jsonl', lines=True)
expected = DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
assert_frame_equal(result, expected)

def test_read_local_jsonl(self):
# GH17200
fname = "./tmp_items.jsonl"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the tm.ensure_clean context manager here? Then you don't have to worry about the try / finally.

try:
with open(fname, "w") as infile:
infile.write('{"a": 1, "b": 2}\n{"b":2, "a" :1}\n')
result = read_json(fname, lines=True)
expected = DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
assert_frame_equal(result, expected)
finally:
import os
os.remove(fname)

def test_read_jsonl_unicode_chars(self):
# GH15132: non-ascii unicode characters
# \u201d == RIGHT DOUBLE QUOTATION MARK
Expand Down