-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
COMPAT: reading json with lines=True from s3, xref #17200 #17201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 11 commits
8255cd0
caa3c80
4709251
a042e3c
0c164e7
ac99133
125f049
533d404
c7f13b8
17973a1
2ae5a9d
38f043b
1d03d7a
78ee720
9d7e75b
b21401b
6979fb8
bb600cb
5b036ec
c5c4d07
f1122ca
b913972
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,3 +8,4 @@ py | |
PyCrypto | ||
mock | ||
ipython | ||
moto | ||
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,4 +17,4 @@ psycopg2 | |
patsy | ||
pymysql=0.6.3 | ||
jinja2=2.8 | ||
xarray=0.8.0 | ||
xarray=0.8.0 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. revert this file |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,4 +16,4 @@ s3fs | |
psycopg2 | ||
pymysql | ||
html5lib | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same |
||
beautiful-soup | ||
beautiful-soup |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,3 @@ | ||
xarray==0.9.1 | ||
pandas-gbq | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. revert There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this needds to be reverted |
||
moto |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,4 +17,4 @@ pymysql | |
psycopg2 | ||
s3fs | ||
beautifulsoup4 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. revert |
||
ipython | ||
ipython |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,4 +13,4 @@ jinja2 | |
bottleneck | ||
xarray | ||
s3fs | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same |
||
beautifulsoup4 | ||
beautifulsoup4 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,2 @@ | ||
brotlipy | ||
moto | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,4 +22,4 @@ fastparquet | |
beautifulsoup4 | ||
s3fs | ||
xarray | ||
ipython | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same |
||
ipython |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,4 +19,4 @@ psycopg2 | |
beautifulsoup4 | ||
s3fs | ||
xarray | ||
ipython | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same |
||
ipython |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,4 +19,4 @@ psycopg2 | |
beautifulsoup4 | ||
s3fs | ||
xarray | ||
ipython | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same |
||
ipython |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,3 +26,4 @@ sqlalchemy | |
bottleneck | ||
pymysql | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is ok |
||
Jinja2 | ||
s3fs |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,4 +5,5 @@ cython | |
pytest>=3.1.0 | ||
pytest-cov | ||
flake8 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. revert this, s3fs is NOT a requirement for dev; we should be robust to not having this installed |
||
s3fs | ||
moto |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -206,7 +206,7 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None, | |
if _is_s3_url(filepath_or_buffer): | ||
from pandas.io import s3 | ||
return s3.get_filepath_or_buffer(filepath_or_buffer, | ||
encoding=encoding, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. don't add spaces |
||
encoding=encoding, | ||
compression=compression) | ||
|
||
if isinstance(filepath_or_buffer, (compat.string_types, | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -340,12 +340,21 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, | |
json = filepath_or_buffer | ||
elif hasattr(filepath_or_buffer, 'read'): | ||
json = filepath_or_buffer.read() | ||
|
||
else: | ||
json = filepath_or_buffer | ||
|
||
if lines: | ||
# If given a json lines file, we break the string into lines, add | ||
# commas and put it in a json list to make a valid json object. | ||
|
||
""" | ||
Handle encoded bytes arrays in PY3 and bytes objects from certain | ||
readables before using StringIO. | ||
""" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. just a 1-line comment is fine;
|
||
if isinstance(json, bytes): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 1 line comment |
||
json = json.decode('utf-8') | ||
|
||
lines = list(StringIO(json.strip())) | ||
json = '[' + ','.join(lines) + ']' | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,11 +4,11 @@ | |
from pandas.compat import (range, lrange, StringIO, | ||
OrderedDict, is_platform_32bit) | ||
import os | ||
|
||
import numpy as np | ||
from pandas import (Series, DataFrame, DatetimeIndex, Timestamp, | ||
read_json, compat) | ||
from datetime import timedelta | ||
moto = pytest.importorskip("moto") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This will skip the entire module if moto isn't found. I think some of these tests can run without moto, so do this skip inside each test that needs moto. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this can be removed, moto is now always installed There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. added as a regular import |
||
import pandas as pd | ||
|
||
from pandas.util.testing import (assert_almost_equal, assert_frame_equal, | ||
|
@@ -985,12 +985,46 @@ def test_tz_range_is_utc(self): | |
df = DataFrame({'DT': dti}) | ||
assert dumps(df, iso_dates=True) == dfexp | ||
|
||
def test_read_jsonl(self): | ||
def test_read_inline_jsonl(self): | ||
# GH9180 | ||
result = read_json('{"a": 1, "b": 2}\n{"b":2, "a" :1}\n', lines=True) | ||
expected = DataFrame([[1, 2], [1, 2]], columns=['a', 'b']) | ||
assert_frame_equal(result, expected) | ||
|
||
@pytest.yield_fixture(scope="function") | ||
def testbucket_conn(self): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I haven't looked, but can you use the fixture defined in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @TomAugspurger I was working on trying to reuse the code here. In order to reuse across moduled I'd probably need to move the This may also entail changing how pointing to the data files that are mock-uploaded works since that code would no longer live in |
||
""" Fixture for test_read_s3_jsonl""" | ||
boto3 = pytest.importorskip('boto3') | ||
moto.mock_s3().start() # Start and stop mocking only once, surrounding the test run | ||
# to ensure single context is kept. | ||
|
||
conn = boto3.client('s3') | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This setup stuff can be moved to a pytest fixture. Something like @mock_s3
@pytest.fixture
def testbucket():
"""Fixture for ..."""
conn = ... Then your There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also move the boto3 import here and use |
||
conn.create_bucket(Bucket='testbucket') | ||
yield conn | ||
|
||
moto.mock_s3().stop() | ||
|
||
def test_read_s3_jsonl(self, testbucket_conn): | ||
# GH17200 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add a |
||
testbucket_conn.put_object(Body=b'{"a": 1, "b": 2}\n{"b":2, "a" :1}\n', Key='items.jsonl', Bucket='testbucket') | ||
|
||
result = read_json('s3://testbucket/items.jsonl', lines=True) | ||
expected = DataFrame([[1, 2], [1, 2]], columns=['a', 'b']) | ||
assert_frame_equal(result, expected) | ||
|
||
def test_read_local_jsonl(self): | ||
# GH17200 | ||
fname = "./tmp_items.jsonl" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you use the |
||
try: | ||
with open(fname, "w") as infile: | ||
infile.write('{"a": 1, "b": 2}\n{"b":2, "a" :1}\n') | ||
result = read_json(fname, lines=True) | ||
expected = DataFrame([[1, 2], [1, 2]], columns=['a', 'b']) | ||
assert_frame_equal(result, expected) | ||
finally: | ||
import os | ||
os.remove(fname) | ||
|
||
def test_read_jsonl_unicode_chars(self): | ||
# GH15132: non-ascii unicode characters | ||
# \u201d == RIGHT DOUBLE QUOTATION MARK | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed this is not always installed