
ENH: Adding json line parsing to pd.read_json #9180 #13351

Closed
wants to merge 16 commits into from

Conversation

@aterrel (Contributor) commented Jun 2, 2016

@@ -4,6 +4,7 @@
import copy
from collections import defaultdict
import numpy as np
import StringIO
Review comment (Contributor):

from pandas.compat import StringIO

@jreback (Contributor) commented Jun 3, 2016

lgtm, pls add a whatsnew note.

@jreback added the Enhancement and IO JSON (read_json, to_json, json_normalize) labels Jun 3, 2016
@jreback (Contributor) commented Jun 3, 2016

cc @Komnomnomnom

how does this look?

@aterrel (Contributor, Author) commented Jun 3, 2016

@jreback I think I added all your suggestions. Thanks for the review!

@codecov-io commented Jun 3, 2016

Current coverage is 84.31%

Merging #13351 into master will decrease coverage by 0.22%

@@             master     #13351   diff @@
==========================================
  Files           141        138     -3   
  Lines         51185      51177     -8   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43275      43151   -124   
- Misses         7910       8026   +116   
  Partials          0          0          

Powered by Codecov. Last updated by 506520b...c56a6a8

# If given a json lines file, we break the string into lines, add
# commas and put it in a json list to make a valid json object.
lines = list(StringIO(json))
json = '[' + ','.join(lines) + ']'
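
For illustration, a self-contained sketch of what that transformation does (using io.StringIO here; the PR itself imports pandas' compat StringIO):

# split the newline-delimited records, join them with commas, and wrap
# them in brackets so the ordinary JSON parser sees one valid JSON array
import io
import pandas as pd

jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}'
lines = list(io.StringIO(jsonl))
as_json = '[' + ','.join(lines) + ']'   # '[{"a": 1, "b": 2}\n,{"a": 3, "b": 4}]'
df = pd.read_json(as_json, orient='records')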
Review comment (Contributor):

I don't think json is typically unicode, but I think this might break in PY3 if the original json is encoded.

Can you see if this is the case? (I am not sure)

In [4]: pd.read_json(j.encode('utf-8') + ']')
TypeError: can't concat bytes to str
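
The failure being pointed at is the bytes/str mix on Python 3; a minimal sketch of the problem and the obvious fix (decode before doing any string surgery):

# bytes + str concatenation raises TypeError on Python 3
payload = b'{"a": 1}\n{"a": 2}'
try:
    wrapped = '[' + payload + ']'      # TypeError: can only concatenate str (not "bytes") to str
except TypeError:
    text = payload.decode('utf-8')     # decode first ...
    wrapped = '[' + ','.join(text.splitlines()) + ']'  # ... then wrap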

@aterrel (Contributor, Author) replied Jun 3, 2016 via email

@jreback jreback changed the title ENH: Adding lines read #9180 ENH: Adding json line parsing to pd.read_json #9180 Jun 6, 2016
@mrocklin (Contributor) commented Jun 7, 2016

Dask.dataframe users have started asking about this: dask/dask#1236

@jreback (Contributor) commented Jun 7, 2016

I think we need to pass the encoding option through first; then this PR can work.

@aterrel (Contributor, Author) commented Jun 7, 2016

Yeah, I'm looking through the code a bit more. The problem is that the current code works for json without an encoding, so adding this feature would make the encoding required, which is not exactly a good experience for the user. Additionally, it seems there are several different standards. I'm still grokking the differences and how the parser works, and will update the PR when I figure out the right answer (or if someone else has a better patch).

@jreback (Contributor) commented Jun 7, 2016

@aterrel no, you just need to do something if encoding is not None (None being the default).

This is already handled by io/common/get_filepath_or_buffer; literally this just needs to be passed through (with some tests :)
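
In other words, something like the following (a sketch only; the era-specific signature and return shape of get_filepath_or_buffer are assumptions here):

# hedged sketch: thread the user-supplied encoding through the shared IO
# helper instead of inventing a second decoding path inside read_json
from pandas.io.common import get_filepath_or_buffer

filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf, encoding=encoding)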

@aterrel (Contributor, Author) commented Jun 8, 2016

Not sure if this is the right place to have this commentary, so please let me know if there is a better place.

This code got a little messier than I like. get_filepath_or_buffer only decodes for URLs (which I guess I should add a test for), and if a file is given it basically returns the filepath back and None for the encoding.

The _get_handler adds files correctly, but I have to keep the encoding around.

It would be nice to have a gimme_unicode function that, given any file, URL, or buffer, would produce a unicode stream handler. UnicodeReader looks like a good candidate, but it is specialized to csv.

Anyhow, please let me know if I've misunderstood the situation. It might be good to accept the patch as is and open a new issue that unifies the unicode readers of csv and json (and the other text formats).

@jreback (Contributor) commented Jun 8, 2016

@aterrel yeah, it's a little bit convoluted now; similar to the cleanup needed for compression, we need to do the same for unicode. I will make another issue for that.

@jreback (Contributor) commented Jun 8, 2016

xref #13401

@@ -204,6 +214,18 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
    else:
        json = filepath_or_buffer

    is_bytes = isinstance(json, bytes)
Review comment by @jreback, Jun 30, 2016:

use

if isinstance(json, compat.binary_type):
    json = compat.bytes_to_str(json, encoding)
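
A self-contained version of the decode step being suggested (plain bytes.decode here, to avoid asserting the exact pandas.compat API; the helper name is hypothetical):

# hypothetical helper illustrating the suggestion: decode byte input up
# front so the line-splitting logic only ever sees text
def _maybe_decode(json, encoding=None):
    if isinstance(json, bytes):
        return json.decode(encoding or 'utf-8')
    return json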

@jreback jreback added this to the 0.18.2 milestone Jun 30, 2016
@jreback (Contributor) commented Jun 30, 2016

this closes #13356 fully as well?

lgtm otherwise. ping on green.


.. versionadded:: 0.18.2

encoding : the encoding to use to decode py3 bytes, default is 'utf-8'
Review comment (Member):

Can you put the explanation on the indented next line?

@jreback (Contributor) commented Jul 6, 2016

@aterrel minor comments. ping when pushed and green.

@@ -94,6 +94,9 @@ Other enhancements
- ``eval``'s upcasting rules for ``float32`` types have been updated to be more consistent with NumPy's rules. New behavior will not upcast to ``float64`` if you multiply a pandas ``float32`` object by a scalar float64. (:issue:`12388`)
- ``Series`` has gained the properties ``.is_monotonic``, ``.is_monotonic_increasing``, ``.is_monotonic_decreasing``, similar to ``Index`` (:issue:`13336`)

- ``pd.read_json`` has gained support for reading json lines with ``lines`` option (:issue:`9180`)
- ``pd.read_json`` has gained support for accepting encoding of file or bytes buffer with ``encoding`` option (:issue:`13356`)
Review comment (Contributor):

actually, pls review the json docs (doc/source/io.rst) and see if anything needs to be mentioned / added (e.g. maybe a 1-sentence about how can handle line delimited json)
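
The kind of one-sentence io.rst example being asked for might look like this (a sketch, not the text that landed):

# reading line-delimited (jsonl / ndjson) input with the new keyword
import pandas as pd

jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}'
df = pd.read_json(jsonl, lines=True)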

Review comment (Contributor):

move to 0.19.0

@aterrel (Contributor, Author) commented Jul 9, 2016

Okay, I'll try to get a few of these taken care of tonight. I should also make it able to output json lines as well.
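
For reference, the output side mentioned here eventually shipped as lines=True on to_json, which requires orient='records'; a sketch of the round trip (the exact release it landed in is not covered by this thread):

# line-delimited *output*: one JSON record per line
import pandas as pd

df = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})
out = df.to_json(orient='records', lines=True)
# '{"a":1,"b":2}\n{"a":3,"b":4}'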

@aterrel (Contributor, Author) commented Jul 9, 2016

BTW, there are now conflicts with this PR. Is the usual thing to rebase?

@jreback (Contributor) commented Jul 9, 2016

yes pls rebase

@aterrel (Contributor, Author) commented Jul 19, 2016

@jreback @jorisvandenbossche I've rebased and added an encoding test.


.. ipython:: python

   import pandas as pd
Review comment (Contributor):

don't need the import here

Reply from @aterrel (Author):

ah good catch.

@aterrel (Contributor, Author) commented Jul 19, 2016

fixed the doc issues, will add encodings for to_json later.

examples = []
for dtype in ['category', object]:
    for val in values:
        examples.append(pandas.Series(val, dtype=dtype))
Review comment (Member):

The pandas here is not defined, and can just be removed I think (reason for travis failure)

@jreback (Contributor) commented Jul 20, 2016

lgtm. @jorisvandenbossche ?

@@ -1064,6 +1064,13 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
Handler to call if object cannot otherwise be converted to a
suitable format for JSON. Should receive a single argument which is
the object to convert and return a serialisable object.
lines : boolean, defalut False
Review comment (Contributor):

defalut -> default

@jorisvandenbossche (Member) commented:

I understand that docs for encoding will be added in a follow-up PR? (as this is a new keyword as well?)

@@ -948,6 +948,58 @@ def test_tz_range_is_utc(self):
        df = DataFrame({'DT': dti})
        self.assertEqual(dfexp, pd.json.dumps(df, iso_dates=True))

    def test_read_jsonl(self):
Review comment (Contributor):

can you add some tests that assert ValueError if invalid combination of lines=True and orient?
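
A sketch of the kind of test being requested, assuming (as the released feature does) that lines=True is only valid with orient='records' and anything else raises ValueError:

def test_read_jsonl_bad_orient(self):
    # hypothetical test name; asserts the invalid lines/orient combinations fail
    jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}'
    for orient in ['split', 'index', 'columns', 'values']:
        self.assertRaises(ValueError, pd.read_json, jsonl,
                          orient=orient, lines=True)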

@jreback (Contributor) commented Jul 20, 2016

yes, IIRC we can add encoding in .to_json in a future issue; @aterrel can you create an issue for that as well?


def roundtrip(s, encoding='latin-1'):
    with ensure_clean('test.json') as path:
        s.to_json(path, encoding=encoding)
Review comment (Member):

I am confused, because it is already used here (the encoding keyword), while I don't see it in the docstring/signature of to_json.

Reply (Contributor):

that is a good point!

@jorisvandenbossche (Member) commented:

For the rest, merge away!

@jreback jreback closed this in 6efd743 Jul 24, 2016
@jreback (Contributor) commented Jul 24, 2016

thanks @aterrel

nice PR!

check the built dev-docs in a few hours (or probably tomorrow) and see that everything looks OK for the changes.

@aterrel (Contributor, Author) commented Jul 24, 2016

Sweet. I'll follow along in the other issues to keep everything up to date.

Labels
Enhancement, IO JSON (read_json, to_json, json_normalize)
Development

Successfully merging this pull request may close these issues.

- ENH: support encoding in read_json
- Support ndjson -- newline delimited json -- for streaming data
5 participants