
ENH: Adding json line parsing to pd.read_json #9180 #13351

Closed
wants to merge 16 commits into from

Conversation

@aterrel (Contributor) commented Jun 2, 2016

@@ -4,6 +4,7 @@
import copy
from collections import defaultdict
import numpy as np
import StringIO
Review comment (Contributor):

from pandas.compat import StringIO

@jreback (Contributor) commented Jun 3, 2016

lgtm, pls add a whatsnew note.

@jreback added the Enhancement and IO JSON (read_json, to_json, json_normalize) labels Jun 3, 2016
@jreback (Contributor) commented Jun 3, 2016

cc @Komnomnomnom

how does this look?

@aterrel (Contributor, Author) commented Jun 3, 2016

@jreback I think I added all your suggestions. Thanks for the review!

@codecov-io commented Jun 3, 2016

Current coverage is 84.31%

Merging #13351 into master will decrease coverage by 0.22%

@@             master     #13351   diff @@
==========================================
  Files           141        138     -3   
  Lines         51185      51177     -8   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43275      43151   -124   
- Misses         7910       8026   +116   
  Partials          0          0          

Powered by Codecov. Last updated by 506520b...c56a6a8

# If given a json lines file, we break the string into lines, add
# commas and put it in a json list to make a valid json object.
lines = list(StringIO(json))
json = '[' + ','.join(lines) + ']'
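
For illustration, a self-contained sketch of what that transformation does (using io.StringIO here; the PR itself imports pandas' compat StringIO):

# split the newline-delimited records, join them with commas, and wrap
# them in brackets so the ordinary JSON parser sees one valid JSON array
import io
import pandas as pd

jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}'
lines = list(io.StringIO(jsonl))
as_json = '[' + ','.join(lines) + ']'   # '[{"a": 1, "b": 2}\n,{"a": 3, "b": 4}]'
df = pd.read_json(as_json, orient='records')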
Review comment (Contributor):

I don't think json is typically unicode, but I think this might break in PY3 if the original json is encoded.

Can you see if this is the case? (I am not sure)

In [4]: pd.read_json(j.encode('utf-8') + ']')
TypeError: can't concat bytes to str
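
The failure being pointed at is the bytes/str mix on Python 3; a minimal sketch of the problem and the obvious fix (decode before doing any string surgery):

# bytes + str concatenation raises TypeError on Python 3
payload = b'{"a": 1}\n{"a": 2}'
try:
    wrapped = '[' + payload + ']'      # TypeError: can only concatenate str (not "bytes") to str
except TypeError:
    text = payload.decode('utf-8')     # decode first ...
    wrapped = '[' + ','.join(text.splitlines()) + ']'  # ... then wrap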

@aterrel (Contributor, Author) replied Jun 3, 2016 via email

@jreback jreback changed the title ENH: Adding lines read #9180 ENH: Adding json line parsing to pd.read_json #9180 Jun 6, 2016
@mrocklin (Contributor) commented Jun 7, 2016

Dask.dataframe users have started asking about this: dask/dask#1236

@jreback (Contributor) commented Jun 7, 2016

I think we need to pass the encoding option through first; then this PR can work.

@aterrel (Contributor, Author) commented Jun 7, 2016

Yeah, I'm looking through the code a bit more. The problem is that the current code works for json without an encoding, so adding this feature would make the encoding required, which is not exactly a good experience for the user. Additionally, it seems there are several different standards. I'm still grokking the differences and how the parser works, and will update the PR when I figure out the right answer (or if someone else has a better patch).

@jreback (Contributor) commented Jun 7, 2016

@aterrel no, you just need to do something if encoding is not None (None being the default).

This is already handled by io/common/get_filepath_or_buffer; literally this just needs to be passed through (with some tests :)
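
In other words, something like the following (a sketch only; the era-specific signature and return shape of get_filepath_or_buffer are assumptions here):

# hedged sketch: thread the user-supplied encoding through the shared IO
# helper instead of inventing a second decoding path inside read_json
from pandas.io.common import get_filepath_or_buffer

filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf, encoding=encoding)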

@aterrel (Contributor, Author) commented Jun 8, 2016

Not sure if this is the right place to have this commentary, so please let me know if there is a better place.

This code got a little messier than I like. get_filepath_or_buffer only decodes for URLs (which I guess I should add a test for), and if a file is given it basically returns the filepath back and None for the encoding.

The _get_handler adds files correctly, but I have to keep the encoding around.

It would be nice to have a gimme_unicode function that, given any file, URL, or buffer, would produce a unicode stream handler. UnicodeReader looks like a good candidate, but it is specialized to csv.

Anyhow, please let me know if I've misunderstood the situation. It might be good to accept the patch as is and open a new issue that unifies the unicode readers of csv and json (and the other text formats).

@jreback (Contributor) commented Jun 8, 2016

@aterrel yeah, it's a little bit convoluted now; similar to the cleanup needed for compression, we need to do the same for unicode. I will make another issue for that.

@jreback (Contributor) commented Jun 8, 2016

xref #13401

@@ -204,6 +214,18 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
    else:
        json = filepath_or_buffer

    is_bytes = isinstance(json, bytes)
Review comment by @jreback, Jun 30, 2016:

use

if isinstance(json, compat.binary_type):
    json = compat.bytes_to_str(json, encoding)
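
A self-contained version of the decode step being suggested (plain bytes.decode here, to avoid asserting the exact pandas.compat API; the helper name is hypothetical):

# hypothetical helper illustrating the suggestion: decode byte input up
# front so the line-splitting logic only ever sees text
def _maybe_decode(json, encoding=None):
    if isinstance(json, bytes):
        return json.decode(encoding or 'utf-8')
    return json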

@jreback jreback added this to the 0.18.2 milestone Jun 30, 2016
@jreback (Contributor) commented Jun 30, 2016

this closes #13356 fully as well?

lgtm otherwise. ping on green.


.. versionadded:: 0.18.2

encoding : the encoding to use to decode py3 bytes, default is 'utf-8'
Review comment (Member):

Can you put the explanation on the indented next line?

@jreback (Contributor) commented Jul 6, 2016

@aterrel minor comments. ping when pushed and green.

@@ -94,6 +94,9 @@ Other enhancements
- ``eval``'s upcasting rules for ``float32`` types have been updated to be more consistent with NumPy's rules. New behavior will not upcast to ``float64`` if you multiply a pandas ``float32`` object by a scalar float64. (:issue:`12388`)
- ``Series`` has gained the properties ``.is_monotonic``, ``.is_monotonic_increasing``, ``.is_monotonic_decreasing``, similar to ``Index`` (:issue:`13336`)

- ``pd.read_json`` has gained support for reading json lines with ``lines`` option (:issue:`9180`)
- ``pd.read_json`` has gained support for accepting encoding of file or bytes buffer with ``encoding`` option (:issue:`13356`)
Review comment (Contributor):

actually, pls review the json docs (doc/source/io.rst) and see if anything needs to be mentioned / added (e.g. maybe a 1-sentence about how can handle line delimited json)
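
The kind of one-sentence io.rst example being asked for might look like this (a sketch, not the text that landed):

# reading line-delimited (jsonl / ndjson) input with the new keyword
import pandas as pd

jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}'
df = pd.read_json(jsonl, lines=True)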

Review comment (Contributor):

move to 0.19.0

@aterrel (Contributor, Author) commented Jul 9, 2016

Okay, I'll try to get a few of these taken care of tonight. I should also make it able to output json lines as well.
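
For reference, the output side mentioned here eventually shipped as lines=True on to_json, which requires orient='records'; a sketch of the round trip (the exact release it landed in is not covered by this thread):

# line-delimited *output*: one JSON record per line
import pandas as pd

df = pd.DataFrame({'a': [1, 3], 'b': [2, 4]})
out = df.to_json(orient='records', lines=True)
# '{"a":1,"b":2}\n{"a":3,"b":4}'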

@aterrel (Contributor, Author) commented Jul 9, 2016

BTW, there are now conflicts with this PR. Is the usual thing to rebase?

@jreback (Contributor) commented Jul 9, 2016

yes pls rebase

@aterrel (Contributor, Author) commented Jul 19, 2016

@jreback @jorisvandenbossche I've rebased and added an encoding test.


.. ipython:: python

   import pandas as pd
Review comment (Contributor):

don't need the import here

Reply from @aterrel (Author):

ah good catch.

@aterrel (Contributor, Author) commented Jul 19, 2016

fixed the doc issues, will add encodings for to_json later.

examples = []
for dtype in ['category', object]:
    for val in values:
        examples.append(pandas.Series(val, dtype=dtype))
Review comment (Member):

The pandas here is not defined, and can just be removed I think (reason for travis failure)

@jreback (Contributor) commented Jul 20, 2016

lgtm. @jorisvandenbossche ?

@@ -1064,6 +1064,13 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
Handler to call if object cannot otherwise be converted to a
suitable format for JSON. Should receive a single argument which is
the object to convert and return a serialisable object.
lines : boolean, defalut False
Review comment (Contributor):

defalut -> default

@jorisvandenbossche (Member) commented:

I understand that docs for encoding will be added in a follow-up PR? (as this is a new keyword as well?)

@@ -948,6 +948,58 @@ def test_tz_range_is_utc(self):
        df = DataFrame({'DT': dti})
        self.assertEqual(dfexp, pd.json.dumps(df, iso_dates=True))

    def test_read_jsonl(self):
Review comment (Contributor):

can you add some tests that assert ValueError if invalid combination of lines=True and orient?
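
A sketch of the kind of test being requested, assuming (as the released feature does) that lines=True is only valid with orient='records' and anything else raises ValueError:

def test_read_jsonl_bad_orient(self):
    # hypothetical test name; asserts the invalid lines/orient combinations fail
    jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}'
    for orient in ['split', 'index', 'columns', 'values']:
        self.assertRaises(ValueError, pd.read_json, jsonl,
                          orient=orient, lines=True)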

@jreback (Contributor) commented Jul 20, 2016

yes, IIRC we can add encoding in .to_json in a future issue; @aterrel can you create an issue for that as well?


def roundtrip(s, encoding='latin-1'):
    with ensure_clean('test.json') as path:
        s.to_json(path, encoding=encoding)
Review comment (Member):

I am confused, because it is already used here (the encoding keyword), while I don't see it in the docstring/signature of to_json.

Reply (Contributor):

that is a good point!

@jorisvandenbossche (Member) commented:

For the rest, merge away!

@jreback jreback closed this in 6efd743 Jul 24, 2016
@jreback (Contributor) commented Jul 24, 2016

thanks @aterrel

nice PR!

check the built dev-docs in a few hours (or probably tomorrow) and see that everything looks OK for the changes.

@aterrel (Contributor, Author) commented Jul 24, 2016

Sweet. I'll follow along in the other issues to keep everything up to date.

Labels
Enhancement, IO JSON (read_json, to_json, json_normalize)
Development

Successfully merging this pull request may close these issues.

- ENH: support encoding in read_json
- Support ndjson -- newline delimited json -- for streaming data
5 participants