ENH: Adding json line parsing to pd.read_json #9180 #13351
Conversation
@@ -4,6 +4,7 @@
import copy
from collections import defaultdict
import numpy as np
import StringIO
from pandas.compat import StringIO
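For context, the compat import picks the right StringIO per interpreter. Roughly this idea (a sketch of the dispatch, not pandas' literal compat code):

```python
import sys

if sys.version_info[0] >= 3:
    # On PY3 the compat module re-exports the pure-Python io.StringIO.
    from io import StringIO
else:
    # On PY2 a C-accelerated implementation is used instead.
    from cStringIO import StringIO  # noqa: F401
```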
lgtm, pls add a whatsnew note.
how does this look?
@jreback I think I added all your suggestions. Thanks for the review!
Current coverage is 84.31%

@@            master   #13351   diff @@
==========================================
  Files          141      138     -3
  Lines        51185    51177     -8
  Methods          0        0
  Messages         0        0
  Branches         0        0
==========================================
- Hits         43275    43151   -124
- Misses        7910     8026   +116
  Partials         0        0
# If given a json lines file, we break the string into lines, add
# commas and put it in a json list to make a valid json object.
lines = list(StringIO(json))
json = '[' + ','.join(lines) + ']'
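To see what this snippet does, here is a tiny self-contained run of the same transformation (names are illustrative; `io.StringIO` stands in for the pandas.compat import used in the PR):

```python
import json as json_lib
from io import StringIO

# Two records of line-delimited JSON: one object per line.
jsonl = '{"a": 1}\n{"a": 2}'

# Splitting into lines, joining with commas, and wrapping in brackets
# yields a valid JSON array; the leftover newlines are legal whitespace.
lines = list(StringIO(jsonl))
as_array = '[' + ','.join(lines) + ']'

assert json_lib.loads(as_array) == [{'a': 1}, {'a': 2}]
```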
I don't think json is typically unicode, but I think this might break in PY3 if the original json is encoded.
Can you see if this is the case? (I am not sure)
In [4]: pd.read_json(j.encode('utf-8') + ']')
TypeError: can't concat bytes to str
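A minimal reproduction of the concern, with the obvious decode-first fix (assuming utf-8):

```python
j = '{"a": 1}'
data = j.encode('utf-8')

# PY3 refuses to mix bytes and str, hence the TypeError above.
try:
    data + ']'
except TypeError:
    pass

# Decoding before any string surgery sidesteps the problem.
wrapped = '[' + data.decode('utf-8') + ']'
```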
Dask.dataframe users have started asking about this: dask/dask#1236
I think we need to pass thru the encoding option first, then this PR can work.
Yeah I'm looking through the code a bit more. The problem is that the current code works for json without an encoding, so adding this feature would make the encoding required, which is not exactly a good experience for the user. Additionally, it seems there are several different standards. I'm still grokking the differences and how the parser works and will update the PR when I figure out the right answer (or if someone else has a better patch).
@aterrel no, you just need to do something if … This is already handled by …
Not sure if this is the right place to have this commentary, so please let me know if there is a better place. This code got a little messier than I like. The … The … It would be nice to have a … Anywho, please let me know if I understand the situation incorrectly. As is, it might be good to accept the patch as is and have a new issue that unifies the unicode readers of csv and json (and other formats that are unicode).
@aterrel yeah it's a little bit convoluted now; similar to the cleanup with compression, we need to do the same for unicode. I will make another issue for that.
xref #13401
@@ -204,6 +214,18 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
    else:
        json = filepath_or_buffer
is_bytes = isinstance(json, bytes)
use
if isinstance(json, compat.binary_type):
    json = compat.bytes_to_str(json, encoding)
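On PY3 those compat helpers boil down to roughly this (a sketch; pandas.compat's actual names are `binary_type` and `bytes_to_str`, and its default encoding may differ):

```python
binary_type = bytes  # what pandas.compat.binary_type aliases on PY3

def bytes_to_str(b, encoding='utf-8'):
    # pandas.compat.bytes_to_str decodes on PY3 and is a pass-through
    # on PY2; this is an approximation, not the library's definition.
    return b.decode(encoding)

json_data = b'{"a": 1}'
if isinstance(json_data, binary_type):
    json_data = bytes_to_str(json_data, 'utf-8')
```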
this closes #13356 fully as well? lgtm otherwise. ping on green.
.. versionadded:: 0.18.2

encoding : the encoding to use to decode py3 bytes, default is 'utf-8'
Can you put the explanation on the indented next line?
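i.e. move the description to the indented continuation line, numpydoc-style:

```
encoding : str, default is 'utf-8'
    The encoding to use to decode py3 bytes.
```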
@aterrel minor comments. ping when pushed and green.
@@ -94,6 +94,9 @@ Other enhancements
- ``eval``'s upcasting rules for ``float32`` types have been updated to be more consistent with NumPy's rules. New behavior will not upcast to ``float64`` if you multiply a pandas ``float32`` object by a scalar float64. (:issue:`12388`)
- ``Series`` has gained the properties ``.is_monotonic``, ``.is_monotonic_increasing``, ``.is_monotonic_decreasing``, similar to ``Index`` (:issue:`13336`)

- ``pd.read_json`` has gained support for reading json lines with ``lines`` option (:issue:`9180`)
- ``pd.read_json`` has gained support for accepting encoding of file or bytes buffer with ``encoding`` option (:issue:`13356`)
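A quick sketch of the two new keywords as the whatsnew entries describe them (0.19-era API; output elided):

```python
import pandas as pd

jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}'

# lines=True: each input line is parsed as one record.
df = pd.read_json(jsonl, lines=True)

# encoding: decode a bytes payload before parsing.
df2 = pd.read_json(jsonl.encode('latin-1'), encoding='latin-1')
```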
actually, pls review the json docs (doc/source/io.rst) and see if anything needs to be mentioned / added (e.g. maybe a one-sentence note about how it can handle line-delimited json)
move to 0.19.0
Okay I'll try to get a few of these taken care of tonight. I should also make it be able to output to json lines as well.
BTW, there are now conflicts with this PR. Is the usual thing to rebase?
yes pls rebase
@jreback @jorisvandenbossche I've rebased and added an encoding test.
.. ipython:: python

   import pandas as pd
don't need the import here
ah good catch.
fixed the doc issues, will add encodings for …
examples = []
for dtype in ['category', object]:
    for val in values:
        examples.append(pandas.Series(val, dtype=dtype))
The pandas here is not defined, and can just be removed I think (reason for travis failure)
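i.e., presumably just the following (the `values` fixture here is a stand-in, and `Series` is assumed to be importable in that test module):

```python
from pandas import Series

values = [['a', 'b'], [1, 2]]  # stand-in for the data in the test
examples = []
for dtype in ['category', object]:
    for val in values:
        examples.append(Series(val, dtype=dtype))
```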
lgtm. @jorisvandenbossche ?
@@ -1064,6 +1064,13 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
    Handler to call if object cannot otherwise be converted to a
    suitable format for JSON. Should receive a single argument which is
    the object to convert and return a serialisable object.
lines : boolean, defalut False
defalut -> default
I understand that docs for …
@@ -948,6 +948,58 @@ def test_tz_range_is_utc(self):
        df = DataFrame({'DT': dti})
        self.assertEqual(dfexp, pd.json.dumps(df, iso_dates=True))

def test_read_jsonl(self):
can you add some tests that assert ValueError for an invalid combination of lines=True and orient?
yes IIRC can add
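A sketch of such a test (pytest-style here for brevity, whereas the PR's suite is unittest-based; the rule that lines=True requires orient='records' is my reading of the final API):

```python
import pandas as pd
import pytest

def test_to_json_lines_invalid_orient():
    df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
    # lines=True is only defined for orient='records';
    # any other orient should be rejected.
    with pytest.raises(ValueError):
        df.to_json(orient='split', lines=True)
```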
def roundtrip(s, encoding='latin-1'):
    with ensure_clean('test.json') as path:
        s.to_json(path, encoding=encoding)
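Presumably the helper finishes by reading the file back and comparing, along these lines (era-appropriate imports; the encoding keyword on to_json is exactly what the next comment questions):

```python
import pandas as pd
from pandas.util.testing import ensure_clean, assert_series_equal

def roundtrip(s, encoding='latin-1'):
    with ensure_clean('test.json') as path:
        # Write with the given encoding, read back, and require equality.
        s.to_json(path, encoding=encoding)
        result = pd.read_json(path, typ='series', encoding=encoding)
        assert_series_equal(result, s)
```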
I am confused because it is already used here (encoding keyword), while I don't see it in the docstring/signature of to_json
that is a good point!
For the rest, merge away!
thanks @aterrel, nice PR! give it a check in a few hours (or prob tomorrow) in the built dev-docs and see that everything looks ok for the changes.
sweet. I'll follow along in the other issues to keep up with them.
git diff upstream/master | flake8 --diff