DOC: Expand reference doc for read_json #14284


Closed
wants to merge 2 commits into from

Conversation

cswarth
Contributor

@cswarth cswarth commented Sep 23, 2016

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • expanded the reference documentation for pandas.read_json(), concentrating especially on the orient parameter. Also added some example usage code and explicitly mentioned to_json() as a source of valid JSON strings.

pandas.read_json

pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False)[source]

Convert a JSON string to pandas object

Parameters:

path_or_buf : a valid JSON string or file-like, default: None

The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.json

orient : string, indicating the expected format of the JSON input.

The set of allowed orients changes depending on the value of the typ parameter.

  • when typ == 'series',
    • allowed orients are {'split','records','index'}
    • default is 'index'
    • The Series index must be unique for orient 'index'.
  • when typ == 'frame',
    • allowed orients are {'split','records','index', 'columns','values'}
    • default is 'columns'
    • The DataFrame index must be unique for orients ‘index’ and ‘columns’.
    • The DataFrame columns must be unique for orients ‘index’, ‘columns’, and ‘records’.

The value of orient specifies the expected format of the JSON string. The expected JSON formats are compatible with the strings produced by to_json() with a corresponding value of orient.

  • 'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
  • 'records' : list like [{column -> value}, ... , {column -> value}]
  • 'index' : dict like {index -> {column -> value}}
  • 'columns' : dict like {column -> {index -> value}}
  • 'values' : just the values array
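The Examples section below exercises 'split', 'records', and 'index'; as a supplementary sketch (not part of the original docstring), the remaining two orients could be tried like this, wrapping the literal strings in io.StringIO since read_json also accepts file-like objects:

```python
from io import StringIO  # read_json accepts file-like objects
import pandas as pd

# 'columns' orient: dict of {column -> {index -> value}}
df_cols = pd.read_json(
    StringIO('{"col 1":{"row 1":"a","row 2":"c"},'
             '"col 2":{"row 1":"b","row 2":"d"}}'),
    orient='columns')

# 'values' orient: just the nested values array; row and column
# labels fall back to a default RangeIndex
df_vals = pd.read_json(StringIO('[["a","b"],["c","d"]]'),
                       orient='values')
```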

typ : type of object to recover (series or frame), default ‘frame’

dtype : boolean or dict, default True

If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all. Applies only to the data.
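As an illustrative sketch (the column names here are invented, not from the docstring), compare the default inference with a dtype mapping:

```python
from io import StringIO
import pandas as pd

data = '{"a":{"0":"1","1":"2"},"b":{"0":"x","1":"y"}}'

# dtype=True (default): the string values "1" and "2" in column
# "a" are inferred as numbers
inferred = pd.read_json(StringIO(data))

# dict of column -> dtype: pin column "a" to stay as strings
pinned = pd.read_json(StringIO(data), dtype={'a': str})
```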

convert_axes : boolean, default True

Try to convert the axes to the proper dtypes.

convert_dates : boolean, default True

List of columns to parse for dates; if True, then try to parse datelike columns. A column label is datelike if

  • it ends with '_at',
  • it ends with '_time',
  • it begins with 'timestamp',
  • it is 'modified', or
  • it is 'date'
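A sketch of these naming rules (hypothetical column names, not from the docstring): with the defaults, 'created_at' ends with '_at' and is parsed from epoch milliseconds, while convert_dates=False disables the conversion entirely:

```python
from io import StringIO
import pandas as pd

# "created_at" matches the datelike naming rules, so with the
# default keep_default_dates=True it is parsed from epoch
# milliseconds; "score" is left untouched
data = '{"created_at":{"0":1356998400000},"score":{"0":5}}'

parsed = pd.read_json(StringIO(data))

# convert_dates=False switches date parsing off entirely
raw = pd.read_json(StringIO(data), convert_dates=False)
```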

keep_default_dates : boolean, default True

If parsing dates, then parse the default datelike columns

numpy : boolean, default False

Direct decoding to numpy arrays. Supports numeric data only, but non-numeric column and index labels are supported. Note also that the JSON ordering MUST be the same for each term if numpy=True.

precise_float : boolean, default False

Set to enable usage of the higher-precision (strtod) function when decoding strings to double values. The default (False) is to use the fast but less precise built-in functionality.

date_unit : string, default None

The timestamp unit to detect if converting dates. The default behaviour is to try and detect the correct precision, but if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force parsing only seconds, milliseconds, microseconds or nanoseconds respectively.

lines : boolean, default False

Read the file as a json object per line.

New in version 0.19.0.
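A minimal sketch of lines=True (invented data, wrapped in io.StringIO): each line of the input is one JSON object, and each object becomes one row:

```python
from io import StringIO
import pandas as pd

# one JSON object per line (the "JSON Lines" / ndjson format)
jsonl = '{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n'

df = pd.read_json(StringIO(jsonl), lines=True)
```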

encoding : str, default is ‘utf-8’

The encoding to use to decode py3 bytes.

New in version 0.19.0.

Returns:

result : Series or DataFrame, depending on the value of typ.

Examples

>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
                          index=['row 1', 'row 2'],
                          columns=['col 1', 'col 2'])
>>> print df
      col 1 col 2
row 1     a     b
row 2     c     d
>>> for orient in ['split', 'records', 'index']:
        str = df.to_json(orient=orient)
        print "'{}': '{}'".format(orient, str)
        pd.read_json(str, orient=orient)
'split':
'{"columns":["col 1","col 2"],"index":["row 1","row 2"],"data":[["a","b"],
["c","d"]]}'
'records':
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
'index':
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'

@jreback jreback added Docs IO JSON read_json, to_json, json_normalize labels Sep 23, 2016
@cswarth
Contributor Author

cswarth commented Sep 23, 2016

I don't think the failing CI checks are a consequence of the changes I propose in this PR, which are literally only changing python comments.

Is there anything I can do to get the PR a clean bill of health?

@jorisvandenbossche jorisvandenbossche changed the title DOC: Expand reference doc for panda.read_json() DOC: Expand reference doc for read_json Sep 23, 2016
Member

@jorisvandenbossche jorisvandenbossche left a comment


@cswarth Thanks a lot! Clearer docs are always welcome.
I left some small comments below.

@@ -123,32 +123,39 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
file. For file URLs, a host is expected. For instance, a local file
could be ``file://localhost/path/to/table.json``

orient
orient : string, indicating the expected format of the JSON input.
Member


Can you put the explanation on the next line? (but leave the type (so 'string') on this one)

orient
orient : string, indicating the expected format of the JSON input.
The set of allowed orients changes depending on the value
of the ``typ`` parameter.
Member


If we want to closely follow the numpy docstring standard, refering to other keywords would be with single backticks instead of double

strings produced by ``to_json()`` with a corresponding value
of ``orient``.

- ``'split'`` : dict like
Member


The extra indentation compared to the previous paragraph is not needed

>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
index=['row 1', 'row 2'],
columns=['col 1', 'col 2'])
>>> print df
Member


'print' is not needed

--------

>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
index=['row 1', 'row 2'],
Member


can you align this nicer?

>>> for orient in ['split', 'records', 'index']:
str = df.to_json(orient=orient)
print "'{}': '{}'".format(orient, str)
pd.read_json(str, orient=orient)
Member


I would just use separate lines instead of the for loop, I personally think this is going to be clearer for the reader

I mean like

>>> df.to_json(orient='split')
.. output ..

>>> df.to_json(orient='records')
.. output ..

....

Contributor Author


What would you think of the following examples? We're trying to document pd.read_json(), but df.to_json() is along for the ride as a convenient source of well-formatted JSON strings.

The results are a little artificial in that I had to reformat the output of df.to_json(orient='split') to avoid the flake8-imposed constraint on line length.

I also used `_` to retrieve previous results, but that syntax is not available in IPython when the prompt is '>>> ', as that indicates previous-result caching is turned off. I think using `_` makes the examples a lot easier to understand, but they won't work if pasted into %doctest_mode.

[screenshot of the example session omitted]

@codecov-io

codecov-io commented Sep 26, 2016

Current coverage is 85.25% (diff: 100%)

Merging #14284 into master will decrease coverage by <.01%

@@             master     #14284   diff @@
==========================================
  Files           140        140          
  Lines         50579      50579          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43123      43122     -1   
- Misses         7456       7457     +1   
  Partials          0          0          

Powered by Codecov. Last update 99b5876...4689d3a

@cswarth
Contributor Author

cswarth commented Sep 26, 2016

Preview of how the documentation looks after incorporating review comments.

Parameters:

path_or_buf : a valid JSON string or file-like, default: None

The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.json

meta_prefix : string, default None

orient : string,

Indication of expected JSON input format. The set of allowed orients changes depending on the value of the typ parameter.

  • when typ == 'series',
    • allowed orients are {'split','records','index'}
    • default is 'index'
    • The Series index must be unique for orient 'index'.
  • when typ == 'frame',
    • allowed orients are {'split','records','index', 'columns','values'}
    • default is 'columns'
    • The DataFrame index must be unique for orients 'index' and 'columns'.
    • The DataFrame columns must be unique for orients 'index', 'columns', and 'records'.

The value of orient specifies the expected format of the JSON string. The expected JSON formats are compatible with the strings produced by to_json() with a corresponding value of orient.

  • 'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
  • 'records' : list like [{column -> value}, ... , {column -> value}]
  • 'index' : dict like {index -> {column -> value}}
  • 'columns' : dict like {column -> {index -> value}}
  • 'values' : just the values array

typ : type of object to recover (series or frame), default ‘frame’

dtype : boolean or dict, default True

If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don't infer dtypes at all. Applies only to the data.

convert_axes : boolean, default True

Try to convert the axes to the proper dtypes.

convert_dates : boolean, default True

List of columns to parse for dates; if True, then try to parse datelike columns. A column label is datelike if

  • it ends with '_at',
  • it ends with '_time',
  • it begins with 'timestamp',
  • it is 'modified', or
  • it is 'date'

keep_default_dates : boolean, default True

If parsing dates, then parse the default datelike columns

numpy : boolean, default False

Direct decoding to numpy arrays. Supports numeric data only, but non-numeric column and index labels are supported. Note also that the JSON ordering MUST be the same for each term if numpy=True.

precise_float : boolean, default False

Set to enable usage of the higher-precision (strtod) function when decoding strings to double values. The default (False) is to use the fast but less precise built-in functionality.

date_unit : string, default None

The timestamp unit to detect if converting dates. The default behaviour is to try and detect the correct precision, but if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force parsing only seconds, milliseconds, microseconds or nanoseconds respectively.

lines : boolean, default False

Read the file as a json object per line.

New in version 0.19.0.

encoding : str, default is ‘utf-8’

The encoding to use to decode py3 bytes.

New in version 0.19.0.

Returns:

result : Series or DataFrame, depending on the value of typ.

Examples

>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                   index=['row 1', 'row 2'],
...                   columns=['col 1', 'col 2'])
>>> df.to_json(orient='split')
'{"columns":["col 1","col 2"],
  "index":["row 1","row 2"],
  "data":[["a","b"],["c","d"]]}'
>>> pd.read_json(_, orient='split')
      col 1 col 2
row 1     a     b
row 2     c     d
>>> df.to_json(orient='records')
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
>>> pd.read_json(_, orient='records')
  col 1 col 2
0     a     b
1     c     d
>>> df.to_json(orient='index')
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
>>> pd.read_json(_, orient='index')
      col 1 col 2
row 1     a     b
row 2     c     d

Member

@jorisvandenbossche jorisvandenbossche left a comment


Changes looking good! (left some small further comments)

No problem with adapting the output of to_json to satisfy flake8.

Maybe it would be nice to also have an example showing the use of typ? (but can also leave for other PR)

@@ -122,33 +122,42 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
The string could be a URL. Valid URL schemes include http, ftp, s3, and
file. For file URLs, a host is expected. For instance, a local file
could be ``file://localhost/path/to/table.json``
meta_prefix : string, default None
Member

@jorisvandenbossche jorisvandenbossche Sep 28, 2016


What is this ?

Contributor Author


copy-pasta error - removed

``'columns'``, and ``'records'``.


The value of `orient` specifies the expected format of the
Member


The two blank lines are not needed above this one (one blank line is OK).

But something else: would it make it more clear to first list the possibilities, and then which of those is the default/accepted value depending on the type? (just an idea)

'{"columns":["col 1","col 2"],
"index":["row 1","row 2"],
"data":[["a","b"],["c","d"]]}'
<BLANKLINE>
Member


I suppose this is to have a blank line in the resulting code block, but to keep it as one code block? (so it's clearer they belong together).
That's a good idea I think, only a pity for the plain text docstring ..

BTW, you can also put some 'introducing' text in between the code examples when this can make it clearer what you are showing. (and that can also help delineate the different examples)

@jorisvandenbossche
Member

@cswarth Do you have time to update this? It's a really nice improvement of the docstring!

@cswarth
Contributor Author

cswarth commented Oct 14, 2016

I'm mystified and could use some help to figure out what's going on. I pushed a commit to my branch to address your review, but this PR is not picking up the changes.

I can see the commit on the branch, but the commits link at the top of this page insists there are only two commits for this PR.

I can't figure out what I've screwed up here.

@jreback
Contributor

jreback commented Oct 14, 2016

@cswarth yeah, we changed the base GitHub org to pandas-dev, and it seems you can't push to existing PRs. So close this one and open a new PR.

@cswarth
Contributor Author

cswarth commented Oct 17, 2016

Closing to move PR to new github domain

Labels
Docs IO JSON read_json, to_json, json_normalize

4 participants