
ENH: Added to_json_schema #14904


Merged
merged 3 commits into pandas-dev:master from TomAugspurger:json-schema on Mar 4, 2017

Conversation

@TomAugspurger (Contributor) commented Dec 17, 2016

Lays the groundwork for (but doesn't close) #14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

Usage:

In [4]: df = pd.DataFrame(
   ...:     {'A': [1, 2, 3],
   ...:      'B': ['a', 'b', 'c'],
   ...:      'C': pd.date_range('2016-01-01', freq='d', periods=3),
   ...:      }, index=pd.Index(range(3), name='idx'))
   ...: df
   ...:
Out[4]:
     A  B          C
idx
0    1  a 2016-01-01
1    2  b 2016-01-02
2    3  c 2016-01-03

In [5]: pd.to_json_schema(df)
Out[5]:
{'fields': [{'name': 'idx', 'type': 'integer'},
  {'name': 'A', 'type': 'integer'},
  {'name': 'B', 'type': 'string'},
  {'name': 'C', 'type': 'date'}],
 'primary_key': 'idx'}

I think this is useful enough on its own to be part of the public API, so I've documented it as such.
I've included a placeholder publish_tableschema that will not be included in the final commit.
It's just to make @rgbkrk's life easier for prototyping the nteract frontend. I think the proper solution for publishing the schema + data will have to wait on ipython/ipython#10090

@TomAugspurger (Contributor, Author)

@pwalsh one data type question for you, what would a good JSON table schema type be for timedelta? I had thought duration, but IIUC that's not quite the same. A timedelta can always be converted to a total number of seconds, whereas a duration can't.

At the moment I don't attempt to distinguish between JSON Table Schema object or array types and any. Since pandas doesn't handle nested data all that well, this may not be a big deal. We can revisit in the future though.
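
For reference, a minimal sketch of the dtype-to-field-type mapping being discussed (the helper name and exact mapping are illustrative, not this PR's implementation):

import pandas as pd
from pandas.api.types import (is_integer_dtype, is_float_dtype, is_bool_dtype,
                              is_datetime64_any_dtype, is_timedelta64_dtype)

def field_type(dtype):
    # illustrative mapping from a pandas dtype to a Table Schema field type
    if is_integer_dtype(dtype):
        return 'integer'
    if is_float_dtype(dtype):
        return 'number'
    if is_bool_dtype(dtype):
        return 'boolean'
    if is_datetime64_any_dtype(dtype):
        return 'datetime'
    if is_timedelta64_dtype(dtype):
        return 'duration'  # the open question above
    return 'any'           # object/array-like columns fall through to 'any'

field_type(pd.Series([1.5]).dtype)  # 'number'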

@TomAugspurger added this to the 0.20.0 milestone Dec 17, 2016
@TomAugspurger added the Dtype Conversions and IO Data labels Dec 17, 2016
@TomAugspurger (Contributor, Author)

One more thing: this could also be the start for #9146, a roundtrip orient for JSON. The spec allows for additional properties at the "table" and "field" level, so we could have a pandas_dtype property on each field, and more information about the table:

{
        'type': 'DataFrame',
        'version': pd.__version__,
        'orient': 'records',
        'date_unit': 'ms'
}

@jreback (Contributor) commented Dec 17, 2016

isn't this just another json format orient?

or maybe need an argument schema=True in to_json()

@TomAugspurger (Contributor, Author)

isn't this just another json format orient?

yes and no. This is just the schema, not the values. And this returns a dict instead of a serialized string.

I've put the doc section in io.rst, since it's related to IO, while not actually writing the data.
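
To make the distinction concrete, a small sketch (using build_table_schema, the name the schema helper ultimately shipped under in pandas.io.json; the exact output keys are indicative):

import pandas as pd
from pandas.io.json import build_table_schema

df = pd.DataFrame({'A': [1, 2, 3]}, index=pd.Index(range(3), name='idx'))

df.to_json(orient='records')   # a serialized JSON *string* of the values
build_table_schema(df)         # a plain dict describing the columns,
                               # e.g. {'fields': [...], 'primaryKey': ['idx'], ...}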

@jreback (Contributor) commented Dec 17, 2016

this is very odd for a top level method

@rgbkrk (Contributor) commented Dec 18, 2016

This is great to see so we at least have standardized types for us to work with on the front-end.

It didn't occur to me that we would need to carve out the top level format for publishing both the schema and the data. I like the { schema: schema, data: data } approach so far, set with a mimetype for the table schema. I'd then want to push on the R kernel to publish the same format for R data frames.

The matching data format is orient='rows' with to_json right?

@TomAugspurger (Contributor, Author)

this is very odd for a top level method

Agreed. It probably belongs on DataFrames/Series, but I don't think it should have a to_* name, since that implies it's writing out the whole dataset, not just the schema.

The matching data format is orient='rows' with to_json right?

orient='records', with a .reset_index first if you want to include the index.
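
Spelled out, the pairing looks roughly like this (dates serialize as epoch milliseconds by default):

import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3],
     'B': ['a', 'b', 'c'],
     'C': pd.date_range('2016-01-01', freq='d', periods=3)},
    index=pd.Index(range(3), name='idx'))

# include the index as a regular column, then emit one JSON object per row
df.reset_index().to_json(orient='records')
# '[{"idx":0,"A":1,"B":"a","C":1451606400000}, ...]'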

@jreback (Contributor) commented Dec 18, 2016

I don't think this should be a method at all
just an option in to_json

@pwalsh commented Dec 19, 2016

@TomAugspurger I'd use duration for timedelta. I'm going to have to research it a bit more if you think there is some inconsistency here.

@rgbkrk (Contributor) commented Jan 11, 2017

Hey @holdenk - could we support the same schema + data output for Spark DataFrames?

@holdenk commented Jan 12, 2017

We probably could; in fact, the schema is already transferred between the JVM and PySpark using JSON, so we might be able to just normalize on that format for interchange inside of PySpark itself.

@gnestor commented Jan 13, 2017

I like @jreback's suggestion to add a schema argument to the to_json method. I also like @rgbkrk's suggestion to return both the schema and data as { schema: schema, data: data }. Lastly (I'm not sure if this should be implemented in pandas), we want to consume this as a mime bundle on the front-end, so the result should look something like:

{
    "application/tableschema+json": {
        "schema": schema,
        "data": data
    }
}
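
For illustration, a minimal sketch of publishing such a bundle from the kernel side (the mimetype string is the placeholder used elsewhere in this PR, not a settled standard, and the schema/data values are dummies):

from IPython.display import display

schema = {'fields': [{'name': 'idx', 'type': 'integer'},
                     {'name': 'A', 'type': 'integer'}],
          'primaryKey': ['idx']}
data = [{'idx': 0, 'A': 1}, {'idx': 1, 'A': 2}]

# raw=True tells IPython the dict is already a mimebundle keyed by mimetype
display({'application/vnd.tableschema.v1+json': {'schema': schema, 'data': data}},
        raw=True)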

@gnestor commented Jan 13, 2017

BTW, I have published a jupyterlab/notebook extension that will render JSON Table Schema: https://github.com/gnestor/jupyterlab_table

This is more of a WIP until some standards are in place (e.g. a mimetype for JSON Table Schema, pandas compatibility, etc.).

@pwalsh commented Jan 13, 2017

@TomAugspurger (Contributor, Author) commented Jan 13, 2017

I don't think this should be a method at all just an option in to_json

OK, coming around on this. One problem: right now to_json returns the serialized data as a str. Does anyone (@rgbkrk maybe?) know if there's a way to tell IPython.display.* that data has already been serialized?

import json
import IPython

# passing a python dict, which IPython serializes
IPython.display.display({"application/json": {"A": [1, 2, 3]}}, raw=True)

# Any way to do this?
IPython.display.display({"application/json": json.dumps({"A": [1, 2, 3]})}, raw=True)

or we could potentially include the mime-type in the already serialized data.

@rgbkrk (Contributor) commented Jan 13, 2017

@minrk ^^

usually I pass a direct dict to IPython.display.display with raw=True. I'm not sure how I'd pass something already encoded since this data would be part of an overall JSON object.

@rgbkrk (Contributor) commented Jan 13, 2017

Pardon me typing on mobile, I see now I'm repeating things already said.

@codecov-io commented Jan 14, 2017

Codecov Report

Merging #14904 into master will decrease coverage by 0.06%.
The diff coverage is 87.58%.

@@            Coverage Diff             @@
##           master   #14904      +/-   ##
==========================================
- Coverage   91.07%   91.01%   -0.06%     
==========================================
  Files         136      137       +1     
  Lines       49167    49228      +61     
==========================================
+ Hits        44777    44806      +29     
- Misses       4390     4422      +32
Impacted Files Coverage Δ
pandas/io/json/json.py 90.27% <100%> (+1.04%)
pandas/io/json/__init__.py 100% <100%> (ø)
pandas/core/config_init.py 95.12% <100%> (-0.16%)
pandas/util/testing.py 81.11% <13.33%> (-0.96%)
pandas/core/generic.py 96.25% <91.3%> (-0.07%)
pandas/io/json/table_schema.py 95.58% <95.58%> (ø)
pandas/io/gbq.py 25% <0%> (-58.34%)
pandas/computation/pytables.py 90.6% <0%> (-0.99%)
pandas/tools/merge.py 91.78% <0%> (-0.35%)
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7ae4fd1...9fac34c.

@@ -1060,5 +1060,5 @@ def publish_tableschema(data):
     """Temporary helper for testing w/ frontend"""
     from IPython.display import display
     mimetype = 'application/vnd.tableschema.v1+json'
-    payload = data.to_json(orient='jsontable_schema')
+    payload = data.to_json(orient='json_table_schema')

👍

"""
components = x.components
seconds = '{}.{:0>3}{:0>3}{:0>3}'.format(components.seconds,
components.milliseconds,

yeah let's do this in a separate PR
ideally also be able to parse this as well (open an issue if you don't do it in the same PR)
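
For context, a minimal sketch of the ISO 8601 duration output under discussion, using Timedelta.isoformat (the approach the commits below settle on; the exact string shown is indicative):

import pandas as pd

td = pd.Timedelta(days=1, minutes=5, seconds=3, milliseconds=10)
td.isoformat()  # something like 'P1DT0H5M3.010000000S'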

@rgbkrk (Contributor) commented Jan 15, 2017

@pwalsh what would the media type and structure be for the combined schema plus data:

{
  schema: {...jsonTableSchemaHere},
  data: [...rows],
}

@pwalsh commented Jan 15, 2017

@rgbkrk (Contributor) commented Jan 15, 2017

Would it end up like this then, based on that spec?

{
  "resources": [{
    "format": "json",
    "data": [...],
    "schema": "table-schema"
  }],
  "schemas": {
    "table-schema": // inline here?
  }
}

@pwalsh commented Jan 15, 2017

@rgbkrk

No - in this case, the top-level object is an object within your resources array, with the data inlined, and the schema inlined, exactly as #14904 (comment)

@@ -392,6 +392,9 @@ display.width 80 Width of the display in characters.
IPython qtconsole, or IDLE do not run in a
terminal and hence it is not possible
to correctly detect the width.
display.html.table_schema True Whether to publish a Table Schema

Did we set it to False by default?

.. versionadded:: 0.20.0

``DataFrame`` and ``Series`` will publish a Table Schema representation
by default. This can be disabled globally with the ``display.html.table_schema``

This section is also outdated (default False)
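
For anyone trying this out, the option that landed is off by default and can be flipped at runtime; a minimal sketch:

import pandas as pd

# when enabled, DataFrame/Series reprs also publish a Table Schema mimebundle
# that frontends such as nteract or JupyterLab can render
pd.set_option('display.html.table_schema', True)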

@@ -1151,14 +1190,55 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',

.. versionadded:: 0.19.0

.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/

Is this link used somewhere?

int64 integer
float64 number
bool boolean
datetime64[ns] date

date -> datetime
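
A quick check of that mapping against the orient this PR eventually settled on ('table'); output abridged and indicative:

import json
import pandas as pd

df = pd.DataFrame({'C': pd.date_range('2016-01-01', periods=2)})
schema = json.loads(df.to_json(orient='table'))['schema']
[f['type'] for f in schema['fields']]
# ['integer', 'datetime']  -- datetime64[ns] maps to 'datetime', not 'date'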

column names to designate as the primary key.
The default `None` will set `'primaryKey'` to the index
level or levels if the index is unique.
version : bool

, default True

Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

DOC: More notes in prose docs

Move files

use isoformat

updates

Moved to to_json

json_table

no config

refactor with classes

Added duration tests

more timedelta

Change default orient

Series test

fixup docs

JSON Table -> Table

doc

Change to table orient

added version

Handle Categorical

Many more tests
@rgbkrk (Contributor) commented Mar 4, 2017

Needs a rebase. I'm super excited about this, thank you so much.

@gnestor commented Mar 4, 2017

Me too! Thanks for all your effort @TomAugspurger!

@rgbkrk (Contributor) commented Mar 4, 2017

Played with this a bit locally, I'll renew my interest in making views based on this. 😄

@pwalsh commented Mar 4, 2017

hey all - I also want to say thanks for the effort here - we at Open Knowledge International are very excited to see this land.

@jorisvandenbossche merged commit 07ac39e into pandas-dev:master Mar 4, 2017
@jorisvandenbossche (Member)

@TomAugspurger Thanks a lot! 🎉

@TomAugspurger (Contributor, Author)

Thanks Joris! Was just rebasing without realizing you'd already done it and was very confused about the conflicts.

@jorisvandenbossche (Member)

Sorry, I just fixed the conflict using github before merging

@jreback (Contributor) commented Mar 4, 2017

thanks @TomAugspurger

@rgbkrk mentioned this pull request Mar 5, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
xref pandas-dev#14904

Author: Jeff Reback <[email protected]>

Closes pandas-dev#15322 from jreback/json and squashes the following commits:

0c2da60 [Jeff Reback] DOC: whatsnew update
fa3deef [Jeff Reback] CLN: reorg pandas/io/json to sub-dirs
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

Added publish to dataframe repr
@TomAugspurger deleted the json-schema branch April 5, 2017 02:07
Labels: Dtype Conversions, Enhancement, IO Data