
ENH: Added to_json_schema #14904


Merged
merged 3 commits into pandas-dev:master from TomAugspurger:json-schema on Mar 4, 2017

Conversation

@TomAugspurger (Contributor) commented Dec 17, 2016

Lays the groundwork for (but doesn't close) #14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

Usage:

In [4]: df = pd.DataFrame(
   ...:     {'A': [1, 2, 3],
   ...:      'B': ['a', 'b', 'c'],
   ...:      'C': pd.date_range('2016-01-01', freq='d', periods=3),
   ...:      }, index=pd.Index(range(3), name='idx'))
   ...: df
   ...:
Out[4]:
     A  B          C
idx
0    1  a 2016-01-01
1    2  b 2016-01-02
2    3  c 2016-01-03

In [5]: pd.to_json_schema(df)
Out[5]:
{'fields': [{'name': 'idx', 'type': 'integer'},
  {'name': 'A', 'type': 'integer'},
  {'name': 'B', 'type': 'string'},
  {'name': 'C', 'type': 'date'}],
 'primary_key': 'idx'}

I think this is useful enough on its own to be part of the public API, so I've documented it as such.
I've included a placeholder publish_tableschema that will not be included in the final commit.
It's just to make @rgbkrk's life easier for prototyping the nteract frontend. I think the proper solution for publishing the schema + data will have to wait on ipython/ipython#10090

@TomAugspurger (Contributor, Author)

@pwalsh one data type question for you, what would a good JSON table schema type be for timedelta? I had thought duration, but IIUC that's not quite the same. A timedelta can always be converted to a total number of seconds, whereas a duration can't.

At the moment I don't attempt to distinguish between JSON Table Schema object or array types and any. Since pandas doesn't handle nested data all that well, this may not be a big deal. We can revisit in the future though.
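
For reference, a minimal sketch of the dtype-to-field-type mapping being discussed (the helper name and exact mapping are illustrative, not this PR's implementation):

import pandas as pd
from pandas.api.types import (is_integer_dtype, is_float_dtype, is_bool_dtype,
                              is_datetime64_any_dtype, is_timedelta64_dtype)

def field_type(dtype):
    # illustrative mapping from a pandas dtype to a Table Schema field type
    if is_integer_dtype(dtype):
        return 'integer'
    if is_float_dtype(dtype):
        return 'number'
    if is_bool_dtype(dtype):
        return 'boolean'
    if is_datetime64_any_dtype(dtype):
        return 'datetime'
    if is_timedelta64_dtype(dtype):
        return 'duration'  # the open question above
    return 'any'           # object/array-like columns fall through to 'any'

field_type(pd.Series([1.5]).dtype)  # 'number'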

@TomAugspurger added this to the 0.20.0 milestone Dec 17, 2016
@TomAugspurger added the Dtype Conversions and IO Data labels Dec 17, 2016
@TomAugspurger (Contributor, Author)

One more thing: this could also be the start for #9146, a roundtrip orient for JSON. The spec allows for additional properties at the "table" and "field" level, so we could have a pandas_dtype property on each field, and more information about the table:

{
        'type': 'DataFrame',
        'version': pd.__version__,
        'orient': 'records',
        'date_unit': 'ms'
}

@jreback (Contributor) commented Dec 17, 2016

isn't this just another json format orient?

or maybe need an argument schema=True in to_json()

@TomAugspurger (Contributor, Author)

isn't this just another json format orient?

yes and no. This is just the schema, not the values. And this returns a dict instead of a serialized string.

I've put the doc section in io.rst, since it's related to IO, while not actually writing the data.
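
To make the distinction concrete, a small sketch (using build_table_schema, the name the schema helper ultimately shipped under in pandas.io.json; the exact output keys are indicative):

import pandas as pd
from pandas.io.json import build_table_schema

df = pd.DataFrame({'A': [1, 2, 3]}, index=pd.Index(range(3), name='idx'))

df.to_json(orient='records')   # a serialized JSON *string* of the values
build_table_schema(df)         # a plain dict describing the columns,
                               # e.g. {'fields': [...], 'primaryKey': ['idx'], ...}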

@jreback (Contributor) commented Dec 17, 2016

this is very odd for a top level method

@rgbkrk (Contributor) commented Dec 18, 2016

This is great to see so we at least have standardized types for us to work with on the front-end.

It didn't occur to me that we would need to carve out the top level format for publishing both the schema and the data. I like the { schema: schema, data: data } approach so far, set with a mimetype for the table schema. I'd then want to push on the R kernel to publish the same format for R data frames.

The matching data format is orient='rows' with to_json right?

@TomAugspurger (Contributor, Author)

this is very odd for a top level method

Agreed. It probably belongs on DataFrames/Series, but I don't think it should have a to_* name, since that implies it's writing out the whole dataset, not just the schema.

The matching data format is orient='rows' with to_json right?

orient='records', with a .reset_index first if you want to include the index.
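
Spelled out, the pairing looks roughly like this (dates serialize as epoch milliseconds by default):

import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3],
     'B': ['a', 'b', 'c'],
     'C': pd.date_range('2016-01-01', freq='d', periods=3)},
    index=pd.Index(range(3), name='idx'))

# include the index as a regular column, then emit one JSON object per row
df.reset_index().to_json(orient='records')
# '[{"idx":0,"A":1,"B":"a","C":1451606400000}, ...]'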

@jreback (Contributor) commented Dec 18, 2016

I don't think this should be a method at all
just an option in to_json

@pwalsh commented Dec 19, 2016

@TomAugspurger I'd use duration for timedelta. I'm going to have to research it a bit more if you think there is some inconsistency here.

@rgbkrk (Contributor) commented Jan 11, 2017

Hey @holdenk - could we support the same schema + data output for Spark DataFrames?

@holdenk commented Jan 12, 2017

We probably could; in fact, the schema is already transferred between the JVM and PySpark using JSON, so we might be able to just normalize on that format for interchange inside of PySpark itself.

@gnestor commented Jan 13, 2017

I like @jreback's suggestion to add a schema argument to the to_json method. I also like @rgbkrk's suggestion to return both the schema and data as { schema: schema, data: data }. Lastly (I'm not sure if this should be implemented in pandas), we want to consume this as a mime bundle on the front-end, so the result should look something like:

{
    "application/tableschema+json": {
        "schema": schema,
        "data": data
    }
}
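
For illustration, a minimal sketch of publishing such a bundle from the kernel side (the mimetype string is the placeholder used elsewhere in this PR, not a settled standard, and the schema/data values are dummies):

from IPython.display import display

schema = {'fields': [{'name': 'idx', 'type': 'integer'},
                     {'name': 'A', 'type': 'integer'}],
          'primaryKey': ['idx']}
data = [{'idx': 0, 'A': 1}, {'idx': 1, 'A': 2}]

# raw=True tells IPython the dict is already a mimebundle keyed by mimetype
display({'application/vnd.tableschema.v1+json': {'schema': schema, 'data': data}},
        raw=True)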

@gnestor commented Jan 13, 2017

BTW, I have published a jupyterlab/notebook extension that will render JSON Table Schema: https://github.com/gnestor/jupyterlab_table

This is more of a WIP until some standards are in place (e.g. a mimetype for JSON Table Schema, pandas compatibility, etc.).

@pwalsh commented Jan 13, 2017

@TomAugspurger (Contributor, Author) commented Jan 13, 2017

I don't think this should be a method at all just an option in to_json

OK, coming around on this. One problem: right now to_json returns the serialized data as a str. Does anyone (@rgbkrk maybe?) know if there's a way to tell IPython.display.* that data has already been serialized?

import json
import IPython

# passing a python dict, which IPython serializes
IPython.display.display({"application/json": {"A": [1, 2, 3]}}, raw=True)

# Any way to do this?
IPython.display.display({"application/json": json.dumps({"A": [1, 2, 3]})}, raw=True)

or we could potentially include the mime-type in the already serialized data.

@rgbkrk (Contributor) commented Jan 13, 2017

@minrk ^^

usually I pass a direct dict to IPython.display.display with raw=True. I'm not sure how I'd pass something already encoded since this data would be part of an overall JSON object.

@rgbkrk (Contributor) commented Jan 13, 2017

Pardon me typing on mobile, I see now I'm repeating things already said.

@codecov-io commented Jan 14, 2017

Codecov Report

Merging #14904 into master will decrease coverage by 0.06%.
The diff coverage is 87.58%.

@@            Coverage Diff             @@
##           master   #14904      +/-   ##
==========================================
- Coverage   91.07%   91.01%   -0.06%     
==========================================
  Files         136      137       +1     
  Lines       49167    49228      +61     
==========================================
+ Hits        44777    44806      +29     
- Misses       4390     4422      +32
Impacted Files Coverage Δ
pandas/io/json/json.py 90.27% <100%> (+1.04%)
pandas/io/json/__init__.py 100% <100%> (ø)
pandas/core/config_init.py 95.12% <100%> (-0.16%)
pandas/util/testing.py 81.11% <13.33%> (-0.96%)
pandas/core/generic.py 96.25% <91.3%> (-0.07%)
pandas/io/json/table_schema.py 95.58% <95.58%> (ø)
pandas/io/gbq.py 25% <0%> (-58.34%)
pandas/computation/pytables.py 90.6% <0%> (-0.99%)
pandas/tools/merge.py 91.78% <0%> (-0.35%)
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7ae4fd1...9fac34c.

@@ -1060,5 +1060,5 @@ def publish_tableschema(data):
     """Temporary helper for testing w/ frontend"""
     from IPython.display import display
     mimetype = 'application/vnd.tableschema.v1+json'
-    payload = data.to_json(orient='jsontable_schema')
+    payload = data.to_json(orient='json_table_schema')

👍

"""
components = x.components
seconds = '{}.{:0>3}{:0>3}{:0>3}'.format(components.seconds,
components.milliseconds,

yeah let's do this in a separate PR
ideally also be able to parse this as well (open an issue if you don't do it in the same PR)
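
For context, a minimal sketch of the ISO 8601 duration output under discussion, using Timedelta.isoformat (the approach the commits below settle on; the exact string shown is indicative):

import pandas as pd

td = pd.Timedelta(days=1, minutes=5, seconds=3, milliseconds=10)
td.isoformat()  # something like 'P1DT0H5M3.010000000S'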

@rgbkrk (Contributor) commented Jan 15, 2017

@pwalsh what would the media type and structure be for the combined schema plus data:

{
  schema: {...jsonTableSchemaHere},
  data: [...rows],
}

@pwalsh commented Jan 15, 2017

@rgbkrk (Contributor) commented Jan 15, 2017

Would it end up like this then, based on that spec?

{
  "resources": [{
    "format": "json",
    "data": [...],
    "schema": "table-schema"
  }],
  "schemas": {
    "table-schema": // inline here?
  }
}

@pwalsh commented Jan 15, 2017

@rgbkrk

No - in this case, the top-level object is an object within your resources array, with the data inlined, and the schema inlined, exactly as #14904 (comment)

@@ -392,6 +392,9 @@ display.width 80 Width of the display in characters.
IPython qtconsole, or IDLE do not run in a
terminal and hence it is not possible
to correctly detect the width.
display.html.table_schema True Whether to publish a Table Schema

Did we set it to False by default?

.. versionadded:: 0.20.0

``DataFrame`` and ``Series`` will publish a Table Schema representation
by default. This can be disabled globally with the ``display.html.table_schema``

This section is also outdated (default False)
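
For anyone trying this out, the option that landed is off by default and can be flipped at runtime; a minimal sketch:

import pandas as pd

# when enabled, DataFrame/Series reprs also publish a Table Schema mimebundle
# that frontends such as nteract or JupyterLab can render
pd.set_option('display.html.table_schema', True)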

@@ -1151,14 +1190,55 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',

.. versionadded:: 0.19.0

.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/

Is this link used somewhere?

int64 integer
float64 number
bool boolean
datetime64[ns] date

date -> datetime
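
A quick check of that mapping against the orient this PR eventually settled on ('table'); output abridged and indicative:

import json
import pandas as pd

df = pd.DataFrame({'C': pd.date_range('2016-01-01', periods=2)})
schema = json.loads(df.to_json(orient='table'))['schema']
[f['type'] for f in schema['fields']]
# ['integer', 'datetime']  -- datetime64[ns] maps to 'datetime', not 'date'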

column names to designate as the primary key.
The default `None` will set `'primaryKey'` to the index
level or levels if the index is unique.
version : bool

, default True

Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

DOC: More notes in prose docs

Move files

use isoformat

updates

Moved to to_json

json_table

no config

refactor with classes

Added duration tests

more timedelta

Change default orient

Series test

fixup docs

JSON Table -> Table

doc

Change to table orient

added version

Handle Categorical

Many more tests
@rgbkrk (Contributor) commented Mar 4, 2017

Needs a rebase. I'm super excited about this, thank you so much.

@gnestor commented Mar 4, 2017

Me too! Thanks for all your effort @TomAugspurger!

@rgbkrk (Contributor) commented Mar 4, 2017

Played with this a bit locally, I'll renew my interest in making views based on this. 😄

@pwalsh commented Mar 4, 2017

hey all - I also want to say thanks for the effort here - we at Open Knowledge International are very excited to see this land.

@jorisvandenbossche merged commit 07ac39e into pandas-dev:master Mar 4, 2017
@jorisvandenbossche (Member)

@TomAugspurger Thanks a lot! 🎉

@TomAugspurger (Contributor, Author)

Thanks Joris! Was just rebasing without realizing you'd already done it and was very confused about the conflicts.

@jorisvandenbossche (Member)

Sorry, I just fixed the conflict using github before merging

@jreback (Contributor) commented Mar 4, 2017

thanks @TomAugspurger

@rgbkrk mentioned this pull request Mar 5, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
xref pandas-dev#14904

Author: Jeff Reback <[email protected]>

Closes pandas-dev#15322 from jreback/json and squashes the following commits:

0c2da60 [Jeff Reback] DOC: whatsnew update
fa3deef [Jeff Reback] CLN: reorg pandas/io/json to sub-dirs
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

Added publish to dataframe repr
@TomAugspurger deleted the json-schema branch April 5, 2017 02:07
Labels: Dtype Conversions, Enhancement, IO Data