
Commit 07ac39e

TomAugspurger authored and jorisvandenbossche committed
ENH: Added to_json_schema (#14904)
Lays the groundwork for #14386. This handles the schema part of the request there. We'll still need to do the work to publish the data to the frontend, but that can be done as a follow-up. Added publish to DataFrame repr.
1 parent 7ae4fd1 commit 07ac39e

15 files changed: +1072 -16 lines changed

ci/requirements-2.7.pip

+2
Original file line number | Diff line number | Diff line change
@@ -4,3 +4,5 @@ pathlib
44
backports.lzma
55
py
66
PyCrypto
7+
mock
8+
ipython

ci/requirements-3.5.run

+1
@@ -18,3 +18,4 @@ pymysql
1818
psycopg2
1919
s3fs
2020
beautifulsoup4
21+
ipython

ci/requirements-3.6.run

+1
@@ -18,3 +18,4 @@ pymysql
1818
beautifulsoup4
1919
s3fs
2020
xarray
21+
ipython

doc/source/api.rst

+1
@@ -60,6 +60,7 @@ JSON
6060
:toctree: generated/
6161

6262
json_normalize
63+
build_table_schema
6364

6465
.. currentmodule:: pandas
6566

doc/source/io.rst

+120
@@ -2033,6 +2033,126 @@ using Hadoop or Spark.
20332033
df
20342034
df.to_json(orient='records', lines=True)
20352035
2036+
2037+
.. _io.table_schema:
2038+
2039+
Table Schema
2040+
''''''''''''
2041+
2042+
.. versionadded:: 0.20.0
2043+
2044+
`Table Schema`_ is a spec for describing tabular datasets as a JSON
2045+
object. The JSON includes information on the field names, types, and
2046+
other attributes. You can use the orient ``table`` to build
2047+
a JSON string with two fields, ``schema`` and ``data``.
2048+
2049+
.. ipython:: python
2050+
2051+
df = pd.DataFrame(
2052+
{'A': [1, 2, 3],
2053+
'B': ['a', 'b', 'c'],
2054+
'C': pd.date_range('2016-01-01', freq='d', periods=3),
2055+
}, index=pd.Index(range(3), name='idx'))
2056+
df
2057+
df.to_json(orient='table', date_format="iso")
2058+
2059+
The ``schema`` field contains the ``fields`` key, which itself contains
2060+
a list of column name to type pairs, including the ``Index`` or ``MultiIndex``
2061+
(see below for a list of types).
2062+
The ``schema`` field also contains a ``primaryKey`` field if the (Multi)index
2063+
is unique.
2064+
2065+
The second field, ``data``, contains the serialized data with the ``records``
2066+
orient.
2067+
The index is included, and any datetimes are ISO 8601 formatted, as required
2068+
by the Table Schema spec.
2069+
2070+
The full list of types supported are described in the Table Schema
2071+
spec. This table shows the mapping from pandas types:
2072+
2073+
=============== =================
2074+
Pandas type     Table Schema type
2075+
=============== =================
2076+
int64           integer
2077+
float64         number
2078+
bool            boolean
2079+
datetime64[ns]  datetime
2080+
timedelta64[ns] duration
2081+
categorical     any
2082+
object          string
2083+
=============== =================
2084+
2085+
A few notes on the generated table schema:
2086+
2087+
- The ``schema`` object contains a ``pandas_version`` field. This contains
2088+
the version of pandas' dialect of the schema, and will be incremented
2089+
with each revision.
2090+
- All dates are converted to UTC when serializing, even timezone-naive values,
2091+
which are treated as UTC with an offset of 0.
2092+
2093+
.. ipython:: python
2094+
2095+
from pandas.io.json import build_table_schema
2096+
s = pd.Series(pd.date_range('2016', periods=4))
2097+
build_table_schema(s)
2098+
2099+
- Datetimes with a timezone (before serializing) include an additional field
2100+
``tz`` with the time zone name (e.g. ``'US/Central'``).
2101+
2102+
.. ipython:: python
2103+
2104+
s_tz = pd.Series(pd.date_range('2016', periods=12,
2105+
tz='US/Central'))
2106+
build_table_schema(s_tz)
2107+
2108+
- Periods are converted to timestamps before serialization, and so have the
2109+
same behavior of being converted to UTC. In addition, periods will contain
2110+
an additional field ``freq`` with the period's frequency, e.g. ``'A-DEC'``.
2111+
2112+
.. ipython:: python
2113+
2114+
s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',
2115+
periods=4))
2116+
build_table_schema(s_per)
2117+
2118+
- Categoricals use the ``any`` type and an ``enum`` constraint listing
2119+
the set of possible values. Additionally, an ``ordered`` field is included.
2120+
2121+
.. ipython:: python
2122+
2123+
s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))
2124+
build_table_schema(s_cat)
2125+
2126+
- A ``primaryKey`` field, containing an array of labels, is included
2127+
*if the index is unique*:
2128+
2129+
.. ipython:: python
2130+
2131+
s_dupe = pd.Series([1, 2], index=[1, 1])
2132+
build_table_schema(s_dupe)
2133+
2134+
- The ``primaryKey`` behavior is the same with MultiIndexes, but in this
2135+
case the ``primaryKey`` is an array:
2136+
2137+
.. ipython:: python
2138+
2139+
s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),
2140+
(0, 1)]))
2141+
build_table_schema(s_multi)
2142+
2143+
- The default naming roughly follows these rules:
2144+
2145+
+ For ``Series``, the ``object.name`` is used. If that's ``None``, then the
2146+
name is ``values``
2147+
+ For DataFrames, the stringified version of the column name is used
2148+
+ For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
2149+
fallback to ``index`` if that is None.
2150+
+ For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
2151+
then ``level_<i>`` is used.
2152+
2153+
2154+
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
2155+
20362156
HTML
20372157
----
20382158
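Reviewer note on the `doc/source/io.rst` hunk: the pandas-to-Table-Schema type mapping documented above can be read as a plain lookup. The sketch below is only an illustration of that table. The helper name `table_schema_type`, the use of the dtype name `category` for categoricals, and the fallback to `any` for unlisted dtypes are assumptions made here, not the code this commit adds in `pandas/io/json/table_schema.py`. The `object` entry follows the `to_json` docstring example in this commit, which reports `"type": "string"`.

```python
# Illustrative only: mirrors the documented mapping table, not pandas internals.
_DTYPE_TO_SCHEMA_TYPE = {
    "int64": "integer",
    "float64": "number",
    "bool": "boolean",
    "datetime64[ns]": "datetime",
    "timedelta64[ns]": "duration",
    "category": "any",
    "object": "string",
}


def table_schema_type(dtype_name):
    """Return the Table Schema type for a pandas dtype name (hypothetical helper)."""
    # Assumption: dtypes missing from the table fall back to the permissive "any".
    return _DTYPE_TO_SCHEMA_TYPE.get(dtype_name, "any")


print(table_schema_type("int64"))
print(table_schema_type("timedelta64[ns]"))
```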

doc/source/options.rst

+21
@@ -397,6 +397,9 @@ display.width 80 Width of the display in charact
397397
IPython qtconsole, or IDLE do not run in a
398398
terminal and hence it is not possible
399399
to correctly detect the width.
400+
display.html.table_schema False Whether to publish a Table Schema
401+
representation for frontends that
402+
support it.
400403
html.border 1 A ``border=value`` attribute is
401404
inserted in the ``<table>`` tag
402405
for the DataFrame HTML repr.
@@ -424,6 +427,7 @@ mode.use_inf_as_null False True means treat None, NaN, -IN
424427
are not null (new way).
425428
=================================== ============ ==================================
426429

430+
427431
.. _basics.console_output:
428432

429433
Number Formatting
@@ -512,3 +516,20 @@ Enabling ``display.unicode.ambiguous_as_wide`` lets pandas to figure these chara
512516
513517
pd.set_option('display.unicode.east_asian_width', False)
514518
pd.set_option('display.unicode.ambiguous_as_wide', False)
519+
520+
.. _options.table_schema:
521+
522+
Table Schema Display
523+
--------------------
524+
525+
.. versionadded:: 0.20.0
526+
527+
``DataFrame`` and ``Series`` can publish a Table Schema representation
528+
of themselves. This is disabled by default; it can be enabled globally with the
529+
``display.html.table_schema`` option:
530+
531+
.. ipython:: python
532+
533+
pd.set_option('display.html.table_schema', True)
534+
535+
Only the first ``display.max_rows`` rows are serialized and published.
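Reviewer note on the `doc/source/options.rst` hunk: besides the global `pd.set_option` call shown above, the option can probably be scoped to a block with `pd.option_context`, which restores the previous value on exit. This is a minimal sketch that assumes a pandas version containing this commit (0.20.0 or later) and a session where the option still has its default value.

```python
import pandas as pd

# Enable the option only inside the block; the previous value is restored on exit.
with pd.option_context("display.html.table_schema", True):
    enabled_inside = pd.get_option("display.html.table_schema")

# Outside the block the option is back to whatever it was before (False by default).
enabled_outside = pd.get_option("display.html.table_schema")
print(enabled_inside, enabled_outside)
```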

doc/source/whatsnew/v0.20.0.txt

+35
@@ -12,6 +12,7 @@ Highlights include:
1212
- Building pandas for development now requires ``cython >= 0.23`` (:issue:`14831`)
1313
- The ``.ix`` indexer has been deprecated, see :ref:`here <whatsnew_0200.api_breaking.deprecate_ix>`
1414
- Switched the test framework to `pytest`_ (:issue:`13097`)
15+
- A new orient for JSON serialization, ``orient='table'``, that uses the Table Schema spec, see :ref:`here <whatsnew_0200.enhancements.table_schema>`
1516

1617
.. _pytest: http://doc.pytest.org/en/latest/
1718

@@ -154,6 +155,40 @@ New Behavior:
154155

155156
df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()
156157

158+
.. _whatsnew_0200.enhancements.table_schema:
159+
160+
Table Schema Output
161+
^^^^^^^^^^^^^^^^^^^
162+
163+
The new orient ``'table'`` for :meth:`DataFrame.to_json`
164+
will generate a `Table Schema`_ compatible string representation of
165+
the data.
166+
167+
.. ipython:: python
168+
169+
df = pd.DataFrame(
170+
{'A': [1, 2, 3],
171+
'B': ['a', 'b', 'c'],
172+
'C': pd.date_range('2016-01-01', freq='d', periods=3),
173+
}, index=pd.Index(range(3), name='idx'))
174+
df
175+
df.to_json(orient='table')
176+
177+
178+
See :ref:`IO: Table Schema <io.table_schema>` for more.
179+
180+
Additionally, the repr for ``DataFrame`` and ``Series`` can now publish
181+
this JSON Table schema representation of the Series or DataFrame if you are
182+
using IPython (or another frontend like `nteract`_ using the Jupyter messaging
183+
protocol).
184+
This gives frontends like the Jupyter notebook and `nteract`_
185+
more flexibility in how they display pandas objects, since they have
186+
more information about the data.
187+
You must enable this by setting the ``display.html.table_schema`` option to True.
188+
189+
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
190+
.. _nteract: http://nteract.io/
191+
157192
.. _whatsnew_0200.enhancements.other:
158193

159194
Other enhancements
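Reviewer note on the whatsnew hunk: to make the "more information about the data" point concrete, here is a sketch of how a consumer might read an `orient='table'` payload using only the standard library. The payload is hand-written to resemble the output described in this commit; it is not captured from pandas, and the list form of `primaryKey` is an assumption based on the "array of labels" wording in `io.rst`.

```python
import json

# Hand-written payload shaped like the ``orient='table'`` output (assumption).
payload = json.loads("""
{"schema": {"fields": [{"name": "idx", "type": "integer"},
                       {"name": "A", "type": "integer"},
                       {"name": "B", "type": "string"}],
            "primaryKey": ["idx"]},
 "data": [{"idx": 0, "A": 1, "B": "a"},
          {"idx": 1, "A": 2, "B": "b"}]}
""")

# Column order is taken from the schema rather than from the row objects.
columns = [field["name"] for field in payload["schema"]["fields"]]
rows = [[record[name] for name in columns] for record in payload["data"]]

# A declared primary key should identify each row uniquely.
key = payload["schema"].get("primaryKey", [])
keys_seen = {tuple(record[name] for name in key) for record in payload["data"]}
key_is_unique = len(keys_seen) == len(payload["data"])

print(columns)
print(rows)
print(key_is_unique)
```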

pandas/core/config_init.py

+10
@@ -164,6 +164,13 @@
164164
(default: False)
165165
"""
166166

167+
pc_table_schema_doc = """
168+
: boolean
169+
Whether to publish a Table Schema representation for frontends
170+
that support it.
171+
(default: False)
172+
"""
173+
167174
pc_line_width_deprecation_warning = """\
168175
line_width has been deprecated, use display.width instead (currently both are
169176
identical)
@@ -366,6 +373,9 @@ def mpl_style_cb(key):
366373
validator=is_text)
367374
cf.register_option('latex.multirow', False, pc_latex_multirow,
368375
validator=is_bool)
376+
cf.register_option('html.table_schema', False, pc_table_schema_doc,
377+
validator=is_bool)
378+
369379

370380
cf.deprecate_option('display.line_width',
371381
msg=pc_line_width_deprecation_warning,

pandas/core/generic.py

+82-4
@@ -4,6 +4,7 @@
44
import operator
55
import weakref
66
import gc
7+
import json
78

89
import numpy as np
910
import pandas.lib as lib
@@ -129,6 +130,37 @@ def __init__(self, data, axes=None, copy=False, dtype=None,
129130
object.__setattr__(self, '_data', data)
130131
object.__setattr__(self, '_item_cache', {})
131132

133+
def _ipython_display_(self):
134+
try:
135+
from IPython.display import display
136+
except ImportError:
137+
return None
138+
139+
# Series doesn't define _repr_html_ or _repr_latex_
140+
latex = self._repr_latex_() if hasattr(self, '_repr_latex_') else None
141+
html = self._repr_html_() if hasattr(self, '_repr_html_') else None
142+
table_schema = self._repr_table_schema_()
143+
# We need the initial newline since we aren't going through the
144+
# usual __repr__. See
145+
# https://github.com/pandas-dev/pandas/pull/14904#issuecomment-277829277
146+
text = "\n" + repr(self)
147+
148+
reprs = {"text/plain": text, "text/html": html, "text/latex": latex,
149+
"application/vnd.dataresource+json": table_schema}
150+
reprs = {k: v for k, v in reprs.items() if v}
151+
display(reprs, raw=True)
152+
153+
def _repr_table_schema_(self):
154+
"""
155+
Not a real Jupyter special repr method, but we use the same
156+
naming convention.
157+
"""
158+
if config.get_option("display.html.table_schema"):
159+
data = self.head(config.get_option('display.max_rows'))
160+
payload = json.loads(data.to_json(orient='table'),
161+
object_pairs_hook=collections.OrderedDict)
162+
return payload
163+
132164
def _validate_dtype(self, dtype):
133165
""" validate the passed dtype """
134166
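Reviewer note on the `_ipython_display_` hunk: the method builds one dictionary keyed by MIME type and then drops every entry whose value is falsy, so a `Series` (no HTML or LaTeX repr) or a disabled Table Schema option simply yields a smaller bundle. The standalone sketch below reproduces only that filtering step; `build_mimebundle` is a hypothetical name and nothing here calls IPython.

```python
def build_mimebundle(text, html=None, latex=None, table_schema=None):
    """Hypothetical helper: collect reprs and drop the ones that are unavailable."""
    reprs = {
        "text/plain": text,
        "text/html": html,
        "text/latex": latex,
        "application/vnd.dataresource+json": table_schema,
    }
    # Same filter as in the hunk: falsy values (None, "", {}) are dropped.
    return {mime: value for mime, value in reprs.items() if value}


# A Series-like case: only the plain-text repr is available.
bundle = build_mimebundle("\n0    1\ndtype: int64", table_schema=None)
print(sorted(bundle))
```

One consequence of filtering on truthiness rather than on `is not None` is that an empty payload would also be dropped, which seems to be the intended behavior here.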

@@ -1094,7 +1126,7 @@ def __setstate__(self, state):
10941126
strings before writing.
10951127
"""
10961128

1097-
def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
1129+
def to_json(self, path_or_buf=None, orient=None, date_format=None,
10981130
double_precision=10, force_ascii=True, date_unit='ms',
10991131
default_handler=None, lines=False):
11001132
"""
@@ -1129,10 +1161,17 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
11291161
- index : dict like {index -> {column -> value}}
11301162
- columns : dict like {column -> {index -> value}}
11311163
- values : just the values array
1164+
- table : dict like {'schema': {schema}, 'data': {data}}
1165+
describing the data, and the data component is
1166+
like ``orient='records'``.
11321167
1133-
date_format : {'epoch', 'iso'}
1168+
.. versionchanged:: 0.20.0
1169+
1170+
date_format : {None, 'epoch', 'iso'}
11341171
Type of date conversion. `epoch` = epoch milliseconds,
1135-
`iso`` = ISO8601, default is epoch.
1172+
`iso` = ISO8601. The default depends on the `orient`. For
1173+
`orient='table'`, the default is `'iso'`. For all other orients,
1174+
the default is `'epoch'`.
11361175
double_precision : The number of decimal places to use when encoding
11371176
floating point values, default 10.
11381177
force_ascii : force encoded string to be ASCII, default True.
@@ -1151,14 +1190,53 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
11511190
11521191
.. versionadded:: 0.19.0
11531192
1154-
11551193
Returns
11561194
-------
11571195
same type as input object with filtered info axis
11581196
1197+
See Also
1198+
--------
1199+
pd.read_json
1200+
1201+
Examples
1202+
--------
1203+
1204+
>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
1205+
... index=['row 1', 'row 2'],
1206+
... columns=['col 1', 'col 2'])
1207+
>>> df.to_json(orient='split')
1208+
'{"columns":["col 1","col 2"],
1209+
"index":["row 1","row 2"],
1210+
"data":[["a","b"],["c","d"]]}'
1211+
1212+
Encoding/decoding a Dataframe using ``'index'`` formatted JSON:
1213+
1214+
>>> df.to_json(orient='index')
1215+
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
1216+
1217+
Encoding/decoding a Dataframe using ``'records'`` formatted JSON.
1218+
Note that index labels are not preserved with this encoding.
1219+
1220+
>>> df.to_json(orient='records')
1221+
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
1222+
1223+
Encoding with Table Schema
1224+
1225+
>>> df.to_json(orient='table')
1226+
'{"schema": {"fields": [{"name": "index", "type": "string"},
1227+
{"name": "col 1", "type": "string"},
1228+
{"name": "col 2", "type": "string"}],
1229+
"primaryKey": ["index"],
1230+
"pandas_version": "0.20.0"},
1231+
"data": [{"index": "row 1", "col 1": "a", "col 2": "b"},
1232+
{"index": "row 2", "col 1": "c", "col 2": "d"}]}'
11591233
"""
11601234

11611235
from pandas.io import json
1236+
if date_format is None and orient == 'table':
1237+
date_format = 'iso'
1238+
elif date_format is None:
1239+
date_format = 'epoch'
11621240
return json.to_json(path_or_buf=path_or_buf, obj=self, orient=orient,
11631241
date_format=date_format,
11641242
double_precision=double_precision,
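Reviewer note on the `to_json` hunk: changing the signature default from `'epoch'` to `None` lets the method tell "not specified" apart from an explicit `'epoch'`, so that `orient='table'` can default to ISO 8601 without overriding the caller. The function below is a standalone restatement of that branch for illustration; `resolve_date_format` is a hypothetical name, not part of pandas.

```python
def resolve_date_format(orient=None, date_format=None):
    """Hypothetical standalone version of the defaulting logic in the hunk above."""
    if date_format is not None:
        # An explicit choice always wins, including 'epoch' with orient='table'.
        return date_format
    # Unspecified: the table orient defaults to 'iso', every other orient to 'epoch'.
    return "iso" if orient == "table" else "epoch"


print(resolve_date_format(orient="table"))
print(resolve_date_format(orient="records"))
print(resolve_date_format(orient="table", date_format="epoch"))
```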

pandas/io/json/__init__.py

+2-1
@@ -1,4 +1,5 @@
11
from .json import to_json, read_json, loads, dumps # noqa
22
from .normalize import json_normalize # noqa
3+
from .table_schema import build_table_schema # noqa
34

4-
del json, normalize # noqa
5+
del json, normalize, table_schema # noqa
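Reviewer note on the `pandas/io/json/__init__.py` hunk: with `build_table_schema` exported here, the `primaryKey` rule described in `io.rst` can be checked directly. This sketch assumes a pandas version that includes this commit; the index name `idx` and the variable names are illustrative.

```python
import pandas as pd
from pandas.io.json import build_table_schema

# Same values, one with a unique index and one with a duplicated index.
unique = pd.Series([1, 2], index=pd.Index([0, 1], name="idx"))
dupes = pd.Series([1, 2], index=pd.Index([1, 1], name="idx"))

schema_unique = build_table_schema(unique)
schema_dupes = build_table_schema(dupes)

# The key is only expected when the index is unique.
print(schema_unique.get("primaryKey"))
print("primaryKey" in schema_dupes)
```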
