Commit 572e254

ENH: Added to_json_schema
Lays the groundwork for pandas-dev#14386. This handles the schema part of the request there. We'll still need to do the work to publish the data to the frontend, but that can be done as a followup.

DOC: More notes in prose docs Move files use isoformat updates Moved to to_json json_table no config refactor with classes Added duration tests more timedelta Change default orient Series test fixup docs JSON Table -> Table doc Change to table orient added version Handle Categorical Many more tests
1 parent 34cdfa4 commit 572e254

9 files changed

+912
-17
lines changed

doc/source/api.rst

+1
@@ -60,6 +60,7 @@ JSON
6060
:toctree: generated/
6161

6262
json_normalize
63+
build_table_schema
6364

6465
.. currentmodule:: pandas
6566

doc/source/io.rst

+119
@@ -2033,6 +2033,125 @@ using Hadoop or Spark.
20332033
df
20342034
df.to_json(orient='records', lines=True)
20352035
2036+
2037+
.. _io.table_schema:
2038+
2039+
Table Schema
2040+
''''''''''''
2041+
2042+
.. versionadded:: 0.20.0
2043+
2044+
`Table Schema`_ is a spec for describing tabular datasets as a JSON
2045+
object. The JSON includes information on the field names, types, and
2046+
other attributes. You can use the orient ``table`` to build
2047+
a JSON string with two fields, ``schema`` and ``data``.
2048+
2049+
.. ipython:: python
2050+
2051+
df = pd.DataFrame(
2052+
{'A': [1, 2, 3],
2053+
'B': ['a', 'b', 'c'],
2054+
'C': pd.date_range('2016-01-01', freq='d', periods=3),
2055+
}, index=pd.Index(range(3), name='idx'))
2056+
df
2057+
df.to_json(orient='table', date_format="iso")
2058+
2059+
The ``schema`` field contains the ``fields`` key, which itself contains
2060+
a list of column name to type pairs, including the ``Index`` or ``MultiIndex``
2061+
(see below for a list of types).
2062+
The ``schema`` field also contains a ``primaryKey`` field if the (Multi)index
2063+
is unique.
2064+
2065+
The second field, ``data``, contains the serialized data with the ``records``
2066+
orient.
2067+
The index is included, and any datetimes are ISO 8601 formatted, as required
2068+
by the Table Schema spec.
2069+
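A minimal sketch of how a consumer might read the two fields described above. It uses only the standard library; the payload is copied from the ``to_json`` docstring example added in this commit and stands in for the string ``df.to_json(orient='table')`` would return. The variable names are illustrative, not part of pandas.

```python
import json

# Payload copied from the to_json docstring example in this commit.
payload = """
{"schema": {"fields": [{"name": "index", "type": "string"},
                       {"name": "col 1", "type": "string"},
                       {"name": "col 2", "type": "string"}],
            "primaryKey": "index",
            "pandas_version": "0.20.0"},
 "data": [{"index": "row 1", "col 1": "a", "col 2": "b"},
          {"index": "row 2", "col 1": "c", "col 2": "d"}]}
"""

table = json.loads(payload)
schema, records = table["schema"], table["data"]

# Map each field name (index included) to its Table Schema type.
field_types = {field["name"]: field["type"] for field in schema["fields"]}

# Each record is a dict in the ``records`` layout and should carry
# exactly the fields the schema declares.
for record in records:
    assert set(record) == set(field_types)

# primaryKey is only present when the index is unique, so read it defensively.
primary_key = schema.get("primaryKey")

print(field_types)
print(primary_key)
```

Reading ``primaryKey`` with ``dict.get`` keeps the consumer working for frames whose index contains duplicates, where the key is omitted.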
2070+
The full list of supported types is described in the Table Schema
2071+
spec. This table shows the mapping from pandas types:
2072+
2073+
=============== =================
2074+
Pandas type     Table Schema type
2075+
=============== =================
2076+
int64           integer
2077+
float64         number
2078+
bool            boolean
2079+
datetime64[ns]  datetime
2080+
timedelta64[ns] duration
2081+
categorical     any
2082+
object          string
2083+
=============== =================
2084+
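The mapping can be sketched as a small lookup. This is a hypothetical helper written for illustration, not the function pandas uses internally; it works on dtype names as plain strings and uses ``datetime`` and ``string``, the Table Schema spec's names for timestamp and text fields.

```python
# Hypothetical helper, not the pandas implementation: maps a dtype name
# (given as a plain string) to a Table Schema type name.
_EXACT_TYPES = {
    "int64": "integer",
    "float64": "number",
    "bool": "boolean",
    "category": "any",   # pandas spells the categorical dtype "category"
    "object": "string",
}


def table_schema_type(dtype_name: str) -> str:
    """Return the Table Schema type for a pandas dtype name."""
    if dtype_name in _EXACT_TYPES:
        return _EXACT_TYPES[dtype_name]
    # Prefix checks cover unit-qualified names such as "datetime64[ns]".
    if dtype_name.startswith("datetime64"):
        return "datetime"
    if dtype_name.startswith("timedelta64"):
        return "duration"
    # Assumption for this sketch: reject anything the table does not list.
    raise ValueError(f"no Table Schema type known for {dtype_name!r}")


for name in ["int64", "float64", "bool", "datetime64[ns]",
             "timedelta64[ns]", "category", "object"]:
    print(name, "->", table_schema_type(name))
```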
2085+
A few notes on the generated table schema:
2086+
2087+
- The ``schema`` object contains a ``pandas_version`` field. This contains
2088+
the version of pandas' dialect of the schema, and will be incremented
2089+
with each revision.
2090+
- All dates are converted to UTC when serializing, even timezone-naive values,
2091+
which are treated as UTC with an offset of 0.
2092+
2093+
.. ipython:: python
2094+
2095+
from pandas.io.json import build_table_schema
2096+
s = pd.Series(pd.date_range('2016', periods=4))
2097+
build_table_schema(s)
2098+
2099+
- Datetimes with a timezone (before serializing) include an additional field
2100+
``tz`` with the time zone name (e.g. ``'US/Central'``).
2101+
2102+
.. ipython:: python
2103+
2104+
s_tz = pd.Series(pd.date_range('2016', periods=12,
2105+
tz='US/Central'))
2106+
build_table_schema(s_tz)
2107+
2108+
- Periods are converted to timestamps before serialization, and so have the
2109+
same behavior of being converted to UTC. In addition, periods will contain
2110+
an additional field ``freq`` with the period's frequency, e.g. ``'A-DEC'``.
2111+
2112+
.. ipython:: python
2113+
2114+
s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',
2115+
periods=4))
2116+
build_table_schema(s_per)
2117+
2118+
- Categoricals use the ``any`` type and an ``enum`` constraint listing
2119+
the set of possible values. Additionally, an ``ordered`` field is included.
2120+
2121+
.. ipython:: python
2122+
2123+
s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))
2124+
build_table_schema(s_cat)
2125+
2126+
- A ``primaryKey`` field is included *if the index is unique*:
2127+
2128+
.. ipython:: python
2129+
2130+
s_dupe = pd.Series([1, 2], index=[1, 1])
2131+
build_table_schema(s_dupe)
2132+
2133+
- The ``primaryKey`` behavior is the same with MultiIndexes, but in this
2134+
case the ``primaryKey`` is an array:
2135+
2136+
.. ipython:: python
2137+
2138+
s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),
2139+
(0, 1)]))
2140+
build_table_schema(s_multi)
2141+
2142+
- The default naming roughly follows these rules:
2143+
2144+
+ For series, the ``object.name`` is used. If that is ``None``, then the
2145+
name is ``values``.
2146+
+ For DataFrames, the stringified version of the column name is used.
2147+
+ For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
2148+
fallback to ``index`` if that is ``None``.
2149+
+ For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
2150+
then ``level_<i>`` is used.
2151+
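The default naming rules listed above can be sketched in plain Python. The three function names are hypothetical and exist only for this illustration; they are not the helpers pandas uses, and they take names as ordinary Python values instead of pandas objects.

```python
from typing import List, Optional, Sequence


def series_field_name(name: Optional[object]) -> object:
    """Series: use the series name, falling back to 'values'."""
    return "values" if name is None else name


def column_field_name(label: object) -> str:
    """DataFrame column: use the stringified column label."""
    return str(label)


def index_field_names(names: Sequence[Optional[object]]) -> List[object]:
    """Index or MultiIndex: one field name per level.

    A single-level index falls back to 'index'; a MultiIndex level
    without a name falls back to 'level_<i>'.
    """
    if len(names) == 1:
        return ["index" if names[0] is None else names[0]]
    return [f"level_{i}" if name is None else name
            for i, name in enumerate(names)]


print(series_field_name(None))           # values
print(column_field_name(0))              # 0 (as the string "0")
print(index_field_names([None]))         # ['index']
print(index_field_names(["a", None]))    # ['a', 'level_1']
```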
2152+
2153+
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
2154+
20362155
HTML
20372156
----
20382157

doc/source/options.rst

+17
@@ -507,3 +507,20 @@ Enabling ``display.unicode.ambiguous_as_wide`` lets pandas to figure these chara
507507
508508
pd.set_option('display.unicode.east_asian_width', False)
509509
pd.set_option('display.unicode.ambiguous_as_wide', False)
510+
511+
.. _options.table_schema:
512+
513+
Table Schema Display
514+
--------------------
515+
516+
.. versionadded:: 0.20.0
517+
518+
``DataFrame`` and ``Series`` will publish a Table Schema representation
519+
by default. This can be disabled globally with the ``display.html.table_schema``
520+
option:
521+
522+
.. ipython:: python
523+
524+
pd.set_option('display.html.table_schema', False)
525+
526+
By default, only ``display.max_rows`` rows are serialized and published.

doc/source/whatsnew/v0.20.0.txt

+23
@@ -11,6 +11,7 @@ Highlights include:
1111

1212
- Building pandas for development now requires ``cython >= 0.23`` (:issue:`14831`)
1313
- The ``.ix`` indexer has been deprecated, see :ref:`here <whatsnew_0200.api_breaking.deprecate_ix>`
14+
- A new orient for JSON serialization, ``orient='table'``, that uses the Table Schema spec, see :ref:`here <whatsnew_0200.enhancements.table_schema>`
1415

1516
Check the :ref:`API Changes <whatsnew_0200.api_breaking>` and :ref:`deprecations <whatsnew_0200.deprecations>` before updating.
1617

@@ -117,6 +118,28 @@ Notably, a new numerical index, ``UInt64Index``, has been created (:issue:`14937
117118
- Bug in ``pd.unique()`` in which unsigned 64-bit integers were causing overflow (:issue:`14915`)
118119
- Bug in ``pd.value_counts()`` in which unsigned 64-bit integers were being erroneously truncated in the output (:issue:`14934`)
119120

121+
122+
.. _whatsnew_0200.enhancements.table_schema:
123+
124+
Table Schema Output
125+
^^^^^^^^^^^^^^^^^^^
126+
127+
The new orient ``'table'`` for :meth:`DataFrame.to_json`
128+
will generate a `Table Schema`_ compatible string representation of
129+
the data.
130+
131+
.. ipython:: python
132+
133+
df = pd.DataFrame(
134+
{'A': [1, 2, 3],
135+
'B': ['a', 'b', 'c'],
136+
'C': pd.date_range('2016-01-01', freq='d', periods=3),
137+
}, index=pd.Index(range(3), name='idx'))
138+
df
139+
df.to_json(orient='table')
140+
141+
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
142+
120143
.. _whatsnew_0200.enhancements.other:
121144

122145
Other enhancements

pandas/core/generic.py

+49-3
@@ -1075,8 +1075,9 @@ def __setstate__(self, state):
10751075
"""
10761076

10771077
def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
1078-
double_precision=10, force_ascii=True, date_unit='ms',
1079-
default_handler=None, lines=False):
1078+
timedelta_format='epoch', double_precision=10,
1079+
force_ascii=True, date_unit='ms', default_handler=None,
1080+
lines=False):
10801081
"""
10811082
Convert the object to a JSON string.
10821083
@@ -1109,10 +1110,19 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
11091110
- index : dict like {index -> {column -> value}}
11101111
- columns : dict like {column -> {index -> value}}
11111112
- values : just the values array
1113+
- ``'table'`` dict like
1114+
``{'schema': [schema], 'data': [data]}``
1115+
where the schema component is a Table Schema
1116+
describing the data, and the data component is
1117+
like ``orient='records'``.
1118+
1119+
.. versionchanged:: 0.20.0
11121120
11131121
date_format : {'epoch', 'iso'}
11141122
Type of date conversion. `epoch` = epoch milliseconds,
1115-
`iso`` = ISO8601, default is epoch.
1123+
`iso` = ISO8601. Default is epoch, except when orient is
1124+
``'table'``, in which case this parameter is ignored
1125+
and iso formatting is always used.
11161126
double_precision : The number of decimal places to use when encoding
11171127
floating point values, default 10.
11181128
force_ascii : force encoded string to be ASCII, default True.
@@ -1136,6 +1146,42 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
11361146
-------
11371147
same type as input object with filtered info axis
11381148
1149+
See Also
1150+
--------
1151+
pd.read_json
1152+
1153+
Examples
1154+
--------
1155+
1156+
>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
1157+
... index=['row 1', 'row 2'],
1158+
... columns=['col 1', 'col 2'])
1159+
>>> df.to_json(orient='split')
1160+
'{"columns":["col 1","col 2"],
1161+
"index":["row 1","row 2"],
1162+
"data":[["a","b"],["c","d"]]}'
1163+
1164+
Encoding/decoding a DataFrame using ``'index'`` formatted JSON:
1165+
1166+
>>> df.to_json(orient='index')
1167+
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
1168+
1169+
Encoding/decoding a DataFrame using ``'records'`` formatted JSON.
1170+
Note that index labels are not preserved with this encoding.
1171+
1172+
>>> df.to_json(orient='records')
1173+
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
1174+
1175+
Encoding with Table Schema
1176+
1177+
>>> df.to_json(orient='table')
1178+
'{"schema": {"fields": [{"name": "index", "type": "string"},
1179+
{"name": "col 1", "type": "string"},
1180+
{"name": "col 2", "type": "string"}],
1181+
"primaryKey": "index",
1182+
"pandas_version": "0.20.0"},
1183+
"data": [{"index": "row 1", "col 1": "a", "col 2": "b"},
1184+
{"index": "row 2", "col 1": "c", "col 2": "d"}]}'
11391185
"""
11401186

11411187
from pandas.io import json

pandas/io/json/__init__.py

+2-1
@@ -1,4 +1,5 @@
11
from .json import to_json, read_json, loads, dumps # noqa
22
from .normalize import json_normalize # noqa
3+
from .table_schema import build_table_schema # noqa
34

4-
del json, normalize # noqa
5+
del json, normalize, table_schema # noqa
