Commit ac25ce5

ENH: Added to_json_schema
Lays the groundwork for pandas-dev#14386. This handles the schema part of the request there. We'll still need to do the work to publish the data to the frontend, but that can be done as a followup.

DOC: More notes in prose docs Move files use isoformat updates Moved to to_json json_table no config refactor with classes Added duration tests more timedelta Change default orient Series test fixup docs JSON Table -> Table doc Change to table orient added version Handle Categorical Many more tests
1 parent be4a63f commit ac25ce5

File tree

10 files changed

+997
-15
lines changed

doc/source/api.rst

+1
@@ -60,6 +60,7 @@ JSON
6060
:toctree: generated/
6161

6262
json_normalize
63+
build_table_schema
6364

6465
.. currentmodule:: pandas
6566

doc/source/io.rst

+119
@@ -2033,6 +2033,125 @@ using Hadoop or Spark.
20332033
df
20342034
df.to_json(orient='records', lines=True)
20352035
2036+
2037+
.. _io.table_schema:
2038+
2039+
Table Schema
2040+
''''''''''''
2041+
2042+
.. versionadded:: 0.20.0
2043+
2044+
`Table Schema`_ is a spec for describing tabular datasets as a JSON
2045+
object. The JSON includes information on the field names, types, and
2046+
other attributes. You can use the orient ``table`` to build
2047+
a JSON string with two fields, ``schema`` and ``data``.
2048+
2049+
.. ipython:: python
2050+
2051+
df = pd.DataFrame(
2052+
{'A': [1, 2, 3],
2053+
'B': ['a', 'b', 'c'],
2054+
'C': pd.date_range('2016-01-01', freq='d', periods=3),
2055+
}, index=pd.Index(range(3), name='idx'))
2056+
df
2057+
df.to_json(orient='table', date_format="iso")
2058+
2059+
The ``schema`` field contains the ``fields`` key, which itself contains
2060+
a list of column name to type pairs, including the ``Index`` or ``MultiIndex``
2061+
(see below for a list of types).
2062+
The ``schema`` field also contains a ``primaryKey`` field if the (Multi)index
2063+
is unique.
2064+
2065+
The second field, ``data``, contains the serialized data with the ``records``
2066+
orient.
2067+
The index is included, and any datetimes are ISO 8601 formatted, as required
2068+
by the Table Schema spec.
2069+
2070+
The full list of supported types is described in the Table Schema
2071+
spec. This table shows the mapping from pandas types:
2072+
2073+
=============== =================
Pandas type     Table Schema type
=============== =================
int64           integer
float64         number
bool            boolean
datetime64[ns]  datetime
timedelta64[ns] duration
categorical     any
object          string
=============== =================
2084+
2085+
A few notes on the generated table schema:
2086+
2087+
- The ``schema`` object contains a ``pandas_version`` field. This contains
2088+
the version of pandas' dialect of the schema, and will be incremented
2089+
with each revision.
2090+
- All dates are converted to UTC when serializing, even timezone-naive values,
2091+
which are treated as UTC with an offset of 0.
2092+
2093+
.. ipython:: python
2094+
2095+
from pandas.io.json import build_table_schema
2096+
s = pd.Series(pd.date_range('2016', periods=4))
2097+
build_table_schema(s)
2098+
2099+
- Datetimes with a timezone (before serializing) include an additional field
2100+
``tz`` with the time zone name (e.g. ``'US/Central'``).
2101+
2102+
.. ipython:: python
2103+
2104+
s_tz = pd.Series(pd.date_range('2016', periods=12,
2105+
tz='US/Central'))
2106+
build_table_schema(s_tz)
2107+
2108+
- Periods are converted to timestamps before serialization, and so have the
2109+
same behavior of being converted to UTC. In addition, periods will contain
2110+
an additional field ``freq`` with the period's frequency, e.g. ``'A-DEC'``.
2111+
2112+
.. ipython:: python
2113+
2114+
s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',
2115+
periods=4))
2116+
build_table_schema(s_per)
2117+
2118+
- Categoricals use the ``any`` type and an ``enum`` constraint listing
2119+
the set of possible values. Additionally, an ``ordered`` field is included.
2120+
2121+
.. ipython:: python
2122+
2123+
s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))
2124+
build_table_schema(s_cat)
2125+
2126+
- A ``primaryKey`` field is included *if the index is unique*:
2127+
2128+
.. ipython:: python
2129+
2130+
s_dupe = pd.Series([1, 2], index=[1, 1])
2131+
build_table_schema(s_dupe)
2132+
2133+
- The ``primaryKey`` behavior is the same with MultiIndexes, but in this
2134+
case the ``primaryKey`` is an array:
2135+
2136+
.. ipython:: python
2137+
2138+
s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),
2139+
(0, 1)]))
2140+
build_table_schema(s_multi)
2141+
2142+
- The default naming roughly follows these rules:
2143+
2144+
+ For ``Series``, ``object.name`` is used. If that is ``None``, then the
2145+
name is ``values``
2146+
+ For DataFrames, the stringified version of the column name is used
2147+
+ For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
2148+
fallback to ``index`` if that is None.
2149+
+ For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
2150+
then ``level_<i>`` is used.
2151+
2152+
2153+
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
2154+
20362155
HTML
20372156
----
20382157
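The default naming and ``primaryKey`` rules described in the new ``io.rst`` section can be checked with a short script. This is a minimal sketch, assuming a pandas release that ships ``pandas.io.json.build_table_schema`` with the behavior documented above; ``field_names`` is a helper introduced here for illustration only, and the exact form of ``primaryKey`` may differ between versions.

```python
import pandas as pd
from pandas.io.json import build_table_schema


def field_names(schema):
    """Return the field names of a schema dict, in order (illustrative helper)."""
    return [field['name'] for field in schema['fields']]


# Unnamed Series with an unnamed Index: the documented fallbacks apply.
s_plain = pd.Series([1, 2])

# Named Series with a named Index: the names are used as-is.
s_named = pd.Series([1, 2], name='price', index=pd.Index([10, 20], name='id'))

# MultiIndex without level names: level_<i> is used for each level.
s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'), (0, 1)]))

# Duplicate index values: no primaryKey should be emitted.
s_dupe = pd.Series([1, 2], index=[1, 1])

schema_plain = build_table_schema(s_plain)
schema_named = build_table_schema(s_named)
schema_multi = build_table_schema(s_multi)
schema_dupe = build_table_schema(s_dupe)

print(field_names(schema_plain))
print(field_names(schema_named))
print(field_names(schema_multi))
print('primaryKey' in schema_plain, 'primaryKey' in schema_dupe)
```

Comparing ``schema_plain`` with ``schema_dupe`` is the quickest way to see that ``primaryKey`` depends on index uniqueness rather than on the index type.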

doc/source/options.rst

+17
@@ -507,3 +507,20 @@ Enabling ``display.unicode.ambiguous_as_wide`` lets pandas to figure these chara
507507
508508
pd.set_option('display.unicode.east_asian_width', False)
509509
pd.set_option('display.unicode.ambiguous_as_wide', False)
510+
511+
.. _options.table_schema:
512+
513+
Table Schema Display
514+
--------------------
515+
516+
.. versionadded:: 0.20.0
517+
518+
``DataFrame`` and ``Series`` will publish a Table Schema representation
519+
by default. This can be disabled globally with the ``display.html.table_schema``
520+
option:
521+
522+
.. ipython:: python
523+
524+
pd.set_option('display.html.table_schema', False)
525+
526+
By default, only the first ``display.max_rows`` rows are serialized and published.
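The row limit mentioned in ``options.rst`` can be approximated by hand. The sketch below assumes the published payload is built from the first ``display.max_rows`` rows via ``head``; the publishing code is not part of this diff, so that is an assumption about its behavior, not a description of it.

```python
import json

import pandas as pd

df = pd.DataFrame({'A': range(20)})

# Temporarily lower display.max_rows and emulate the truncation by hand.
with pd.option_context('display.max_rows', 5):
    limit = pd.get_option('display.max_rows')
    # Assumption: only the first `limit` rows would be serialized.
    payload = json.loads(df.head(limit).to_json(orient='table'))

# The schema still describes every column; only the data is truncated.
print(len(payload['data']))
print([field['name'] for field in payload['schema']['fields']])
```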

doc/source/whatsnew/v0.20.0.txt

+24
@@ -11,7 +11,9 @@ Highlights include:
1111

1212
- Building pandas for development now requires ``cython >= 0.23`` (:issue:`14831`)
1313
- The ``.ix`` indexer has been deprecated, see :ref:`here <whatsnew_0200.api_breaking.deprecate_ix>`
1415
- Switched the test framework to `pytest`_ (:issue:`13097`)
16+
- A new orient for JSON serialization, ``orient='table'``, that uses the Table Schema spec, see :ref:`here <whatsnew_0200.enhancements.table_schema>`
1517

1618
.. _pytest: http://doc.pytest.org/en/latest/
1719

@@ -120,6 +122,28 @@ Notably, a new numerical index, ``UInt64Index``, has been created (:issue:`14937
120122
- Bug in ``pd.unique()`` in which unsigned 64-bit integers were causing overflow (:issue:`14915`)
121123
- Bug in ``pd.value_counts()`` in which unsigned 64-bit integers were being erroneously truncated in the output (:issue:`14934`)
122124

125+
126+
.. _whatsnew_0200.enhancements.table_schema:
127+
128+
Table Schema Output
129+
^^^^^^^^^^^^^^^^^^^
130+
131+
The new orient ``'table'`` for :meth:`DataFrame.to_json`
132+
will generate a `Table Schema`_ compatible string representation of
133+
the data.
134+
135+
.. ipython:: python
136+
137+
df = pd.DataFrame(
138+
{'A': [1, 2, 3],
139+
'B': ['a', 'b', 'c'],
140+
'C': pd.date_range('2016-01-01', freq='d', periods=3),
141+
}, index=pd.Index(range(3), name='idx'))
142+
df
143+
df.to_json(orient='table')
144+
145+
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
146+
123147
.. _whatsnew_0200.enhancements.other:
124148

125149
Other enhancements
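To see what the whatsnew entry's ``orient='table'`` string contains, it can be parsed back with the standard ``json`` module. This is a minimal sketch, assuming a pandas release that supports ``orient='table'``; the frame mirrors the one in the entry, with the frequency spelled ``'D'``, and the exact field types and ``primaryKey`` form may vary by version.

```python
import json

import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3],
     'B': ['a', 'b', 'c'],
     'C': pd.date_range('2016-01-01', freq='D', periods=3)},
    index=pd.Index(range(3), name='idx'))

# orient='table' returns a single JSON string; parse it to look inside.
parsed = json.loads(df.to_json(orient='table'))

schema = parsed['schema']   # field descriptions plus metadata
records = parsed['data']    # one dict per row, index included

# Map each field name to its Table Schema type.
field_types = {field['name']: field['type'] for field in schema['fields']}

print(sorted(parsed))
print(field_types)
print(schema.get('primaryKey'))
print(records[0])
```

Each record carries the index value under its name (``idx`` here), which is why the ``data`` part resembles ``orient='records'`` with the index added.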

pandas/core/generic.py

+48-3
@@ -1094,8 +1094,9 @@ def __setstate__(self, state):
10941094
"""
10951095

10961096
def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
1097-
double_precision=10, force_ascii=True, date_unit='ms',
1098-
default_handler=None, lines=False):
1097+
timedelta_format='epoch', double_precision=10,
1098+
force_ascii=True, date_unit='ms', default_handler=None,
1099+
lines=False):
10991100
"""
11001101
Convert the object to a JSON string.
11011102
@@ -1128,10 +1129,18 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
11281129
- index : dict like {index -> {column -> value}}
11291130
- columns : dict like {column -> {index -> value}}
11301131
- values : just the values array
1132+
- table : dict like {'schema': {schema}, 'data': {data}}
1133+
the schema component is a `Table Schema`_
1134+
describing the data, and the data component is
1135+
like ``orient='records'``.
1136+
1137+
.. versionchanged:: 0.20.0
11311138
11321139
date_format : {'epoch', 'iso'}
11331140
Type of date conversion. `epoch` = epoch milliseconds,
1134-
`iso`` = ISO8601, default is epoch.
1141+
`iso` = ISO8601. Default is epoch, except when orient is
1142+
``'table'``, in which case this parameter is ignored
1143+
and iso formatting is always used.
11351144
double_precision : The number of decimal places to use when encoding
11361145
floating point values, default 10.
11371146
force_ascii : force encoded string to be ASCII, default True.
@@ -1155,6 +1164,42 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
11551164
-------
11561165
same type as input object with filtered info axis
11571166
1167+
See Also
1168+
--------
1169+
pd.read_json
1170+
1171+
Examples
1172+
--------
1173+
1174+
>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
1175+
... index=['row 1', 'row 2'],
1176+
... columns=['col 1', 'col 2'])
1177+
>>> df.to_json(orient='split')
1178+
'{"columns":["col 1","col 2"],
1179+
"index":["row 1","row 2"],
1180+
"data":[["a","b"],["c","d"]]}'
1181+
1182+
Encoding/decoding a DataFrame using ``'index'`` formatted JSON:
1183+
1184+
>>> df.to_json(orient='index')
1185+
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
1186+
1187+
Encoding/decoding a DataFrame using ``'records'`` formatted JSON.
1188+
Note that index labels are not preserved with this encoding.
1189+
1190+
>>> df.to_json(orient='records')
1191+
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
1192+
1193+
Encoding with Table Schema:
1194+
1195+
>>> df.to_json(orient='table')
1196+
'{"schema": {"fields": [{"name": "index", "type": "string"},
1197+
{"name": "col 1", "type": "string"},
1198+
{"name": "col 2", "type": "string"}],
1199+
"primaryKey": "index",
1200+
"pandas_version": "0.20.0"},
1201+
"data": [{"index": "row 1", "col 1": "a", "col 2": "b"},
1202+
{"index": "row 2", "col 1": "c", "col 2": "d"}]}'
11581203
"""
11591204

11601205
from pandas.io import json
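The ``date_format`` note in the docstring can be illustrated by comparing two orients on the same frame. This is a sketch under the assumption that ``orient='records'`` with ``date_format='epoch'`` yields integer timestamps while ``orient='table'`` yields ISO 8601 strings; the column name ``when`` is made up for the example, and the exact ISO suffix may differ between pandas versions.

```python
import json

import pandas as pd

df = pd.DataFrame({'when': pd.date_range('2016-01-01', periods=2, freq='D')})

# Epoch formatting: datetimes become integers (milliseconds by default).
as_records = json.loads(df.to_json(orient='records', date_format='epoch'))

# Table orient: datetimes are written as ISO 8601 strings.
as_table = json.loads(df.to_json(orient='table'))

epoch_value = as_records[0]['when']
iso_value = as_table['data'][0]['when']

print(type(epoch_value).__name__, epoch_value)
print(type(iso_value).__name__, iso_value)
```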

pandas/io/json/__init__.py

+2-1
@@ -1,4 +1,5 @@
11
from .json import to_json, read_json, loads, dumps # noqa
22
from .normalize import json_normalize # noqa
3+
from .table_schema import build_table_schema # noqa
34

4-
del json, normalize # noqa
5+
del json, normalize, table_schema # noqa
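With ``build_table_schema`` re-exported from ``pandas.io.json``, the pandas-to-Table-Schema type mapping can be spot-checked directly. The sketch below assumes a pandas release where that import path is available; the column names are arbitrary, and the object/string case is left out because its reported type can depend on the pandas version.

```python
import pandas as pd
from pandas.io.json import build_table_schema

# One column per pandas dtype of interest.
df = pd.DataFrame({
    'i': pd.Series([1, 2, 3], dtype='int64'),
    'f': pd.Series([1.0, 2.0, 3.0], dtype='float64'),
    'b': pd.Series([True, False, True], dtype='bool'),
    'dt': pd.date_range('2016-01-01', periods=3, freq='D'),
    'td': pd.timedelta_range('1 day', periods=3, freq='D'),
    'cat': pd.Categorical(['a', 'b', 'a']),
})

schema = build_table_schema(df)

# Map each field name to the Table Schema type reported for it.
types = {field['name']: field['type'] for field in schema['fields']}

for name, table_type in types.items():
    print(name, '->', table_type)
```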