Commit 2a8652a

ENH: Added to_json_schema
Lays the groundwork for pandas-dev#14386. This handles the schema part of the request there. We'll still need to do the work to publish the data to the frontend, but that can be done as a followup.

DOC: More notes in prose docs Move files use isoformat updates Moved to to_json json_table no config refactor with classes Added duration tests more timedelta Change default orient Series test fixup docs JSON Table -> Table doc Change to table orient added version Handle Categorical Many more tests
1 parent 7d6afc4 commit 2a8652a

5 files changed

+894
-17
lines changed

doc/source/io.rst

+123
@@ -2033,6 +2033,129 @@ using Hadoop or Spark.
20332033
df
20342034
df.to_json(orient='records', lines=True)
20352035
2036+
2037+
Table Schema
2038+
''''''''''''
2039+
2040+
`Table Schema`_ is a spec for describing tabular datasets as a JSON
2041+
object. The JSON includes information on the field names, types, and
2042+
other attributes. You can use the orient ``'table'`` to build
2043+
a JSON string with two fields, ``schema`` and ``data``.
2044+
2045+
.. ipython:: python
2046+
2047+
df = pd.DataFrame(
2048+
{'A': [1, 2, 3],
2049+
'B': ['a', 'b', 'c'],
2050+
'C': pd.date_range('2016-01-01', freq='d', periods=3),
2051+
}, index=pd.Index(range(3), name='idx'))
2052+
df
2053+
df.to_json(orient='table', date_format="iso")
2054+
2055+
The ``schema`` field contains the ``fields`` key, which itself contains
2056+
a list of column name to type pairs, including the Index or MultiIndex
2057+
(see below for a list of types).
2058+
The ``schema`` field also contains a ``primary_key`` field if the (Multi)index
2059+
is unique.
2060+
2061+
The second field, ``data``, contains the serialized data with the ``records``
2062+
orient.
2063+
The index is included, and any datetimes are ISO 8601 formatted, as required
2064+
by the Table Schema spec. You must pass the ``date_format='iso'`` parameter.
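A consumer can separate the two fields using only the standard library. The following is a minimal sketch, assuming a payload shaped like the ``to_json`` docstring example added in this commit; ``split_table_payload`` is a hypothetical helper name used for illustration and is not part of pandas.

```python
import json

# Sample payload shaped like the to_json docstring example in this commit.
# The exact keys are an assumption taken from that example.
payload_text = (
    '{"schema": {"fields": [{"name": "index", "type": "string"},'
    ' {"name": "col 1", "type": "string"},'
    ' {"name": "col 2", "type": "string"}],'
    ' "primary_key": "index"},'
    ' "data": [{"index": "row 1", "col 1": "a", "col 2": "b"},'
    ' {"index": "row 2", "col 1": "c", "col 2": "d"}]}'
)


def split_table_payload(text):
    """Hypothetical helper: return (name -> type mapping, list of records)."""
    payload = json.loads(text)
    # Each entry of schema["fields"] is a dict with "name" and "type" keys.
    fields = {f["name"]: f["type"] for f in payload["schema"]["fields"]}
    # "data" is a list of row dicts, like orient='records' plus the index.
    return fields, payload["data"]


fields, records = split_table_payload(payload_text)
print(fields)
print(records[0])
```

Keeping the schema lookup separate from the records lets a reader validate or convert each column by its declared type before touching the rows.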
2065+
2066+
The full list of supported types is described in the Table Schema
2067+
spec. This table shows the mapping from pandas types:
2068+
2069+
=============== =================
2070+
Pandas type     Table Schema type
2071+
=============== =================
2072+
int64           integer
2073+
float64         number
2074+
bool            boolean
2075+
datetime64[ns]  date
2076+
timedelta64[ns] duration
2077+
categorical     any
2078+
object          string
2079+
=============== =================
2080+
2081+
A few notes on the generated table schema:
2082+
2083+
.. ipython:: python
2084+
:suppress:
2085+
from pandas.io.json import build_table_schema
2086+
2087+
.. warning::
2088+
2089+
The code examples below use a function, ``build_table_schema``, that is
2090+
not yet part of the public API. Let us know if you think it'd be useful.
2091+
2092+
- The ``schema`` object contains a ``pandas_version`` field. This contains
2093+
the version of pandas' dialect of the schema, and will be incremented
2094+
with each revision.
2095+
- All dates are converted to UTC when serializing, even timezone-naive values,
2096+
which are treated as UTC with an offset of 0.
2097+
2098+
.. ipython:: python
2099+
2100+
s = pd.Series(pd.date_range('2016', periods=4))
2101+
build_table_schema(s)
2102+
2103+
- Datetimes with a timezone (before serializing) include an additional field
2104+
``tz`` with the time zone name (e.g. ``'US/Central'``).
2105+
2106+
.. ipython:: python
2107+
2108+
s_tz = pd.Series(pd.date_range('2016', periods=12,
2109+
tz='US/Central'))
2110+
build_table_schema(s_tz)
2111+
2112+
- Periods are converted to timestamps before serialization, and so have the
2113+
same behavior of being converted to UTC. In addition, periods will contain
2114+
an additional field ``freq`` with the period's frequency, e.g. ``'A-DEC'``.
2115+
2116+
.. ipython:: python
2117+
2118+
s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',
2119+
periods=4))
2120+
build_table_schema(s_per)
2121+
2122+
- Categoricals use the ``any`` type and an ``enum`` constraint listing
2123+
the set of possible values. Additionally, an ``ordered`` field is included.
2124+
2125+
.. ipython:: python
2126+
2127+
s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))
2128+
build_table_schema(s_cat)
2129+
2130+
- A ``primary_key`` field is included *if the index is unique*:
2131+
2132+
.. ipython:: python
2133+
2134+
s_dupe = pd.Series([1, 2], index=[1, 1])
2135+
build_table_schema(s_dupe)
2136+
2137+
- The ``primary_key`` behavior is the same with MultiIndexes, but in this
2138+
case the ``primary_key`` is an array:
2139+
2140+
.. ipython:: python
2141+
2142+
s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),
2143+
(0, 1)]))
2144+
build_table_schema(s_multi)
2145+
2146+
- The default naming roughly follows these rules:
2147+
2148+
+ For ``Series``, ``object.name`` is used. If that is ``None``, then the
2149+
name is ``values``.
2150+
+ For ``DataFrame``, the stringified version of the column name is used.
2151+
+ For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
2152+
fallback to ``index`` if that is None.
2153+
+ For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
2154+
then ``level_<i>`` is used.
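The naming rules above can be sketched in plain Python. This is an illustration of the rules as written, not the pandas implementation: all four function names are hypothetical, and the sketch assumes that ``<i>`` in ``level_<i>`` is the zero-based position of the unnamed level.

```python
def series_field_name(name):
    """Series: use the name, falling back to 'values' when it is None."""
    return "values" if name is None else name


def column_field_names(columns):
    """DataFrame: use the stringified version of each column label."""
    return [str(label) for label in columns]


def index_field_name(name):
    """Index (not MultiIndex): use the name, falling back to 'index'."""
    return "index" if name is None else name


def multiindex_field_names(names):
    """MultiIndex: use each level name, or 'level_<i>' for unnamed levels.

    Assumption: <i> is the zero-based position of the level.
    """
    return [
        "level_{}".format(i) if name is None else name
        for i, name in enumerate(names)
    ]


print(series_field_name(None))
print(column_field_names([0, "B"]))
print(index_field_name("idx"))
print(multiindex_field_names(["outer", None]))
```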
2155+
2156+
2157+
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
2158+
20362159
HTML
20372160
----
20382161

doc/source/whatsnew/v0.20.0.txt

+20
@@ -114,6 +114,26 @@ Notably, a new numerical index, ``UInt64Index``, has been created (:issue:`14937
114114
- Bug in ``pd.unique()`` in which unsigned 64-bit integers were causing overflow (:issue:`14915`)
115115
- Bug in ``pd.value_counts()`` in which unsigned 64-bit integers were being erroneously truncated in the output (:issue:`14934`)
116116

117+
118+
Table Schema Output
119+
^^^^^^^^^^^^^^^^^^^
120+
121+
The new orient ``'table'`` for :meth:`DataFrame.to_json`
122+
will generate a `Table Schema`_ compatible string representation of
123+
the data.
124+
125+
.. ipython:: python
126+
127+
df = pd.DataFrame(
128+
{'A': [1, 2, 3],
129+
'B': ['a', 'b', 'c'],
130+
'C': pd.date_range('2016-01-01', freq='d', periods=3),
131+
}, index=pd.Index(range(3), name='idx'))
132+
df
133+
df.to_json(orient='table')
134+
135+
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
136+
117137
.. _whatsnew_0200.enhancements.other:
118138

119139
Other enhancements

pandas/core/generic.py

+48-3
@@ -1075,8 +1075,9 @@ def __setstate__(self, state):
10751075
"""
10761076

10771077
def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
1078-
double_precision=10, force_ascii=True, date_unit='ms',
1079-
default_handler=None, lines=False):
1078+
timedelta_format='epoch', double_precision=10,
1079+
force_ascii=True, date_unit='ms', default_handler=None,
1080+
lines=False):
10801081
"""
10811082
Convert the object to a JSON string.
10821083
@@ -1109,10 +1110,19 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
11091110
- index : dict like {index -> {column -> value}}
11101111
- columns : dict like {column -> {index -> value}}
11111112
- values : just the values array
1113+
- ``'table'`` dict like
1114+
``{'schema': [schema], 'data': [data]}``
1115+
where the schema component is a Table Schema
1116+
describing the data, and the data component is
1117+
like ``orient='records'``.
1118+
1119+
.. versionchanged:: 0.20.0
11121120
11131121
date_format : {'epoch', 'iso'}
11141122
Type of date conversion. `epoch` = epoch milliseconds,
1115-
`iso`` = ISO8601, default is epoch.
1123+
`iso` = ISO8601. Default is epoch, except when orient is
1124+
``'table'``, in which case this parameter is ignored
1125+
and iso formatting is always used.
11161126
double_precision : The number of decimal places to use when encoding
11171127
floating point values, default 10.
11181128
force_ascii : force encoded string to be ASCII, default True.
@@ -1136,6 +1146,41 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
11361146
-------
11371147
same type as input object with filtered info axis
11381148
1149+
See Also
1150+
--------
1151+
pd.read_json
1152+
1153+
Examples
1154+
--------
1155+
1156+
>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
1157+
... index=['row 1', 'row 2'],
1158+
... columns=['col 1', 'col 2'])
1159+
>>> df.to_json(orient='split')
1160+
'{"columns":["col 1","col 2"],
1161+
"index":["row 1","row 2"],
1162+
"data":[["a","b"],["c","d"]]}'
1163+
1164+
Encoding/decoding a DataFrame using ``'index'`` formatted JSON:
1165+
1166+
>>> df.to_json(orient='index')
1167+
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
1168+
1169+
Encoding/decoding a DataFrame using ``'records'`` formatted JSON.
1170+
Note that index labels are not preserved with this encoding.
1171+
1172+
>>> df.to_json(orient='records')
1173+
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
1174+
1175+
Encoding with Table Schema
1176+
1177+
>>> df.to_json(orient='table')
1178+
'{"schema": {"fields": [{"name": "index", "type": "string"},
1179+
{"name": "col 1", "type": "string"},
1180+
{"name": "col 2", "type": "string"}],
1181+
"primary_key": "index"},
1182+
"data": [{"index":"row 1","col 1":"a","col 2":"b"},
1183+
{"index":"row 2","col 1":"c","col 2":"d"}]}'
11391184
"""
11401185

11411186
from pandas.io import json
