ENH: Added to_json_schema

TomAugspurger · TomAugspurger · commit 902f5187a60d · 2017-02-07T16:56:09.000-06:00
Lays the groundwork for pandas-dev#14386. This handles the schema part of the request there. We'll still need to do the work to publish the data to the frontend, but that can be done as a followup.

DOC: More notes in prose docs Move files use isoformat updates Moved to to_json json_table no config refactor with classes Added duration tests more timedelta Change default orient Series test fixup docs JSON Table -> Table doc Change to table orient added version Handle Categorical Many more tests
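
For reviewers who want to see the shape of the output without building pandas, here is a minimal standard-library sketch of how the `schema` and `data` parts are joined. It mirrors the string formatting in `JSONTableWriter.write` from this diff. The helper name `wrap_table_payload` is hypothetical, and the schema and records are hand-written for illustration, not produced by pandas.

```python
import json


def wrap_table_payload(schema, records_json):
    """Join a schema dict and an already-serialized records string.

    Hypothetical helper: it only mirrors the formatting step in
    JSONTableWriter.write and is not part of pandas.
    """
    return '{{"schema": {}, "data": {}}}'.format(
        json.dumps(schema, sort_keys=False), records_json)


# Hand-written inputs (assumptions for illustration only).
schema = {
    "fields": [
        {"name": "idx", "type": "integer"},
        {"name": "A", "type": "integer"},
    ],
    "primaryKey": "idx",
    "pandas_version": "0.20.0",
}
# Stands in for the 'records'-oriented JSON string that the parent writer returns.
records_json = '[{"idx":0,"A":1},{"idx":1,"A":2}]'

payload = wrap_table_payload(schema, records_json)
parsed = json.loads(payload)
print(sorted(parsed))
```

The data part is spliced in as a string rather than re-encoded, which is presumably why the writer formats the envelope by hand instead of calling `json.dumps` on the whole object.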
diff --git a/doc/source/api.rst b/doc/source/api.rst
@@ -60,6 +60,7 @@ JSON
    :toctree: generated/
 
    json_normalize
+   build_table_schema
 
 .. currentmodule:: pandas
 
diff --git a/doc/source/io.rst b/doc/source/io.rst
@@ -2033,6 +2033,125 @@ using Hadoop or Spark.
   df
   df.to_json(orient='records', lines=True)
 
+
+.. _io.table_schema:
+
+Table Schema
+''''''''''''
+
+.. versionadded:: 0.20.0
+
+`Table Schema`_ is a spec for describing tabular datasets as a JSON
+object. The JSON includes information on the field names, types, and
+other attributes. You can use the orient ``table`` to build
+a JSON string with two fields, ``schema`` and ``data``.
+
+.. ipython:: python
+
+   df = pd.DataFrame(
+       {'A': [1, 2, 3],
+        'B': ['a', 'b', 'c'],
+        'C': pd.date_range('2016-01-01', freq='d', periods=3),
+       }, index=pd.Index(range(3), name='idx'))
+   df
+   df.to_json(orient='table', date_format="iso")
+
+The ``schema`` field contains the ``fields`` key, which itself contains
+a list of column name to type pairs, including the ``Index`` or ``MultiIndex``
+(see below for a list of types).
+The ``schema`` field also contains a ``primaryKey`` field if the (Multi)index
+is unique.
+
+The second field, ``data``, contains the serialized data with the ``records``
+orient.
+The index is included, and any datetimes are ISO 8601 formatted, as required
+by the Table Schema spec.
+
+The full list of supported types is described in the Table Schema
+spec. This table shows the mapping from pandas types:
+
+===============  =================
+Pandas type      Table Schema type
+===============  =================
+int64            integer
+float64          number
+bool             boolean
+datetime64[ns]   date
+timedelta64[ns]  duration
+categorical      any
+object           string
+===============  =================
+
+A few notes on the generated table schema:
+
+- The ``schema`` object contains a ``pandas_version`` field. This contains
+  the version of pandas' dialect of the schema, and will be incremented
+  with each revision.
+- All dates are converted to UTC when serializing, even timezone naïve
+  values, which are treated as UTC with an offset of 0.
+
+  .. ipython:: python
+
+     from pandas.io.json import build_table_schema
+     s = pd.Series(pd.date_range('2016', periods=4))
+     build_table_schema(s)
+
+- Datetimes with a timezone (before serializing) include an additional field
+  ``tz`` with the time zone name (e.g. ``'US/Central'``).
+
+  .. ipython:: python
+
+     s_tz = pd.Series(pd.date_range('2016', periods=12,
+                                    tz='US/Central'))
+     build_table_schema(s_tz)
+
+- Periods are converted to timestamps before serialization, and so have the
+  same behavior of being converted to UTC. In addition, periods will contain
+  an additional field ``freq`` with the period's frequency, e.g. ``'A-DEC'``.
+
+  .. ipython:: python
+
+     s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',
+                                                periods=4))
+     build_table_schema(s_per)
+
+- Categoricals use the ``any`` type and an ``enum`` constraint listing
+  the set of possible values. Additionally, an ``ordered`` field is included.
+
+  .. ipython:: python
+
+     s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))
+     build_table_schema(s_cat)
+
+- A ``primaryKey`` field is included *if the index is unique*:
+
+  .. ipython:: python
+
+     s_dupe = pd.Series([1, 2], index=[1, 1])
+     build_table_schema(s_dupe)
+
+- The ``primaryKey`` behavior is the same with MultiIndexes, but in this
+  case the ``primaryKey`` is an array:
+
+  .. ipython:: python
+
+     s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),
+                                                              (0, 1)]))
+     build_table_schema(s_multi)
+
+- The default naming roughly follows these rules:
+
+  + For ``Series``, the ``object.name`` is used. If that is ``None``, then the
+    name is ``values``.
+  + For DataFrames, the stringified version of the column name is used.
+  + For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
+    fallback to ``index`` if that is None.
+  + For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
+    then ``level_<i>`` is used.
+
+
+.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
+
 HTML
 ----
 
diff --git a/doc/source/options.rst b/doc/source/options.rst
@@ -507,3 +507,20 @@ Enabling ``display.unicode.ambiguous_as_wide`` lets pandas to figure these chara
 
    pd.set_option('display.unicode.east_asian_width', False)
    pd.set_option('display.unicode.ambiguous_as_wide', False)
+
+.. _options.table_schema:
+
+Table Schema Display
+--------------------
+
+.. versionadded:: 0.20.0
+
+``DataFrame`` and ``Series`` will publish a Table Schema representation
+by default. This can be disabled globally with the
+``display.html.table_schema`` option:
+
+.. ipython:: python
+
+  pd.set_option('display.html.table_schema', False)
+
+By default, at most ``'display.max_rows'`` rows are serialized and published.
diff --git a/doc/source/whatsnew/v0.20.0.txt b/doc/source/whatsnew/v0.20.0.txt
@@ -11,6 +11,7 @@ Highlights include:
 
 - Building pandas for development now requires ``cython >= 0.23`` (:issue:`14831`)
 - The ``.ix`` indexer has been deprecated, see :ref:`here <whatsnew_0200.api_breaking.deprecate_ix>`
+- A new orient for JSON serialization, ``orient='table'``, that uses the Table Schema spec, see :ref:`here <whatsnew_0200.enhancements.table_schema>`
 
 Check the :ref:`API Changes <whatsnew_0200.api_breaking>` and :ref:`deprecations <whatsnew_0200.deprecations>` before updating.
 
@@ -117,6 +118,28 @@ Notably, a new numerical index, ``UInt64Index``, has been created (:issue:`14937
 - Bug in ``pd.unique()`` in which unsigned 64-bit integers were causing overflow (:issue:`14915`)
 - Bug in ``pd.value_counts()`` in which unsigned 64-bit integers were being erroneously truncated in the output (:issue:`14934`)
 
+
+.. _whatsnew_0200.enhancements.table_schema:
+
+Table Schema Output
+^^^^^^^^^^^^^^^^^^^
+
+The new orient ``'table'`` for :meth:`DataFrame.to_json`
+will generate a `Table Schema`_ compatible string representation of
+the data.
+
+.. ipython:: python
+
+   df = pd.DataFrame(
+       {'A': [1, 2, 3],
+        'B': ['a', 'b', 'c'],
+        'C': pd.date_range('2016-01-01', freq='d', periods=3),
+       }, index=pd.Index(range(3), name='idx'))
+   df
+   df.to_json(orient='table', date_format='iso')
+
+.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
+
 .. _whatsnew_0200.enhancements.other:
 
 Other enhancements
diff --git a/pandas/core/generic.py b/pandas/core/generic.py
@@ -1075,8 +1075,9 @@ def __setstate__(self, state):
     """
 
     def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
-                double_precision=10, force_ascii=True, date_unit='ms',
-                default_handler=None, lines=False):
+                timedelta_format='epoch', double_precision=10,
+                force_ascii=True, date_unit='ms', default_handler=None,
+                lines=False):
         """
         Convert the object to a JSON string.
 
@@ -1109,10 +1110,19 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
               - index : dict like {index -> {column -> value}}
               - columns : dict like {column -> {index -> value}}
               - values : just the values array
+            - ``'table'`` dict like
+              ``{'schema': [schema], 'data': [data]}``
+              where the schema component is a Table Schema
+              describing the data, and the data component is
+              like ``orient='records'``.
+
+                .. versionchanged:: 0.20.0
 
         date_format : {'epoch', 'iso'}
             Type of date conversion. `epoch` = epoch milliseconds,
-            `iso`` = ISO8601, default is epoch.
+            `iso` = ISO8601. Default is epoch. When orient is
+            ``'table'``, Table Schema requires `iso`; any other
+            value raises a ``ValueError``.
         double_precision : The number of decimal places to use when encoding
             floating point values, default 10.
         force_ascii : force encoded string to be ASCII, default True.
@@ -1136,6 +1146,42 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
         -------
         same type as input object with filtered info axis
 
+        See Also
+        --------
+        pd.read_json
+
+        Examples
+        --------
+
+        >>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
+        ...                   index=['row 1', 'row 2'],
+        ...                   columns=['col 1', 'col 2'])
+        >>> df.to_json(orient='split')
+        '{"columns":["col 1","col 2"],
+          "index":["row 1","row 2"],
+          "data":[["a","b"],["c","d"]]}'
+
+        Encoding/decoding a DataFrame using ``'index'`` formatted JSON:
+
+        >>> df.to_json(orient='index')
+        '{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
+
+        Encoding/decoding a DataFrame using ``'records'`` formatted JSON.
+        Note that index labels are not preserved with this encoding.
+
+        >>> df.to_json(orient='records')
+        '[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
+
+        Encoding with Table Schema:
+
+        >>> df.to_json(orient='table', date_format='iso')
+        '{"schema": {"fields": [{"name": "index", "type": "string"},
+                                {"name": "col 1", "type": "string"},
+                                {"name": "col 2", "type": "string"}],
+                     "primaryKey": "index",
+                     "pandas_version": "0.20.0"},
+          "data": [{"index": "row 1", "col 1": "a", "col 2": "b"},
+                   {"index": "row 2", "col 1": "c", "col 2": "d"}]}'
         """
 
         from pandas.io import json
diff --git a/pandas/io/json/__init__.py b/pandas/io/json/__init__.py
@@ -1,4 +1,5 @@
 from .json import to_json, read_json, loads, dumps  # noqa
 from .normalize import json_normalize  # noqa
+from .table_schema import build_table_schema  # noqa
 
-del json, normalize  # noqa
+del json, normalize, table_schema  # noqa
diff --git a/pandas/io/json/json.py b/pandas/io/json/json.py
@@ -1,6 +1,8 @@
 # pylint: disable-msg=E1101,W0613,W0603
+from __future__ import absolute_import
 
 import os
+import json
 import numpy as np
 
 import pandas.json as _json
@@ -12,10 +14,14 @@
 from pandas.core.common import AbstractMethodError
 from pandas.formats.printing import pprint_thing
 from .normalize import _convert_to_line_delimits
+from .table_schema import build_table_schema
+from pandas.types.common import is_period_dtype
 
 loads = _json.loads
 dumps = _json.dumps
 
+TABLE_SCHEMA_VERSION = '0.20.0'
+
 
 # interface to/from
 def to_json(path_or_buf, obj, orient=None, date_format='epoch',
@@ -26,19 +32,22 @@ def to_json(path_or_buf, obj, orient=None, date_format='epoch',
         raise ValueError(
             "'lines' keyword only valid when 'orient' is records")
 
-    if isinstance(obj, Series):
-        s = SeriesWriter(
-            obj, orient=orient, date_format=date_format,
-            double_precision=double_precision, ensure_ascii=force_ascii,
-            date_unit=date_unit, default_handler=default_handler).write()
+    if orient == 'table' and isinstance(obj, Series):
+        obj = obj.to_frame(name=obj.name or 'values')
+    if orient == 'table' and isinstance(obj, DataFrame):
+        writer = JSONTableWriter
+    elif isinstance(obj, Series):
+        writer = SeriesWriter
     elif isinstance(obj, DataFrame):
-        s = FrameWriter(
-            obj, orient=orient, date_format=date_format,
-            double_precision=double_precision, ensure_ascii=force_ascii,
-            date_unit=date_unit, default_handler=default_handler).write()
+        writer = FrameWriter
     else:
         raise NotImplementedError("'obj' should be a Series or a DataFrame")
 
+    s = writer(
+        obj, orient=orient, date_format=date_format,
+        double_precision=double_precision, ensure_ascii=force_ascii,
+        date_unit=date_unit, default_handler=default_handler).write()
+
     if lines:
         s = _convert_to_line_delimits(s)
 
@@ -81,7 +90,8 @@ def write(self):
             ensure_ascii=self.ensure_ascii,
             date_unit=self.date_unit,
             iso_dates=self.date_format == 'iso',
-            default_handler=self.default_handler)
+            default_handler=self.default_handler
+        )
 
 
 class SeriesWriter(Writer):
@@ -108,6 +118,49 @@ def _format_axes(self):
                              "'%s'." % self.orient)
 
 
+class JSONTableWriter(FrameWriter):
+    _default_orient = 'records'
+
+    def __init__(self, obj, orient, date_format, double_precision,
+                 ensure_ascii, date_unit, default_handler=None):
+        """
+        Adds a `schema` attribute with the Table Schema, resets
+        the index (can't do in caller, because the schema inference needs
+        to know what the index is), forces orient to records, and forces
+        date_format to 'iso'.
+        """
+        super(JSONTableWriter, self).__init__(
+            obj, orient, date_format, double_precision, ensure_ascii,
+            date_unit, default_handler=default_handler)
+
+        if date_format != 'iso':
+            msg = ("Trying to write with `orient='table'` and "
+                   "`date_format='%s'`. Table Schema requires dates "
+                   "to be formatted with `date_format='iso'`" % date_format)
+            raise ValueError(msg)
+
+        self.schema = build_table_schema(obj)
+        # TODO: Do this timedelta properly in objToJSON.c
+        # See GH #15137
+        obj = obj.copy()
+        timedeltas = obj.select_dtypes(include=['timedelta']).columns
+        obj[timedeltas] = obj[timedeltas].applymap(
+            lambda x: x.isoformat())
+        # Convert PeriodIndex to datetimes before serializing
+        if is_period_dtype(obj.index):
+            obj.index = obj.index.to_timestamp()
+
+        self.obj = obj.reset_index()
+        self.date_format = 'iso'
+        self.orient = 'records'
+
+    def write(self):
+        data = super(JSONTableWriter, self).write()
+        serialized = '{{"schema": {}, "data": {}}}'.format(
+            json.dumps(self.schema, sort_keys=False), data)
+        return serialized
+
+
 def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
               convert_axes=True, convert_dates=True, keep_default_dates=True,
               numpy=False, precise_float=False, date_unit=None, encoding=None,
@@ -244,6 +297,7 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
       col 1 col 2
     0     a     b
     1     c     d
+
     """
 
     filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf,
diff --git a/pandas/io/json/table_schema.py b/pandas/io/json/table_schema.py
diff --git a/pandas/io/tests/json/test_json_table_schema.py b/pandas/io/tests/json/test_json_table_schema.py