@@ -2033,6 +2033,129 @@ using Hadoop or Spark.
df
df.to_json(orient='records', lines=True)
+
+ Table Schema
+ ''''''''''''
+
+ `Table Schema`_ is a spec for describing tabular datasets as a JSON
+ object. The JSON includes information on the field names, types, and
+ other attributes. You can use the orient ``'table'`` to build
+ a JSON string with two fields, ``schema`` and ``data``.
+
+ .. ipython:: python
+
+    df = pd.DataFrame(
+        {'A': [1, 2, 3],
+         'B': ['a', 'b', 'c'],
+         'C': pd.date_range('2016-01-01', freq='d', periods=3),
+         }, index=pd.Index(range(3), name='idx'))
+    df
+    df.to_json(orient='table', date_format="iso")
+
+ The ``schema`` field contains the ``fields`` key, which itself contains
+ a list of column name to type pairs, including the ``Index`` or ``MultiIndex``
+ (see below for a list of types).
+ The ``schema`` field also contains a ``primaryKey`` field if the (Multi)index
+ is unique.
+
+ The second field, ``data``, contains the serialized data with the ``records``
+ orient.
+ The index is included, and any datetimes are ISO 8601 formatted, as required
+ by the Table Schema spec. You must pass the ``date_format='iso'`` parameter.
+
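As a rough illustration of the two fields described above, the following sketch parses the output of ``orient='table'`` with the standard ``json`` module and inspects it. The frame and variable names are made up for this example, and the ``primaryKey`` lookup assumes a recent pandas release.

```python
import json

import pandas as pd

# Hypothetical frame with a named, unique index.
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["a", "b", "c"]},
    index=pd.Index(range(3), name="idx"),
)

# Parse the JSON string back into plain Python objects.
payload = json.loads(df.to_json(orient="table"))
schema, records = payload["schema"], payload["data"]

# Each entry of schema["fields"] describes one column or index level.
field_types = {field["name"]: field["type"] for field in schema["fields"]}

# The index is unique here, so a primary key is expected to be present.
primary_key = schema.get("primaryKey")

print(field_types)
print(primary_key)
print(records[0])
```

Each record in ``data`` carries the index value alongside the column values, which is why ``idx`` shows up as an ordinary key.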
+ The full list of supported types is described in the Table Schema
+ spec. This table shows the mapping from pandas types:
+
+ =============== =================
+ Pandas type     Table Schema type
+ =============== =================
+ int64           integer
+ float64         number
+ bool            boolean
+ datetime64[ns]  datetime
+ timedelta64[ns] duration
+ categorical     any
+ object          string
+ =============== =================
+
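One way to check the mapping on an installed pandas is to build a frame with one column per dtype and read the types back from the generated schema. This is only a sketch: the column names are invented, and it assumes ``build_table_schema`` can be imported from ``pandas.io.json``.

```python
import pandas as pd
from pandas.io.json import build_table_schema

# One hypothetical column per pandas dtype of interest.
df = pd.DataFrame(
    {
        "ints": pd.Series([1, 2], dtype="int64"),
        "floats": pd.Series([0.5, 1.5], dtype="float64"),
        "bools": pd.Series([True, False], dtype="bool"),
        "stamps": pd.Series(pd.date_range("2016-01-01", periods=2)),
        "deltas": pd.Series(pd.to_timedelta([1, 2], unit="D")),
        "cats": pd.Series(pd.Categorical(["a", "b"])),
    }
)

# index=False keeps the index out of the schema so only columns are listed.
schema = build_table_schema(df, index=False)
type_by_column = {field["name"]: field["type"] for field in schema["fields"]}

for column, table_type in type_by_column.items():
    print(f"{df[column].dtype!s:>16} -> {table_type}")
```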
+ A few notes on the generated table schema:
+
+ .. ipython:: python
+    :suppress:
+
+    from pandas.io.json import build_table_schema
+
+ .. warning::
+
+    The code examples below use a function, ``build_table_schema``, that is
+    not yet part of the public API. Let us know if you think it'd be useful.
+
+ - The ``schema`` object contains a ``pandas_version`` field. This contains
+   the version of pandas' dialect of the schema, and will be incremented
+   with each revision.
+ - All dates are converted to UTC when serializing, even timezone-naive values,
+   which are treated as UTC with an offset of 0.
+
+   .. ipython:: python
+
+      s = pd.Series(pd.date_range('2016', periods=4))
+      build_table_schema(s)
+
+ - Datetimes with a timezone (before serializing) include an additional field,
+   ``tz``, with the time zone name (e.g. ``'US/Central'``).
+
+   .. ipython:: python
+
+      s_tz = pd.Series(pd.date_range('2016', periods=12,
+                                     tz='US/Central'))
+      build_table_schema(s_tz)
+
+ - Periods are converted to timestamps before serialization, and so have the
+   same behavior of being converted to UTC. In addition, periods will contain
+   an additional field, ``freq``, with the period's frequency, e.g. ``'A-DEC'``.
+
+   .. ipython:: python
+
+      s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',
+                                                 periods=4))
+      build_table_schema(s_per)
+
+ - Categoricals use the ``any`` type and an ``enum`` constraint listing
+   the set of possible values. Additionally, an ``ordered`` field is included:
+
+   .. ipython:: python
+
+      s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))
+      build_table_schema(s_cat)
+
+ - A ``primaryKey`` field is included *if the index is unique*. The index
+   below has duplicate labels, so the field is omitted:
+
+   .. ipython:: python
+
+      s_dupe = pd.Series([1, 2], index=[1, 1])
+      build_table_schema(s_dupe)
+
+ - The ``primaryKey`` behavior is the same with MultiIndexes, but in this
+   case the ``primaryKey`` is an array:
+
+   .. ipython:: python
+
+      s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),
+                                                               (0, 1)]))
+      build_table_schema(s_multi)
+
+ - The default naming roughly follows these rules:
+
+   + For series, the ``object.name`` is used. If that's None, then the
+     name is ``values``.
+   + For DataFrames, the stringified version of the column name is used.
+   + For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
+     fallback to ``index`` if that is None.
+   + For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
+     then ``level_<i>`` is used.
+
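The extra per-field keys mentioned in the notes above (``tz`` for timezone-aware datetimes, ``constraints`` and ``ordered`` for categoricals) can be inspected directly. The helper ``field_for`` below is hypothetical and exists only for this sketch; it assumes ``build_table_schema`` is importable from ``pandas.io.json``.

```python
import pandas as pd
from pandas.io.json import build_table_schema


def field_for(series):
    """Return the schema entry for the values of ``series`` (index excluded)."""
    return build_table_schema(series, index=False)["fields"][0]


# Timezone-aware datetimes carry the zone name in an extra "tz" key.
utc_field = field_for(pd.Series(pd.date_range("2016-01-01", periods=2, tz="UTC")))

# Categoricals list their categories under constraints["enum"].
cat_field = field_for(pd.Series(pd.Categorical(["a", "b", "a"])))

print(utc_field)
print(cat_field)
```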
+
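The naming fallbacks and the primary-key rule can be checked in the same way. This is a sketch under the same assumption about the import path; the helper ``field_names`` and the sample objects are invented for illustration, and the key name ``primaryKey`` reflects recent pandas releases.

```python
import pandas as pd
from pandas.io.json import build_table_schema


def field_names(obj):
    """List the field names of the generated schema, index levels first."""
    return [field["name"] for field in build_table_schema(obj)["fields"]]


unnamed = pd.Series([1, 2])
named = pd.Series([1, 2], name="x", index=pd.Index([10, 20], name="key"))
multi = pd.Series(1, index=pd.MultiIndex.from_product([("a", "b"), (0, 1)]))
dupes = pd.Series([1, 2], index=[1, 1])

print(field_names(unnamed))  # unnamed index and values fall back to defaults
print(field_names(named))    # explicit names are kept
print(field_names(multi))    # unnamed levels are numbered

# A primary key is only emitted when the index labels are unique.
unique_has_key = "primaryKey" in build_table_schema(unnamed)
dupes_has_key = "primaryKey" in build_table_schema(dupes)
print(unique_has_key, dupes_has_key)
```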
+ .. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
+
HTML
----