@@ -2033,6 +2033,126 @@ using Hadoop or Spark.
   df
   df.to_json(orient='records', lines=True)
+
+ .. _io.table_schema:
+
+ Table Schema
+ ''''''''''''
+
+ .. versionadded:: 0.20.0
+
+ `Table Schema`_ is a spec for describing tabular datasets as a JSON
+ object. The JSON includes information on the field names, types, and
+ other attributes. You can use the orient ``table`` to build
+ a JSON string with two fields, ``schema`` and ``data``.
+
+ .. ipython:: python
+
+    df = pd.DataFrame(
+        {'A': [1, 2, 3],
+         'B': ['a', 'b', 'c'],
+         'C': pd.date_range('2016-01-01', freq='d', periods=3),
+         }, index=pd.Index(range(3), name='idx'))
+    df
+    df.to_json(orient='table', date_format="iso")
+
+ The ``schema`` field contains the ``fields`` key, which itself contains
+ a list of column name to type pairs, including the ``Index`` or ``MultiIndex``
+ (see below for a list of types).
+ The ``schema`` field also contains a ``primaryKey`` field if the (Multi)index
+ is unique.
+
+ The second field, ``data``, contains the serialized data with the ``records``
+ orient.
+ The index is included, and any datetimes are ISO 8601 formatted, as required
+ by the Table Schema spec.
+
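As a rough illustration of how a consumer might read the two fields described above, here is a minimal sketch that uses only the standard library. The payload is hand-written to match the described shape and is not captured pandas output; ``summarize_table_json`` is a hypothetical helper name, not part of pandas.

```python
import json

# Hand-written payload shaped like the ``table`` orient described above.
# It is an illustration only, not captured pandas output.
payload = json.dumps({
    "schema": {
        "fields": [
            {"name": "idx", "type": "integer"},
            {"name": "A", "type": "integer"},
            {"name": "C", "type": "datetime"},
        ],
        "primaryKey": ["idx"],
    },
    "data": [
        {"idx": 0, "A": 1, "C": "2016-01-01T00:00:00.000Z"},
        {"idx": 1, "A": 2, "C": "2016-01-02T00:00:00.000Z"},
    ],
})


def summarize_table_json(text):
    """Return (field name -> type mapping, primary key or None, row count)."""
    doc = json.loads(text)
    schema = doc["schema"]
    field_types = {f["name"]: f["type"] for f in schema["fields"]}
    # ``primaryKey`` is only expected when the index is unique,
    # so a missing key is reported as None rather than raising.
    primary_key = schema.get("primaryKey")
    return field_types, primary_key, len(doc["data"])


field_types, primary_key, n_rows = summarize_table_json(payload)
print(field_types)
print(primary_key, n_rows)
```

Reading ``primaryKey`` with ``dict.get`` keeps the sketch usable for payloads produced from a non-unique index, where that key is absent.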
+ The full list of supported types is described in the Table Schema
+ spec. This table shows the mapping from pandas types:
+
+ =============== =================
+ pandas type     Table Schema type
+ =============== =================
+ int64           integer
+ float64         number
+ bool            boolean
+ datetime64[ns]  datetime
+ timedelta64[ns] duration
+ categorical     any
+ object          string
+ =============== =================
+
+ A few notes on the generated table schema:
+
+ - The ``schema`` object contains a ``pandas_version`` field. This contains
+   the version of pandas' dialect of the schema, and will be incremented
+   with each revision.
+ - All dates are converted to UTC when serializing. Timezone-naive values
+   are also treated as UTC, with an offset of 0.
+
+   .. ipython:: python
+
+      from pandas.io.json import build_table_schema
+      s = pd.Series(pd.date_range('2016', periods=4))
+      build_table_schema(s)
+
+ - Datetimes with a timezone (before serializing) include an additional field
+   ``tz`` with the time zone name (e.g. ``'US/Central'``).
+
+   .. ipython:: python
+
+      s_tz = pd.Series(pd.date_range('2016', periods=12,
+                                     tz='US/Central'))
+      build_table_schema(s_tz)
+
+ - Periods are converted to timestamps before serialization, and so have the
+   same behavior of being converted to UTC. In addition, periods will contain
+   an additional field ``freq`` with the period's frequency, e.g. ``'A-DEC'``.
+
+   .. ipython:: python
+
+      s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',
+                                                 periods=4))
+      build_table_schema(s_per)
+
+ - Categoricals use the ``any`` type and an ``enum`` constraint listing
+   the set of possible values. Additionally, an ``ordered`` field is included:
+
+   .. ipython:: python
+
+      s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))
+      build_table_schema(s_cat)
+
+ - A ``primaryKey`` field, containing an array of labels, is included
+   *if the index is unique*. It is omitted when the index has duplicate
+   labels, as in this example:
+
+   .. ipython:: python
+
+      s_dupe = pd.Series([1, 2], index=[1, 1])
+      build_table_schema(s_dupe)
+
+ - The ``primaryKey`` behavior is the same with MultiIndexes, but in this
+   case the ``primaryKey`` is an array:
+
+   .. ipython:: python
+
+      s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),
+                                                               (0, 1)]))
+      build_table_schema(s_multi)
+
+ - The default naming roughly follows these rules:
+
+   + For series, the ``object.name`` is used. If that is ``None``, then the
+     name is ``values``.
+   + For DataFrames, the stringified version of the column name is used.
+   + For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
+     fallback to ``index`` if that is ``None``.
+   + For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
+     then ``level_<i>`` is used.
+
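The default naming rules listed above can be sketched in plain Python. This is an illustration of the stated rules only, not the code pandas runs; the three function names are hypothetical, and treating a one-element ``names`` list as a plain ``Index`` is an assumption made for the sketch.

```python
def series_field_name(name):
    """Series: use the name, falling back to ``values`` when it is None."""
    return "values" if name is None else name


def column_field_name(label):
    """DataFrame column: use the stringified column label."""
    return str(label)


def index_field_names(names):
    """Index or MultiIndex: ``names`` holds one entry per index level.

    Assumption for this sketch: a single entry means a plain Index.
    """
    if len(names) == 1:
        # Plain Index: fall back to ``index`` when it has no name.
        return ["index" if names[0] is None else names[0]]
    # MultiIndex: an unnamed level at position i becomes ``level_<i>``.
    return ["level_{}".format(i) if n is None else n
            for i, n in enumerate(names)]


print(series_field_name(None))
print(column_field_name(0))
print(index_field_names([None]))
print(index_field_names(["a", None]))
```

Keeping the fallbacks in small separate functions makes each rule easy to compare against the field names that ``build_table_schema`` reports.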
+
+ .. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
+
HTML
----