Commit 45cc4ad

committed
Merge remote-tracking branch 'upstream/master' into IO_formatting
2 parents ba1b185 + 07ac39e commit 45cc4ad

34 files changed

+1776
-266
lines changed

ci/requirements-2.7.pip

+2
@@ -4,3 +4,5 @@ pathlib
44
backports.lzma
55
py
66
PyCrypto
7+
mock
8+
ipython

ci/requirements-3.5.run

+1
@@ -18,3 +18,4 @@ pymysql
1818
psycopg2
1919
s3fs
2020
beautifulsoup4
21+
ipython

ci/requirements-3.6.run

+1
@@ -18,3 +18,4 @@ pymysql
1818
beautifulsoup4
1919
s3fs
2020
xarray
21+
ipython

doc/source/api.rst

+1
@@ -60,6 +60,7 @@ JSON
6060
:toctree: generated/
6161

6262
json_normalize
63+
build_table_schema
6364

6465
.. currentmodule:: pandas
6566

doc/source/comparison_with_r.rst

-8
@@ -206,14 +206,6 @@ of its first argument in its second:
206206
s <- 0:4
207207
match(s, c(2,4))
208208
209-
The :meth:`~pandas.core.groupby.GroupBy.apply` method can be used to replicate
210-
this:
211-
212-
.. ipython:: python
213-
214-
s = pd.Series(np.arange(5),dtype=np.float32)
215-
pd.Series(pd.match(s,[2,4],np.nan))
216-
217209
For more details and examples see :ref:`the reshaping documentation
218210
<indexing.basics.indexing_isin>`.
219211

doc/source/io.rst

+120
@@ -2033,6 +2033,126 @@ using Hadoop or Spark.
20332033
df
20342034
df.to_json(orient='records', lines=True)
20352035
2036+
2037+
.. _io.table_schema:
2038+
2039+
Table Schema
2040+
''''''''''''
2041+
2042+
.. versionadded:: 0.20.0
2043+
2044+
`Table Schema`_ is a spec for describing tabular datasets as a JSON
2045+
object. The JSON includes information on the field names, types, and
2046+
other attributes. You can use the orient ``table`` to build
2047+
a JSON string with two fields, ``schema`` and ``data``.
2048+
2049+
.. ipython:: python
2050+
2051+
   df = pd.DataFrame(
2052+
       {'A': [1, 2, 3],
2053+
        'B': ['a', 'b', 'c'],
2054+
        'C': pd.date_range('2016-01-01', freq='d', periods=3),
2055+
        }, index=pd.Index(range(3), name='idx'))
2056+
   df
2057+
   df.to_json(orient='table', date_format="iso")
2058+
2059+
The ``schema`` field contains the ``fields`` key, which itself contains
2060+
a list of column name to type pairs, including the ``Index`` or ``MultiIndex``
2061+
(see below for a list of types).
2062+
The ``schema`` field also contains a ``primaryKey`` field if the (Multi)index
2063+
is unique.
2064+
2065+
The second field, ``data``, contains the serialized data with the ``records``
2066+
orient.
2067+
The index is included, and any datetimes are ISO 8601 formatted, as required
2068+
by the Table Schema spec.
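As a minimal sketch of the structure described above (not part of the committed docs), the JSON string can be parsed back with the standard ``json`` module to inspect the two top-level fields. The variable names are illustrative, and the frame omits the datetime column to keep the output easy to check by eye.

```python
import json

import pandas as pd

# Small frame with a named, unique index (no datetime column, for simplicity).
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": ["a", "b", "c"]},
    index=pd.Index(range(3), name="idx"),
)

# Serialize with the ``table`` orient, then parse the string back into a dict.
payload = json.loads(df.to_json(orient="table"))

top_level_keys = sorted(payload)  # the two top-level fields
field_names = [f["name"] for f in payload["schema"]["fields"]]  # index first
primary_key = payload["schema"].get("primaryKey")  # present: index is unique
first_record = payload["data"][0]  # records orient, index included

print(top_level_keys, field_names, primary_key, first_record)
```

Parsing the string rather than reading it by eye makes it easier to confirm that the index appears both as a schema field and as a key in each record.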
2069+
2070+
The full list of supported types is described in the Table Schema
2071+
spec. This table shows the mapping from pandas types:
2072+
2073+
=============== =================
2074+
Pandas type     Table Schema type
2075+
=============== =================
2076+
int64           integer
2077+
float64         number
2078+
bool            boolean
2079+
datetime64[ns]  datetime
2080+
timedelta64[ns] duration
2081+
categorical     any
2082+
object          str
2083+
=============== =================
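The mapping for the numeric, boolean, datetime, and timedelta rows can be checked with ``build_table_schema``. This is a sketch with illustrative column names, passing ``index=False`` so that only the columns appear in the schema. The object and categorical rows are left out to keep the example short.

```python
import pandas as pd
from pandas.io.json import build_table_schema

# One column per pandas dtype of interest; the column names are illustrative.
df = pd.DataFrame(
    {
        "ints": pd.Series([1, 2], dtype="int64"),
        "floats": pd.Series([1.5, 2.5], dtype="float64"),
        "bools": pd.Series([True, False], dtype="bool"),
        "dates": pd.Series(pd.to_datetime(["2016-01-01", "2016-01-02"])),
        "deltas": pd.Series(pd.to_timedelta(["1 days", "2 days"])),
    }
)

# index=False leaves the index out of the schema's ``fields`` list.
schema = build_table_schema(df, index=False)
type_by_column = {f["name"]: f["type"] for f in schema["fields"]}

print(type_by_column)
```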
2084+
2085+
A few notes on the generated table schema:
2086+
2087+
- The ``schema`` object contains a ``pandas_version`` field. This contains
2088+
the version of pandas' dialect of the schema, and will be incremented
2089+
with each revision.
2090+
- All dates are converted to UTC when serializing, even timezone-naïve values,
2091+
  which are treated as UTC with an offset of 0.
2092+
2093+
.. ipython:: python
2094+
2095+
   from pandas.io.json import build_table_schema
2096+
   s = pd.Series(pd.date_range('2016', periods=4))
2097+
   build_table_schema(s)
2098+
2099+
- datetimes with a timezone (before serializing), include an additional field
2100+
``tz`` with the time zone name (e.g. ``'US/Central'``).
2101+
2102+
.. ipython:: python
2103+
2104+
   s_tz = pd.Series(pd.date_range('2016', periods=12,
2105+
                                  tz='US/Central'))
2106+
   build_table_schema(s_tz)
2107+
2108+
- Periods are converted to timestamps before serialization, and so have the
2109+
  same behavior of being converted to UTC. In addition, periods will contain
2110+
  an additional field ``freq`` with the period's frequency, e.g. ``'A-DEC'``.
2111+
2112+
.. ipython:: python
2113+
2114+
   s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',
2115+
                                              periods=4))
2116+
   build_table_schema(s_per)
2117+
2118+
- Categoricals use the ``any`` type and an ``enum`` constraint listing
2119+
  the set of possible values. Additionally, an ``ordered`` field is included:
2120+
2121+
.. ipython:: python
2122+
2123+
   s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))
2124+
   build_table_schema(s_cat)
2125+
2126+
- A ``primaryKey`` field, containing an array of labels, is included
2127+
*if the index is unique*:
2128+
2129+
.. ipython:: python
2130+
2131+
   s_dupe = pd.Series([1, 2], index=[1, 1])
2132+
   build_table_schema(s_dupe)
2133+
2134+
- The ``primaryKey`` behavior is the same with MultiIndexes, but in this
2135+
case the ``primaryKey`` is an array:
2136+
2137+
.. ipython:: python
2138+
2139+
   s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),
2140+
                                                            (0, 1)]))
2141+
   build_table_schema(s_multi)
2142+
2143+
- The default naming roughly follows these rules:
2144+
2145+
+ For series, the ``object.name`` is used. If that's none, then the
2146+
name is ``values``
2147+
+ For DataFrames, the stringified version of the column name is used
2148+
+ For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
2149+
fallback to ``index`` if that is None.
2150+
+ For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
2151+
then ``level_<i>`` is used.
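The naming rules listed above for Series, ``Index``, and ``MultiIndex`` can be illustrated with a short sketch. The objects and variable names here are hypothetical and not taken from the commit.

```python
import pandas as pd
from pandas.io.json import build_table_schema


def schema_field_names(obj):
    """Return the field names build_table_schema assigns to ``obj``."""
    return [field["name"] for field in build_table_schema(obj)["fields"]]


# Unnamed Series with an unnamed flat index: both fall back to defaults.
unnamed = pd.Series([1, 2])
unnamed_names = schema_field_names(unnamed)

# Named Series with a named index: the given names are used as-is.
named = pd.Series([1, 2], name="x", index=pd.Index([10, 20], name="k"))
named_names = schema_field_names(named)

# Unnamed MultiIndex levels fall back to level_<i>.
multi = pd.Series(1, index=pd.MultiIndex.from_product([("a", "b"), (0, 1)]))
multi_names = schema_field_names(multi)

print(unnamed_names, named_names, multi_names)
```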
2152+
2153+
2154+
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
2155+
20362156
HTML
20372157
----
20382158
