Commit e3251da

WillAyd authored and jreback committed
read_json support for orient="table" (#19039)
1 parent e6ea00c commit e3251da

5 files changed: +331 -22 lines changed

doc/source/io.rst

+40-3
@@ -1648,7 +1648,7 @@ with optional parameters:
16481648

16491649
DataFrame
16501650
- default is ``columns``
1651-
- allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``}
1651+
- allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``, ``table``}
16521652

16531653
The format of the JSON string
16541654

@@ -1732,6 +1732,9 @@ values, index and columns. Name is also included for ``Series``:
17321732
dfjo.to_json(orient="split")
17331733
sjo.to_json(orient="split")
17341734
1735+
**Table oriented** serializes to the JSON `Table Schema`_, allowing for the
1736+
preservation of metadata including but not limited to dtypes and index names.
1737+
17351738
.. note::
17361739

17371740
Any orient option that encodes to a JSON object will not preserve the ordering of
@@ -1833,7 +1836,7 @@ is ``None``. To explicitly force ``Series`` parsing, pass ``typ=series``
18331836

18341837
DataFrame
18351838
- default is ``columns``
1836-
- allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``}
1839+
- allowed values are {``split``, ``records``, ``index``, ``columns``, ``values``, ``table``}
18371840

18381841
The format of the JSON string
18391842

@@ -1846,6 +1849,8 @@ is ``None``. To explicitly force ``Series`` parsing, pass ``typ=series``
18461849
``index``; dict like {index -> {column -> value}}
18471850
``columns``; dict like {column -> {index -> value}}
18481851
``values``; just the values array
1852+
``table``; adhering to the JSON `Table Schema`_
1853+
18491854

18501855
- ``dtype`` : if True, infer dtypes, if a dict of column to dtype, then use those, if False, then don't infer dtypes at all, default is True, apply only to the data
18511856
- ``convert_axes`` : boolean, try to convert the axes to the proper dtypes, default is True
@@ -2202,7 +2207,39 @@ A few notes on the generated table schema:
22022207
then ``level_<i>`` is used.
22032208

22042209

2205-
_Table Schema: http://specs.frictionlessdata.io/json-table-schema/
2210+
.. versionadded:: 0.23.0
2211+
2212+
``read_json`` also accepts ``orient='table'`` as an argument. This allows for
2213+
the preservation of metadata such as dtypes and index names in a
2214+
round-trippable manner.
2215+
2216+
.. ipython:: python
2217+
2218+
df = pd.DataFrame({'foo': [1, 2, 3, 4],
2219+
'bar': ['a', 'b', 'c', 'd'],
2220+
'baz': pd.date_range('2018-01-01', freq='d', periods=4),
2221+
'qux': pd.Categorical(['a', 'b', 'c', 'c'])
2222+
}, index=pd.Index(range(4), name='idx'))
2223+
df
2224+
df.dtypes
2225+
2226+
df.to_json('test.json', orient='table')
2227+
new_df = pd.read_json('test.json', orient='table')
2228+
new_df
2229+
new_df.dtypes
2230+
2231+
Please note that the literal string ``index`` is not supported as an index
2232+
name with the round trip format, as ``to_json`` uses it by default to
2233+
indicate a missing index name.
2234+
2235+
.. ipython:: python
2236+
2237+
df.index.name = 'index'
2238+
df.to_json('test.json', orient='table')
2239+
new_df = pd.read_json('test.json', orient='table')
2240+
print(new_df.index.name)
2241+
2242+
.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
22062243

22072244
HTML
22082245
----
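
As a quick check of the round trip documented in doc/source/io.rst above, the sketch below keeps the JSON in memory instead of writing ``test.json``. It assumes a pandas version that includes this commit (0.23.0 or later). The ``StringIO`` wrapper and the reduced two-column frame are choices made for this illustration, not part of the commit.

```python
from io import StringIO

import pandas as pd

# A small frame with a named index and a categorical column.
df = pd.DataFrame(
    {"foo": [1, 2, 3, 4], "qux": pd.Categorical(["a", "b", "c", "c"])},
    index=pd.Index(range(4), name="idx"),
)

# Serialize to a Table Schema string rather than a file on disk.
payload = df.to_json(orient="table")

# Wrap the string in StringIO so read_json treats it as a buffer.
new_df = pd.read_json(StringIO(payload), orient="table")

print(new_df.index.name)
print(new_df.dtypes)
```

The index name and the categorical dtype are the two pieces of metadata that the other ``orient`` values do not carry through a write and read cycle.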

doc/source/whatsnew/v0.23.0.txt

+31
@@ -145,6 +145,37 @@ Current Behavior
145145

146146
s.rank(na_option='top')
147147

148+
.. _whatsnew_0230.enhancements.round-trippable_json:
149+
150+
JSON read/write round-trippable with ``orient='table'``
151+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
152+
153+
A ``DataFrame`` can now be written to and subsequently read back via JSON while preserving metadata through usage of the ``orient='table'`` argument (see :issue:`18912` and :issue:`9146`). Previously, none of the available ``orient`` values guaranteed the preservation of dtypes and index names, amongst other metadata.
154+
155+
.. ipython:: python
156+
157+
df = pd.DataFrame({'foo': [1, 2, 3, 4],
158+
'bar': ['a', 'b', 'c', 'd'],
159+
'baz': pd.date_range('2018-01-01', freq='d', periods=4),
160+
'qux': pd.Categorical(['a', 'b', 'c', 'c'])
161+
}, index=pd.Index(range(4), name='idx'))
162+
df
163+
df.dtypes
164+
df.to_json('test.json', orient='table')
165+
new_df = pd.read_json('test.json', orient='table')
166+
new_df
167+
new_df.dtypes
168+
169+
Please note that the literal string ``index`` is not supported as an index name with the round trip format, as ``to_json`` uses it by default to indicate a missing index name.
170+
171+
.. ipython:: python
172+
173+
df.index.name = 'index'
174+
df.to_json('test.json', orient='table')
175+
new_df = pd.read_json('test.json', orient='table')
176+
new_df
177+
print(new_df.index.name)
178+
148179
.. _whatsnew_0230.enhancements.other:
149180

150181
Other Enhancements

pandas/io/json/json.py

+17-2
@@ -16,7 +16,7 @@
1616
from pandas.core.reshape.concat import concat
1717
from pandas.io.formats.printing import pprint_thing
1818
from .normalize import _convert_to_line_delimits
19-
from .table_schema import build_table_schema
19+
from .table_schema import build_table_schema, parse_table_schema
2020
from pandas.core.dtypes.common import is_period_dtype
2121

2222
loads = json.loads
@@ -261,13 +261,16 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
261261
* when ``typ == 'frame'``,
262262
263263
- allowed orients are ``{'split','records','index',
264-
'columns','values'}``
264+
'columns','values', 'table'}``
265265
- default is ``'columns'``
266266
- The DataFrame index must be unique for orients ``'index'`` and
267267
``'columns'``.
268268
- The DataFrame columns must be unique for orients ``'index'``,
269269
``'columns'``, and ``'records'``.
270270
271+
.. versionadded:: 0.23.0
272+
'table' as an allowed value for the ``orient`` argument
273+
271274
typ : type of object to recover (series or frame), default 'frame'
272275
dtype : boolean or dict, default True
273276
If True, infer dtypes, if a dict of column to dtype, then use those,
@@ -336,6 +339,15 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
336339
-------
337340
result : Series or DataFrame, depending on the value of `typ`.
338341
342+
Notes
343+
-----
344+
Specific to ``orient='table'``, if a ``DataFrame`` with a literal ``Index``
345+
name of `index` gets written with ``to_json``, the subsequent read
346+
operation will incorrectly set the ``Index`` name to ``None``. This is
347+
because `index` is also used by ``to_json`` to denote a missing
348+
``Index`` name, and the subsequent ``read_json`` operation cannot
349+
distinguish between the two.
350+
339351
See Also
340352
--------
341353
DataFrame.to_json
@@ -839,6 +851,9 @@ def _parse_no_numpy(self):
839851
elif orient == "index":
840852
self.obj = DataFrame(
841853
loads(json, precise_float=self.precise_float), dtype=None).T
854+
elif orient == 'table':
855+
self.obj = parse_table_schema(json,
856+
precise_float=self.precise_float)
842857
else:
843858
self.obj = DataFrame(
844859
loads(json, precise_float=self.precise_float), dtype=None)
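
The limitation described in the new Notes section of ``read_json`` can be illustrated without pandas. In the sketch below, ``default_index_name`` and ``restore_index_name`` are hypothetical stand-ins for the writer-side and reader-side handling of the index name; they are not functions in pandas.

```python
def default_index_name(name):
    """Hypothetical writer side: an unnamed index is stored as 'index'."""
    return "index" if name is None else name


def restore_index_name(name):
    """Hypothetical reader side: 'index' is read back as a missing name."""
    return None if name == "index" else name


# Round-trip three index names through the two helpers.
for original in (None, "idx", "index"):
    restored = restore_index_name(default_index_name(original))
    print(repr(original), "->", repr(restored))
# None -> None
# 'idx' -> 'idx'
# 'index' -> None
```

Because ``None`` and ``'index'`` both serialize to the same string, the reader has no way to tell them apart, which is why the literal name ``'index'`` does not survive the round trip.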

pandas/io/json/table_schema.py

+135-5
@@ -3,13 +3,18 @@
33
44
http://specs.frictionlessdata.io/json-table-schema/
55
"""
6+
import pandas._libs.json as json
7+
from pandas import DataFrame
8+
from pandas.api.types import CategoricalDtype
69
from pandas.core.common import _all_not_none
710
from pandas.core.dtypes.common import (
811
is_integer_dtype, is_timedelta64_dtype, is_numeric_dtype,
912
is_bool_dtype, is_datetime64_dtype, is_datetime64tz_dtype,
1013
is_categorical_dtype, is_period_dtype, is_string_dtype
1114
)
1215

16+
loads = json.loads
17+
1318

1419
def as_json_table_type(x):
1520
"""
@@ -75,7 +80,7 @@ def set_default_names(data):
7580
return data
7681

7782

78-
def make_field(arr, dtype=None):
83+
def convert_pandas_type_to_json_field(arr, dtype=None):
7984
dtype = dtype or arr.dtype
8085
if arr.name is None:
8186
name = 'values'
@@ -103,6 +108,69 @@ def make_field(arr, dtype=None):
103108
return field
104109

105110

111+
def convert_json_field_to_pandas_type(field):
112+
"""
113+
Converts a JSON field descriptor into its corresponding NumPy / pandas type
114+
115+
Parameters
116+
----------
117+
field
118+
A JSON field descriptor
119+
120+
Returns
121+
-------
122+
dtype
123+
124+
Raises
125+
------
126+
ValueError
127+
If the type of the provided field is unknown or currently unsupported
128+
129+
Examples
130+
--------
131+
>>> convert_json_field_to_pandas_type({'name': 'an_int',
132+
'type': 'integer'})
133+
'int64'
134+
>>> convert_json_field_to_pandas_type({'name': 'a_categorical',
135+
'type': 'any',
136+
'constraints': {'enum': [
137+
'a', 'b', 'c']},
138+
'ordered': True})
139+
CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)
140+
>>> convert_json_field_to_pandas_type({'name': 'a_datetime',
141+
'type': 'datetime'})
142+
'datetime64[ns]'
143+
>>> convert_json_field_to_pandas_type({'name': 'a_datetime_with_tz',
144+
'type': 'datetime',
145+
'tz': 'US/Central'})
146+
'datetime64[ns, US/Central]'
147+
"""
148+
typ = field['type']
149+
if typ == 'string':
150+
return 'object'
151+
elif typ == 'integer':
152+
return 'int64'
153+
elif typ == 'number':
154+
return 'float64'
155+
elif typ == 'boolean':
156+
return 'bool'
157+
elif typ == 'duration':
158+
return 'timedelta64'
159+
elif typ == 'datetime':
160+
if field.get('tz'):
161+
return 'datetime64[ns, {tz}]'.format(tz=field['tz'])
162+
else:
163+
return 'datetime64[ns]'
164+
elif typ == 'any':
165+
if 'constraints' in field and 'ordered' in field:
166+
return CategoricalDtype(categories=field['constraints']['enum'],
167+
ordered=field['ordered'])
168+
else:
169+
return 'object'
170+
171+
raise ValueError("Unsupported or invalid field type: {}".format(typ))
172+
173+
106174
def build_table_schema(data, index=True, primary_key=None, version=True):
107175
"""
108176
Create a Table schema from ``data``.
@@ -158,15 +226,15 @@ def build_table_schema(data, index=True, primary_key=None, version=True):
158226
if index:
159227
if data.index.nlevels > 1:
160228
for level in data.index.levels:
161-
fields.append(make_field(level))
229+
fields.append(convert_pandas_type_to_json_field(level))
162230
else:
163-
fields.append(make_field(data.index))
231+
fields.append(convert_pandas_type_to_json_field(data.index))
164232

165233
if data.ndim > 1:
166234
for column, s in data.iteritems():
167-
fields.append(make_field(s))
235+
fields.append(convert_pandas_type_to_json_field(s))
168236
else:
169-
fields.append(make_field(data))
237+
fields.append(convert_pandas_type_to_json_field(data))
170238

171239
schema['fields'] = fields
172240
if index and data.index.is_unique and primary_key is None:
@@ -180,3 +248,65 @@ def build_table_schema(data, index=True, primary_key=None, version=True):
180248
if version:
181249
schema['pandas_version'] = '0.20.0'
182250
return schema
251+
252+
253+
def parse_table_schema(json, precise_float):
254+
"""
255+
Builds a DataFrame from a given schema
256+
257+
Parameters
258+
----------
259+
json :
260+
A JSON table schema
261+
precise_float : boolean
262+
Flag controlling precision when decoding string to double values, as
263+
dictated by ``read_json``
264+
265+
Returns
266+
-------
267+
df : DataFrame
268+
269+
Raises
270+
------
271+
NotImplementedError
272+
If the JSON table schema contains either timezone or timedelta data
273+
274+
Notes
275+
-----
276+
Because ``to_json`` uses the string `index` to denote a name-less
277+
``Index``, this function sets the index name of the returned ``DataFrame`` to
278+
``None`` when said string is encountered. Therefore, intentional usage
279+
of `index` as the ``Index`` name is not supported.
280+
281+
See Also
282+
--------
283+
build_table_schema : inverse function
284+
pandas.read_json
285+
"""
286+
table = loads(json, precise_float=precise_float)
287+
col_order = [field['name'] for field in table['schema']['fields']]
288+
df = DataFrame(table['data'])[col_order]
289+
290+
dtypes = {field['name']: convert_json_field_to_pandas_type(field)
291+
for field in table['schema']['fields']}
292+
293+
# Cannot directly use astype with timezone data on object; raise for now
294+
if any(str(x).startswith('datetime64[ns, ') for x in dtypes.values()):
295+
raise NotImplementedError('orient="table" can not yet read timezone '
296+
'data')
297+
298+
# No ISO constructor for Timedelta as of yet, so need to raise
299+
if 'timedelta64' in dtypes.values():
300+
raise NotImplementedError('orient="table" can not yet read '
301+
'ISO-formatted Timedelta data')
302+
303+
df = df.astype(dtypes)
304+
305+
df = df.set_index(table['schema']['primaryKey'])
306+
if len(df.index.names) == 1 and df.index.name == 'index':
307+
df.index.name = None
308+
else:
309+
if all(x.startswith('level_') for x in df.index.names):
310+
df.index.names = [None] * len(df.index.names)
311+
312+
return df
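
For orientation, the sketch below builds a small payload by hand in the shape that ``parse_table_schema`` reads (``schema.fields``, ``schema.primaryKey`` and ``data``) and repeats the column-ordering step using only the standard library. The payload values are made up for illustration, and the standard ``json`` module stands in for ``pandas._libs.json``.

```python
import json

# Hand-written payload; the field names and values are illustrative only.
payload = json.dumps({
    "schema": {
        "fields": [
            {"name": "idx", "type": "integer"},
            {"name": "foo", "type": "integer"},
            {"name": "bar", "type": "string"},
        ],
        "primaryKey": ["idx"],
    },
    "data": [
        {"idx": 0, "foo": 1, "bar": "a"},
        {"idx": 1, "foo": 2, "bar": "b"},
    ],
})

table = json.loads(payload)

# Column order comes from the schema, not from the key order of each record.
col_order = [field["name"] for field in table["schema"]["fields"]]

# Reorder every record to follow the schema's field order.
rows = [[record[name] for name in col_order] for record in table["data"]]

print(col_order)
print(rows)
print(table["schema"]["primaryKey"])
```

Taking the order from ``schema.fields`` matters because a JSON object does not guarantee key order, so the records alone cannot be relied on to reproduce the original column layout.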
