Commit 08f82a1

wesm authored and pcluo committed
ENH: Draft metadata specification doc for Apache Parquet (pandas-dev#16315)
* Draft metadata specification doc for Apache Parquet
* Tweaks, add pandas version
* Relax metadata key
* Be explicit that the metadata is file-level
* Don't hard code version
* Code reviews
* Move Parquet metadata to developer.rst, account for code reviews
* Code review comments
* Review comments
* Fix typo
1 parent 8b98cc2 commit 08f82a1

1 file changed, +117 -0 lines changed

doc/source/developer.rst

+117
@@ -16,3 +16,120 @@ Developer
*********

This section will focus on downstream applications of pandas.

.. _apache.parquet:

Storing pandas DataFrame objects in Apache Parquet format
---------------------------------------------------------

The `Apache Parquet <https://github.com/apache/parquet-format>`__ format
provides key-value metadata at the file and column level, stored in the footer
of the Parquet file:

.. code-block:: shell

   5: optional list<KeyValue> key_value_metadata

where ``KeyValue`` is

.. code-block:: shell

   struct KeyValue {
     1: required string key
     2: optional string value
   }


So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
``pandas`` metadata key in the ``FileMetaData``, with the value stored as:

.. code-block:: text

   {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
    'columns': [<c0>, <c1>, ...],
    'pandas_version': $VERSION}

Here, ``<c0>`` and so forth are dictionaries containing the metadata for each
column. This has JSON form:

.. code-block:: text

   {'name': column_name,
    'pandas_type': pandas_type,
    'numpy_type': numpy_type,
    'metadata': type_metadata}

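As a rough sketch of how a writer might assemble this value, the snippet below
builds the per-column dictionaries and serializes the result with the
standard-library ``json`` module. The helper names ``column_metadata`` and
``pandas_metadata`` are hypothetical (they are not part of pandas or of any
Parquet library), and the version string is only a placeholder. Note that
``None`` becomes ``null`` once the dictionary is serialized as JSON.

```python
import json


def column_metadata(name, pandas_type, numpy_type, metadata=None):
    """Build one per-column entry (hypothetical helper, not a pandas API)."""
    return {
        "name": name,
        "pandas_type": pandas_type,
        "numpy_type": numpy_type,
        "metadata": metadata,
    }


def pandas_metadata(index_columns, columns, pandas_version):
    """Build the file-level value stored under the ``pandas`` key."""
    return {
        "index_columns": list(index_columns),
        "columns": list(columns),
        "pandas_version": pandas_version,
    }


columns = [
    column_metadata("c0", "int8", "int8"),
    column_metadata("__index_level_0__", "int64", "int64"),
]
meta = pandas_metadata(["__index_level_0__"], columns, "0.20.0")

# KeyValue entries hold strings, so the dictionary is serialized before it
# would be attached to the file footer.
key, value = "pandas", json.dumps(meta)
```

How the ``(key, value)`` pair is actually written into ``FileMetaData`` depends
on the Parquet library in use and is not shown here.
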

``pandas_type`` is the logical type of the column, and is one of:

* Boolean: ``'bool'``
* Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
* Floats: ``'float16', 'float32', 'float64'``
* Date and Time Types: ``'datetime', 'datetimetz', 'timedelta'``
* String: ``'unicode', 'bytes'``
* Categorical: ``'categorical'``
* Other Python objects: ``'object'``

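A reader may want to reject unknown logical types early. The following is a
minimal sketch of such a check; the names ``PANDAS_TYPES`` and
``check_pandas_type`` are hypothetical and exist only for illustration, and
raising ``ValueError`` is an assumption rather than behavior required by this
specification.

```python
# Logical type names taken from the list above.
PANDAS_TYPES = frozenset({
    "bool",
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64",
    "datetime", "datetimetz", "timedelta",
    "unicode", "bytes",
    "categorical",
    "object",
})


def check_pandas_type(column):
    """Return the column's logical type, or raise if it is not listed above."""
    pandas_type = column["pandas_type"]
    if pandas_type not in PANDAS_TYPES:
        raise ValueError(f"unknown pandas_type: {pandas_type!r}")
    return pandas_type
```
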

The ``numpy_type`` is the physical storage type of the column, which is the
result of ``str(dtype)`` for the underlying NumPy array that holds the data. So
for ``datetimetz`` this is ``datetime64[ns]`` and for categorical, it may be
any of the supported integer categorical types.


The ``type_metadata`` is ``None`` except for:

* ``datetimetz``: ``{'timezone': zone, 'unit': 'ns'}``, e.g.
  ``{'timezone': 'America/New_York', 'unit': 'ns'}``. The ``'unit'`` is
  optional, and if omitted it is assumed to be nanoseconds.
* ``categorical``: ``{'num_categories': K, 'ordered': is_ordered, 'type': $TYPE}``

  * Here ``'type'`` is optional, and can be a nested pandas type
    specification (but not categorical).

* ``unicode``: ``{'encoding': encoding}``

  * The encoding is optional, and if not present is UTF-8.

* ``object``: ``{'encoding': encoding}``. Objects can be serialized and stored
  in ``BYTE_ARRAY`` Parquet columns. The encoding can be one of:

  * ``'pickle'``
  * ``'msgpack'``
  * ``'bson'``
  * ``'json'``

* ``timedelta``: ``{'unit': 'ns'}``. The ``'unit'`` is optional, and if omitted
  it is assumed to be nanoseconds. This metadata is optional altogether.

For types other than these, the ``'metadata'`` key can be
omitted. Implementations can assume ``None`` if the key is not present.

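The defaulting rules above can be sketched as a small reader-side helper. This
is an illustration only: ``normalize_type_metadata`` and ``OBJECT_ENCODINGS``
are hypothetical names, and rejecting an unlisted object encoding with
``ValueError`` is an assumption, since the specification does not say how a
reader should react to one.

```python
OBJECT_ENCODINGS = ("pickle", "msgpack", "bson", "json")


def normalize_type_metadata(column):
    """Return a column's type metadata with the documented defaults applied."""
    pandas_type = column["pandas_type"]
    # A missing 'metadata' key is treated the same as an explicit None.
    metadata = dict(column.get("metadata") or {})

    if pandas_type in ("datetimetz", "timedelta"):
        # 'unit' is optional and assumed to be nanoseconds when omitted.
        metadata.setdefault("unit", "ns")
    elif pandas_type == "unicode":
        # 'encoding' is optional and assumed to be UTF-8 when omitted.
        metadata.setdefault("encoding", "UTF-8")
    elif pandas_type == "object":
        encoding = metadata.get("encoding")
        # Assumption: an encoding outside the documented list is an error.
        if encoding is not None and encoding not in OBJECT_ENCODINGS:
            raise ValueError(f"unsupported object encoding: {encoding!r}")

    # Columns with nothing to report keep the documented None.
    return metadata or None
```

For example, a ``timedelta`` column written without any ``'metadata'`` key
normalizes to ``{'unit': 'ns'}``, while an ``int8`` column normalizes to
``None``.
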

As an example of fully-formed metadata:

.. code-block:: text

   {'index_columns': ['__index_level_0__'],
    'columns': [
        {'name': 'c0',
         'pandas_type': 'int8',
         'numpy_type': 'int8',
         'metadata': None},
        {'name': 'c1',
         'pandas_type': 'bytes',
         'numpy_type': 'object',
         'metadata': None},
        {'name': 'c2',
         'pandas_type': 'categorical',
         'numpy_type': 'int16',
         'metadata': {'num_categories': 1000, 'ordered': False}},
        {'name': 'c3',
         'pandas_type': 'datetimetz',
         'numpy_type': 'datetime64[ns]',
         'metadata': {'timezone': 'America/Los_Angeles'}},
        {'name': 'c4',
         'pandas_type': 'object',
         'numpy_type': 'object',
         'metadata': {'encoding': 'pickle'}},
        {'name': '__index_level_0__',
         'pandas_type': 'int64',
         'numpy_type': 'int64',
         'metadata': None}
    ],
    'pandas_version': '0.20.0'}

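To illustrate how a reader might consume such a value, the sketch below parses
a shortened version of the example and separates index columns from data
columns. The variable ``raw`` stands in for the string read from the ``pandas``
key of the file footer; how that string is obtained depends on the Parquet
library and is not shown. All variable names here are illustrative.

```python
import json

# Stand-in for the string stored under the 'pandas' key (shortened example).
raw = json.dumps({
    "index_columns": ["__index_level_0__"],
    "columns": [
        {"name": "c0", "pandas_type": "int8",
         "numpy_type": "int8", "metadata": None},
        {"name": "c3", "pandas_type": "datetimetz",
         "numpy_type": "datetime64[ns]",
         "metadata": {"timezone": "America/Los_Angeles"}},
        {"name": "__index_level_0__", "pandas_type": "int64",
         "numpy_type": "int64", "metadata": None},
    ],
    "pandas_version": "0.20.0",
})

meta = json.loads(raw)
index_names = meta["index_columns"]

# Look up column entries by name so index levels can be found directly.
by_name = {col["name"]: col for col in meta["columns"]}
index_columns = [by_name[name] for name in index_names]

# Everything not named in 'index_columns' is an ordinary data column.
data_columns = [
    col["name"] for col in meta["columns"] if col["name"] not in index_names
]
```
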
