Commit e0a176e

Move Parquet metadata to developer.rst, account for code reviews
1 parent d2c66d8 commit e0a176e

2 files changed: +104 -100 lines changed

doc/source/developer.rst

+104
@@ -16,3 +16,107 @@ Developer
*********

This section will focus on downstream applications of pandas.

Storing pandas DataFrame objects in Apache Parquet format
---------------------------------------------------------

The `Apache Parquet <https://github.com/apache/parquet-format>`__ format
provides key-value metadata at the file and column level, stored in the footer
of the Parquet file:

.. code-block:: text

   5: optional list<KeyValue> key_value_metadata

where ``KeyValue`` is

.. code-block:: text

   struct KeyValue {
     1: required string key
     2: optional string value
   }

So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
``pandas`` metadata key in the ``FileMetaData``, with the value stored as:

.. code-block:: text

   {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
    'columns': [<c0>, <c1>, ...],
    'pandas_version': $VERSION}

Here, ``<c0>`` and so forth are dictionaries containing the metadata for each
column. Each has the JSON form:

.. code-block:: text

   {'name': column_name,
    'pandas_type': pandas_type,
    'numpy_type': numpy_type,
    'metadata': type_metadata}
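
The structure above can be assembled with the Python standard library alone.
The following is a minimal sketch, not the pandas or Parquet implementation:
the helper names ``make_column_metadata`` and ``make_pandas_metadata`` are
hypothetical. Because a ``KeyValue`` value is a string, the sketch assumes the
dictionary is serialized with ``json.dumps``, which writes Python ``None`` as
JSON ``null``.

```python
import json


def make_column_metadata(name, pandas_type, numpy_type, metadata=None):
    """Build the metadata dictionary for one column (hypothetical helper)."""
    return {
        "name": name,
        "pandas_type": pandas_type,
        "numpy_type": numpy_type,
        "metadata": metadata,
    }


def make_pandas_metadata(index_columns, columns, pandas_version):
    """Serialize the top-level metadata to a JSON string (hypothetical helper)."""
    return json.dumps(
        {
            "index_columns": index_columns,
            "columns": columns,
            "pandas_version": pandas_version,
        }
    )


columns = [
    make_column_metadata("c0", "int8", "int8"),
    make_column_metadata("__index_level_0__", "int64", "int64"),
]
# The resulting string is what would be stored under the ``pandas`` key.
value = make_pandas_metadata(["__index_level_0__"], columns, "0.20.0")
```

Keeping ``metadata`` as an optional argument mirrors the rule that most column
types carry no type metadata.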

``pandas_type`` is the logical type of the column, and is one of:

* Boolean: ``'bool'``
* Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
* Floats: ``'float16', 'float32', 'float64'``
* Datetime: ``'datetime', 'datetimetz'``
* String: ``'unicode', 'bytes'``
* Categorical: ``'categorical'``
* Other Python objects: ``'object'``

The ``numpy_type`` is the physical storage type of the column, which is the
result of ``str(dtype)`` for the underlying NumPy array that holds the data. So
for ``datetimetz`` this is ``datetime64[ns]``, and for ``categorical`` it may be
any of the supported integer categorical types.

The ``type_metadata`` is ``None`` except for:

* ``datetimetz``: ``{'timezone': zone}``, e.g. ``{'timezone': 'America/New_York'}``
* ``categorical``: ``{'num_categories': K, 'ordered': is_ordered}``
* ``object``: ``{'encoding': encoding}``

Objects can be serialized and stored in ``BYTE_ARRAY`` Parquet columns. The
encoding can be one of:

* ``'pickle'``
* ``'msgpack'``
* ``'bson'``
* ``'json'``
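
As an illustration of two of the encodings listed above, the sketch below
turns a Python object into bytes suitable for a ``BYTE_ARRAY`` value and back.
The helpers ``encode_object`` and ``decode_object`` are hypothetical, and the
choice of UTF-8 for the ``'json'`` encoding is an assumption, since the text
does not specify one. ``'msgpack'`` and ``'bson'`` need third-party packages
and are left out.

```python
import json
import pickle


def encode_object(obj, encoding):
    """Serialize ``obj`` to bytes for a BYTE_ARRAY column (hypothetical helper)."""
    if encoding == "pickle":
        return pickle.dumps(obj)
    if encoding == "json":
        # Assumption: JSON text is stored as UTF-8 bytes.
        return json.dumps(obj).encode("utf-8")
    # 'msgpack' and 'bson' are valid encodings but are not covered by this sketch.
    raise ValueError("encoding not covered by this sketch: %r" % (encoding,))


def decode_object(data, encoding):
    """Inverse of ``encode_object`` (hypothetical helper)."""
    if encoding == "pickle":
        # Only unpickle data from a trusted source.
        return pickle.loads(data)
    if encoding == "json":
        return json.loads(data.decode("utf-8"))
    raise ValueError("encoding not covered by this sketch: %r" % (encoding,))


payload = {"a": [1, 2, 3]}
stored = encode_object(payload, "json")
restored = decode_object(stored, "json")
```

A reader would pick the decoder from the column's ``{'encoding': encoding}``
type metadata.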

For types other than ``datetimetz``, ``categorical``, and ``object``, the
``'metadata'`` key can be omitted. Implementations can assume ``None`` if the
key is not present.
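
On the reading side, the "assume ``None``" rule maps directly onto
``dict.get``. A minimal sketch, with ``get_type_metadata`` as a hypothetical
helper name:

```python
def get_type_metadata(column):
    """Return a column's type metadata, or None when the key is absent."""
    return column.get("metadata")


with_key = {
    "name": "c3",
    "pandas_type": "datetimetz",
    "numpy_type": "datetime64[ns]",
    "metadata": {"timezone": "America/Los_Angeles"},
}
# A writer may omit the key entirely for types that carry no metadata.
without_key = {"name": "c0", "pandas_type": "int8", "numpy_type": "int8"}

tz_metadata = get_type_metadata(with_key)
int_metadata = get_type_metadata(without_key)
```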

As an example of fully-formed metadata:

.. code-block:: text

   {'index_columns': ['__index_level_0__'],
    'columns': [
        {'name': 'c0',
         'pandas_type': 'int8',
         'numpy_type': 'int8',
         'metadata': None},
        {'name': 'c1',
         'pandas_type': 'bytes',
         'numpy_type': 'object',
         'metadata': None},
        {'name': 'c2',
         'pandas_type': 'categorical',
         'numpy_type': 'int16',
         'metadata': {'num_categories': 1000, 'ordered': False}},
        {'name': 'c3',
         'pandas_type': 'datetimetz',
         'numpy_type': 'datetime64[ns]',
         'metadata': {'timezone': 'America/Los_Angeles'}},
        {'name': 'c4',
         'pandas_type': 'object',
         'numpy_type': 'object',
         'metadata': {'encoding': 'pickle'}},
        {'name': '__index_level_0__',
         'pandas_type': 'int64',
         'numpy_type': 'int64',
         'metadata': None}
    ],
    'pandas_version': '0.20.0'}
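
The example above is written in Python literal notation. Assuming the value
is stored as JSON text, ``None`` becomes ``null`` and ``False`` becomes
``false``. The sketch below parses an abbreviated JSON rendering of the
example and separates index columns from data columns. ``split_columns`` is a
hypothetical helper, not a pandas API.

```python
import json

# Abbreviated JSON rendering of the example metadata (three of the six columns).
raw = """
{"index_columns": ["__index_level_0__"],
 "columns": [
   {"name": "c0", "pandas_type": "int8", "numpy_type": "int8",
    "metadata": null},
   {"name": "c2", "pandas_type": "categorical", "numpy_type": "int16",
    "metadata": {"num_categories": 1000, "ordered": false}},
   {"name": "__index_level_0__", "pandas_type": "int64",
    "numpy_type": "int64", "metadata": null}
 ],
 "pandas_version": "0.20.0"}
"""


def split_columns(metadata):
    """Return (data column names, index column names), preserving file order."""
    index_names = set(metadata["index_columns"])
    names = [column["name"] for column in metadata["columns"]]
    data = [name for name in names if name not in index_names]
    index = [name for name in names if name in index_names]
    return data, index


metadata = json.loads(raw)
data_columns, index_columns = split_columns(metadata)
```

A reader could use the second list to decide which stored columns to restore
as the ``DataFrame`` index.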

doc/source/metadata.rst

-100
This file was deleted.
