@@ -16,3 +16,107 @@ Developer
*********

This section will focus on downstream applications of pandas.
+
+ Storing pandas DataFrame objects in Apache Parquet format
+ ---------------------------------------------------------
+
+ The `Apache Parquet <https://github.com/apache/parquet-format>`__ format
+ provides key-value metadata at the file and column level, stored in the footer
+ of the Parquet file:
+
+ .. code-block:: shell
+
+    5: optional list<KeyValue> key_value_metadata
+
+ where ``KeyValue`` is
+
+ .. code-block:: shell
+
+    struct KeyValue {
+      1: required string key
+      2: optional string value
+    }
+
+ So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
+ ``pandas`` metadata key in the ``FileMetaData``, with the value stored as:
+
+ .. code-block:: text
+
+    {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
+     'columns': [<c0>, <c1>, ...],
+     'pandas_version': $VERSION}
+
+ Here, ``<c0>`` and so forth are dictionaries containing the metadata for each
+ column. Each has the JSON form:
+
+ .. code-block:: text
+
+    {'name': column_name,
+     'pandas_type': pandas_type,
+     'numpy_type': numpy_type,
+     'metadata': type_metadata}
+
+ ``pandas_type`` is the logical type of the column, and is one of:
+
+ * Boolean: ``'bool'``
+ * Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
+ * Floats: ``'float16', 'float32', 'float64'``
+ * Datetime: ``'datetime', 'datetimetz'``
+ * String: ``'unicode', 'bytes'``
+ * Categorical: ``'categorical'``
+ * Other Python objects: ``'object'``
+
+ The ``numpy_type`` is the physical storage type of the column, which is the
+ result of ``str(dtype)`` for the underlying NumPy array that holds the data. So
+ for ``datetimetz`` this is ``datetime64[ns]`` and for categorical, it may be
+ any of the supported integer categorical types.
+
+ The ``type_metadata`` is ``None`` except for:
+
+ * ``datetimetz``: ``{'timezone': zone}``, e.g. ``{'timezone': 'America/New_York'}``
+ * ``categorical``: ``{'num_categories': K, 'ordered': is_ordered}``
+ * ``object``: ``{'encoding': encoding}``
+
+ Objects can be serialized and stored in ``BYTE_ARRAY`` Parquet columns. The
+ encoding can be one of:
+
+ * ``'pickle'``
+ * ``'msgpack'``
+ * ``'bson'``
+ * ``'json'``
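A minimal sketch of how an implementation might dispatch on the ``encoding`` value, using only the Python standard library. It covers just ``'pickle'`` and ``'json'``; ``'msgpack'`` and ``'bson'`` need third-party packages and are left out. The helper names ``encode_object`` and ``decode_object`` are hypothetical and not part of pandas or any Parquet library, and the choice of UTF-8 for the JSON bytes is an assumption.

```python
import json
import pickle


def encode_object(obj, encoding):
    """Serialize ``obj`` to bytes suitable for a BYTE_ARRAY value."""
    if encoding == 'pickle':
        return pickle.dumps(obj)
    if encoding == 'json':
        # JSON is text, so it is encoded to bytes (UTF-8 is an assumption).
        return json.dumps(obj).encode('utf-8')
    raise ValueError('encoding not handled in this sketch: %r' % (encoding,))


def decode_object(data, encoding):
    """Inverse of ``encode_object`` for the same ``encoding``."""
    if encoding == 'pickle':
        return pickle.loads(data)
    if encoding == 'json':
        return json.loads(data.decode('utf-8'))
    raise ValueError('encoding not handled in this sketch: %r' % (encoding,))


# Round-trip one value through each encoding handled here.
value = {'a': [1, 2, 3]}
roundtrips = {enc: decode_object(encode_object(value, enc), enc)
              for enc in ('pickle', 'json')}
```

The encoding recorded in the column's ``'metadata'`` entry is what tells a reader which decoder to apply.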
+
+ For types other than these, the ``'metadata'`` key can be
+ omitted. Implementations can assume ``None`` if the key is not present.
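The rule above means a reader can treat a missing ``'metadata'`` key the same as an explicit ``None``. A small sketch, where ``column_type_metadata`` is a hypothetical helper name rather than an existing API:

```python
def column_type_metadata(column):
    """Return the type metadata for one column entry, or None if absent."""
    # dict.get returns None when the key is missing, which matches the rule
    # that an omitted 'metadata' key is equivalent to None.
    return column.get('metadata')


with_key = {'name': 'c3',
            'pandas_type': 'datetimetz',
            'numpy_type': 'datetime64[ns]',
            'metadata': {'timezone': 'America/Los_Angeles'}}
without_key = {'name': 'c0', 'pandas_type': 'int8', 'numpy_type': 'int8'}

tz_meta = column_type_metadata(with_key)
no_meta = column_type_metadata(without_key)
```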
+
+ As an example of fully-formed metadata:
+
+ .. code-block:: text
+
+    {'index_columns': ['__index_level_0__'],
+     'columns': [
+         {'name': 'c0',
+          'pandas_type': 'int8',
+          'numpy_type': 'int8',
+          'metadata': None},
+         {'name': 'c1',
+          'pandas_type': 'bytes',
+          'numpy_type': 'object',
+          'metadata': None},
+         {'name': 'c2',
+          'pandas_type': 'categorical',
+          'numpy_type': 'int16',
+          'metadata': {'num_categories': 1000, 'ordered': False}},
+         {'name': 'c3',
+          'pandas_type': 'datetimetz',
+          'numpy_type': 'datetime64[ns]',
+          'metadata': {'timezone': 'America/Los_Angeles'}},
+         {'name': 'c4',
+          'pandas_type': 'object',
+          'numpy_type': 'object',
+          'metadata': {'encoding': 'pickle'}},
+         {'name': '__index_level_0__',
+          'pandas_type': 'int64',
+          'numpy_type': 'int64',
+          'metadata': None}
+     ],
+     'pandas_version': '0.20.0'}
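For illustration, metadata of this shape can be built as a plain Python dict and round-tripped through JSON text. This is a sketch only: the section describes the value's "JSON form" but does not spell out the exact serialization, so the use of ``json.dumps`` is an assumption, and ``make_column`` is a hypothetical helper. Only a subset of the columns is included.

```python
import json


def make_column(name, pandas_type, numpy_type, metadata=None):
    """Build one per-column entry with the four keys described above."""
    return {'name': name,
            'pandas_type': pandas_type,
            'numpy_type': numpy_type,
            'metadata': metadata}


pandas_metadata = {
    'index_columns': ['__index_level_0__'],
    'columns': [
        make_column('c0', 'int8', 'int8'),
        make_column('c2', 'categorical', 'int16',
                    {'num_categories': 1000, 'ordered': False}),
        make_column('c3', 'datetimetz', 'datetime64[ns]',
                    {'timezone': 'America/Los_Angeles'}),
        make_column('__index_level_0__', 'int64', 'int64'),
    ],
    'pandas_version': '0.20.0',
}

# In JSON text, Python None becomes null and False becomes false.
encoded = json.dumps(pandas_metadata)
decoded = json.loads(encoded)

# Separate data columns from index columns using 'index_columns'.
index_names = set(decoded['index_columns'])
data_columns = [c['name'] for c in decoded['columns']
                if c['name'] not in index_names]
```

Listing the index levels by name in ``'index_columns'`` lets a reader tell them apart from ordinary data columns, since both kinds appear in ``'columns'``.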