@@ -16,3 +16,120 @@ Developer
*********
This section will focus on downstream applications of pandas.
+
+ .. _apache.parquet:
+
+ Storing pandas DataFrame objects in Apache Parquet format
+ ---------------------------------------------------------
+
+ The `Apache Parquet <https://github.com/apache/parquet-format>`__ format
+ provides key-value metadata at the file and column level, stored in the footer
+ of the Parquet file:
+
+ .. code-block:: text
+
+    5: optional list<KeyValue> key_value_metadata
+
+ where ``KeyValue`` is
+
+ .. code-block:: text
+
+    struct KeyValue {
+      1: required string key
+      2: optional string value
+    }
+
+ So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
+ ``pandas`` metadata key in the ``FileMetaData`` with the value stored as:
+
+ .. code-block:: text
+
+    {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
+     'columns': [<c0>, <c1>, ...],
+     'pandas_version': $VERSION}
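As a minimal sketch, assuming the value is serialized as JSON text, the ``pandas`` entry could be built and looked up as follows. The ``KeyValue`` named tuple and the ``build_pandas_metadata`` and ``find_pandas_metadata`` helpers are hypothetical names used only for illustration; they are not part of pandas or of any Parquet library.

```python
import json
from typing import NamedTuple, Optional


class KeyValue(NamedTuple):
    """Stand-in for the Parquet KeyValue struct (hypothetical helper)."""
    key: str
    value: Optional[str] = None


def build_pandas_metadata(index_columns, columns, pandas_version):
    """Return a KeyValue holding the top-level metadata as JSON text."""
    payload = {
        "index_columns": list(index_columns),
        "columns": list(columns),
        "pandas_version": pandas_version,
    }
    return KeyValue(key="pandas", value=json.dumps(payload))


def find_pandas_metadata(key_value_metadata):
    """Return the decoded 'pandas' entry, or None if it is absent."""
    for entry in key_value_metadata:
        if entry.key == "pandas" and entry.value is not None:
            return json.loads(entry.value)
    return None


footer = [
    KeyValue("writer", "example"),  # unrelated entry, made up for the demo
    build_pandas_metadata(["__index_level_0__"], [], "0.20.0"),
]
decoded = find_pandas_metadata(footer)
```

A reader that does not find the ``pandas`` key gets ``None`` back and can fall back to treating the file as plain Parquet data.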
+
+ Here, ``<c0>`` and so forth are dictionaries containing the metadata for each
+ column. Each has the JSON form:
+
+ .. code-block:: text
+
+    {'name': column_name,
+     'pandas_type': pandas_type,
+     'numpy_type': numpy_type,
+     'metadata': type_metadata}
+
+ ``pandas_type`` is the logical type of the column, and is one of:
+
+ * Boolean: ``'bool'``
+ * Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
+ * Floats: ``'float16', 'float32', 'float64'``
+ * Date and Time Types: ``'datetime', 'datetimetz', 'timedelta'``
+ * String: ``'unicode', 'bytes'``
+ * Categorical: ``'categorical'``
+ * Other Python objects: ``'object'``
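The list of logical types above can be checked mechanically. The following is a sketch only: ``VALID_PANDAS_TYPES`` and ``check_pandas_type`` are hypothetical names, and raising ``ValueError`` on an unknown type is an assumption made for the example, not behavior required by this specification.

```python
# Logical types copied from the list above.
VALID_PANDAS_TYPES = frozenset([
    "bool",
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64",
    "datetime", "datetimetz", "timedelta",
    "unicode", "bytes",
    "categorical",
    "object",
])


def check_pandas_type(column):
    """Return the column's pandas_type, raising ValueError if it is not listed."""
    pandas_type = column.get("pandas_type")
    if pandas_type not in VALID_PANDAS_TYPES:
        raise ValueError(
            "column {!r} has unknown pandas_type {!r}".format(
                column.get("name"), pandas_type))
    return pandas_type
```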
+
+ The ``numpy_type`` is the physical storage type of the column, which is the
+ result of ``str(dtype)`` for the underlying NumPy array that holds the data. So
+ for ``datetimetz`` this is ``datetime64[ns]`` and for categorical, it may be
+ any of the supported integer categorical types.
+
+ The ``type_metadata`` is ``None`` except for:
+
+ * ``datetimetz``: ``{'timezone': zone, 'unit': 'ns'}``, e.g.
+   ``{'timezone': 'America/New_York', 'unit': 'ns'}``. The ``'unit'`` is
+   optional, and if omitted it is assumed to be nanoseconds.
+ * ``categorical``: ``{'num_categories': K, 'ordered': is_ordered, 'type': $TYPE}``
+
+   * Here ``'type'`` is optional, and can be a nested pandas type specification
+     (but not categorical).
+
+ * ``unicode``: ``{'encoding': encoding}``
+
+   * The encoding is optional, and if not present is UTF-8.
+
+ * ``object``: ``{'encoding': encoding}``. Objects can be serialized and stored
+   in ``BYTE_ARRAY`` Parquet columns. The encoding can be one of:
+
+   * ``'pickle'``
+   * ``'msgpack'``
+   * ``'bson'``
+   * ``'json'``
+
+ * ``timedelta``: ``{'unit': 'ns'}``. The ``'unit'`` is optional, and if omitted
+   it is assumed to be nanoseconds. This metadata is optional altogether.
+
+ For types other than these, the ``'metadata'`` key can be
+ omitted. Implementations can assume ``None`` if the key is not present.
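The optional keys and defaults described above could be applied by a reader roughly as follows. This is a sketch under the stated rules; ``normalize_type_metadata`` is a hypothetical helper name, and returning a copy rather than modifying the input is a design choice made for the example.

```python
def normalize_type_metadata(column):
    """Return the column's type metadata with the documented defaults filled in."""
    pandas_type = column["pandas_type"]
    # A missing 'metadata' key is treated the same as an explicit None.
    metadata = column.get("metadata")
    metadata = dict(metadata) if metadata is not None else {}

    if pandas_type in ("datetimetz", "timedelta"):
        # 'unit' is optional and assumed to be nanoseconds when omitted.
        metadata.setdefault("unit", "ns")
    elif pandas_type == "unicode":
        # 'encoding' is optional and assumed to be UTF-8 when omitted.
        metadata.setdefault("encoding", "UTF-8")

    # Types that carry no metadata keep None.
    return metadata or None
```

For instance, a ``datetimetz`` column whose metadata only names a timezone comes back with ``'unit': 'ns'`` added, while an ``int8`` column with no ``'metadata'`` key yields ``None``.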
+
+ As an example of fully-formed metadata:
+
+ .. code-block:: text
+
+    {'index_columns': ['__index_level_0__'],
+     'columns': [
+         {'name': 'c0',
+          'pandas_type': 'int8',
+          'numpy_type': 'int8',
+          'metadata': None},
+         {'name': 'c1',
+          'pandas_type': 'bytes',
+          'numpy_type': 'object',
+          'metadata': None},
+         {'name': 'c2',
+          'pandas_type': 'categorical',
+          'numpy_type': 'int16',
+          'metadata': {'num_categories': 1000, 'ordered': False}},
+         {'name': 'c3',
+          'pandas_type': 'datetimetz',
+          'numpy_type': 'datetime64[ns]',
+          'metadata': {'timezone': 'America/Los_Angeles'}},
+         {'name': 'c4',
+          'pandas_type': 'object',
+          'numpy_type': 'object',
+          'metadata': {'encoding': 'pickle'}},
+         {'name': '__index_level_0__',
+          'pandas_type': 'int64',
+          'numpy_type': 'int64',
+          'metadata': None}
+     ],
+     'pandas_version': '0.20.0'}
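In this example the index level also appears as an entry in ``'columns'``. Assuming that holds in general (the example suggests it, but the text does not state it), a reader could separate index columns from data columns as sketched below. ``split_columns`` is a hypothetical helper, and the metadata is abbreviated to three of the columns above.

```python
metadata = {
    "index_columns": ["__index_level_0__"],
    "columns": [
        {"name": "c0", "pandas_type": "int8",
         "numpy_type": "int8", "metadata": None},
        {"name": "c3", "pandas_type": "datetimetz",
         "numpy_type": "datetime64[ns]",
         "metadata": {"timezone": "America/Los_Angeles"}},
        {"name": "__index_level_0__", "pandas_type": "int64",
         "numpy_type": "int64", "metadata": None},
    ],
    "pandas_version": "0.20.0",
}


def split_columns(metadata):
    """Split column entries into (index_columns, data_columns), keeping order."""
    index_names = set(metadata["index_columns"])
    index_columns = [c for c in metadata["columns"] if c["name"] in index_names]
    data_columns = [c for c in metadata["columns"] if c["name"] not in index_names]
    return index_columns, data_columns


index_columns, data_columns = split_columns(metadata)
index_names = [c["name"] for c in index_columns]
data_names = [c["name"] for c in data_columns]
```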