Commit c911151

cpcloud authored and datapythonista committed
DOC: Update parquet metadata format description around index levels (#18201)
1 parent fc64ca8 commit c911151

1 file changed, +33 -4 lines changed
doc/source/developer.rst

@@ -41,15 +41,37 @@ So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
     'pandas_version': $VERSION}

 Here, ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata
-for each column. This has JSON form:
+for each column, *including the index columns*. This has JSON form:

 .. code-block:: text

    {'name': column_name,
+    'field_name': parquet_column_name,
     'pandas_type': pandas_type,
     'numpy_type': numpy_type,
     'metadata': metadata}

+.. note::
+
+   Every index column is stored with a name matching the pattern
+   ``__index_level_\d+__`` and its corresponding column information is can be
+   found with the following code snippet.
+
+   Following this naming convention isn't strictly necessary, but strongly
+   suggested for compatibility with Arrow.
+
+Here's an example of how the index metadata is structured in pyarrow:
+
+.. code-block:: python
+
+   # assuming there's at least 3 levels in the index
+   index_columns = metadata['index_columns']
+   columns = metadata['columns']
+   ith_index = 2
+   assert index_columns[ith_index] == '__index_level_2__'
+   ith_index_info = columns[-len(index_columns):][ith_index]
+   ith_index_level_name = ith_index_info['name']
+
 ``pandas_type`` is the logical type of the column, and is one of:

 * Boolean: ``'bool'``
@@ -100,32 +122,39 @@ As an example of fully-formed metadata:
    {'index_columns': ['__index_level_0__'],
     'column_indexes': [
         {'name': None,
-         'pandas_type': 'string',
+         'field_name': 'None',
+         'pandas_type': 'unicode',
          'numpy_type': 'object',
-         'metadata': None}
+         'metadata': {'encoding': 'UTF-8'}}
     ],
     'columns': [
         {'name': 'c0',
+         'field_name': 'c0',
          'pandas_type': 'int8',
          'numpy_type': 'int8',
          'metadata': None},
         {'name': 'c1',
+         'field_name': 'c1',
          'pandas_type': 'bytes',
          'numpy_type': 'object',
          'metadata': None},
         {'name': 'c2',
+         'field_name': 'c2',
          'pandas_type': 'categorical',
          'numpy_type': 'int16',
          'metadata': {'num_categories': 1000, 'ordered': False}},
         {'name': 'c3',
+         'field_name': 'c3',
          'pandas_type': 'datetimetz',
          'numpy_type': 'datetime64[ns]',
          'metadata': {'timezone': 'America/Los_Angeles'}},
         {'name': 'c4',
+         'field_name': 'c4',
          'pandas_type': 'object',
          'numpy_type': 'object',
          'metadata': {'encoding': 'pickle'}},
-        {'name': '__index_level_0__',
+        {'name': None,
+         'field_name': '__index_level_0__',
          'pandas_type': 'int64',
          'numpy_type': 'int64',
          'metadata': None}
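
For readers who want to try the index lookup from this change, here is a minimal sketch in plain Python, run against an abridged copy of the example metadata from the post-change side of the diff. The helper ``index_level_info`` and the constant ``INDEX_FIELD_PATTERN`` are hypothetical names for illustration and are not part of pandas or pyarrow. Matching on ``field_name`` is an assumption about how the new key can be used; the snippet in the commit itself relies on the index entries being the trailing items of ``columns``.

```python
import re

# Abridged copy of the example metadata shown in the diff (only c0, c4 and
# the single index level are kept here).
metadata = {
    'index_columns': ['__index_level_0__'],
    'columns': [
        {'name': 'c0', 'field_name': 'c0',
         'pandas_type': 'int8', 'numpy_type': 'int8', 'metadata': None},
        {'name': 'c4', 'field_name': 'c4',
         'pandas_type': 'object', 'numpy_type': 'object',
         'metadata': {'encoding': 'pickle'}},
        {'name': None, 'field_name': '__index_level_0__',
         'pandas_type': 'int64', 'numpy_type': 'int64', 'metadata': None},
    ],
}

# Naming convention described in the note: __index_level_<digits>__
INDEX_FIELD_PATTERN = re.compile(r'__index_level_\d+__')


def index_level_info(metadata, level):
    """Return the column entry for index level ``level`` (hypothetical helper).

    Assumes each name in ``index_columns`` appears as the ``field_name`` of
    exactly one entry in ``columns``.
    """
    field_name = metadata['index_columns'][level]
    for column in metadata['columns']:
        if column['field_name'] == field_name:
            return column
    raise KeyError(field_name)


info = index_level_info(metadata, 0)

# Positional lookup, as in the snippet added by the commit.
index_columns = metadata['index_columns']
positional_info = metadata['columns'][-len(index_columns):][0]
```

In this example both lookups return the same entry: its ``name`` is ``None`` (the index level is unnamed) while its ``field_name`` carries the ``__index_level_0__`` storage name.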
