Commit 78c6843

wesm authored and jorisvandenbossche committed
DOC: Add expanded index descriptors for specifying for RangeIndex-as-metadata in Parquet file schema (#25709)
1 parent d320ef7 commit 78c6843

1 file changed, with 41 additions and 19 deletions.

doc/source/development/developer.rst

@@ -37,12 +37,19 @@ So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
 
 .. code-block:: text
 
-   {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
+   {'index_columns': [<descr0>, <descr1>, ...],
     'column_indexes': [<ci0>, <ci1>, ..., <ciN>],
     'columns': [<c0>, <c1>, ...],
-    'pandas_version': $VERSION}
+    'pandas_version': $VERSION,
+    'creator': {
+      'library': $LIBRARY,
+      'version': $LIBRARY_VERSION
+    }}
 
-Here, ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata
+The "descriptor" values ``<descr0>`` in the ``'index_columns'`` field are
+strings (referring to a column) or dictionaries with values as described below.
+
+The ``<c0>``/``<ci0>`` and so forth are dictionaries containing the metadata
 for each column, *including the index columns*. This has JSON form:
 
 .. code-block:: text
@@ -53,26 +60,37 @@ for each column, *including the index columns*. This has JSON form:
     'numpy_type': numpy_type,
     'metadata': metadata}
 
-.. note::
+See below for the detailed specification for these.
+
+Index Metadata Descriptors
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``RangeIndex`` can be stored as metadata only, not requiring serialization. The
+descriptor format for these as is follows:
 
-   Every index column is stored with a name matching the pattern
-   ``__index_level_\d+__`` and its corresponding column information is can be
-   found with the following code snippet.
+.. code-block:: python
 
-   Following this naming convention isn't strictly necessary, but strongly
-   suggested for compatibility with Arrow.
+   index = pd.RangeIndex(0, 10, 2)
+   {'kind': 'range',
+    'name': index.name,
+    'start': index.start,
+    'stop': index.stop,
+    'step': index.step}
 
-   Here's an example of how the index metadata is structured in pyarrow:
+Other index types must be serialized as data columns along with the other
+DataFrame columns. The metadata for these is a string indicating the name of
+the field in the data columns, for example ``'__index_level_0__'``.
 
-   .. code-block:: python
+If an index has a non-None ``name`` attribute, and there is no other column
+with a name matching that value, then the ``index.name`` value can be used as
+the descriptor. Otherwise (for unnamed indexes and ones with names colliding
+with other column names) a disambiguating name with pattern matching
+``__index_level_\d+__`` should be used. In cases of named indexes as data
+columns, ``name`` attribute is always stored in the column descriptors as
+above.
 
-      # assuming there's at least 3 levels in the index
-      index_columns = metadata['index_columns']  # noqa: F821
-      columns = metadata['columns']  # noqa: F821
-      ith_index = 2
-      assert index_columns[ith_index] == '__index_level_2__'
-      ith_index_info = columns[-len(index_columns):][ith_index]
-      ith_index_level_name = ith_index_info['name']
+Column Metadata
+~~~~~~~~~~~~~~~
 
 ``pandas_type`` is the logical type of the column, and is one of:
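
Reviewer sketch (not part of the commit): the descriptor rules added in this hunk could be written down roughly as follows. The helper names `range_descriptor` and `column_descriptor` are hypothetical, and the sketch assumes index names are strings. It uses plain Python values instead of a real `pd.RangeIndex` so that it runs without pandas installed.

```python
from typing import List, Optional


def range_descriptor(name: Optional[str], start: int, stop: int, step: int) -> dict:
    """Build the metadata-only descriptor for a RangeIndex-like index."""
    return {'kind': 'range',
            'name': name,
            'start': start,
            'stop': stop,
            'step': step}


def column_descriptor(level: int,
                      index_name: Optional[str],
                      column_names: List[str]) -> str:
    """Pick the descriptor string for an index level stored as a data column.

    A named index keeps its own name unless that name collides with a
    regular column; unnamed or colliding levels fall back to the
    ``__index_level_<n>__`` pattern.
    """
    if index_name is not None and index_name not in column_names:
        return index_name
    return '__index_level_{}__'.format(level)
```

For instance, under these assumptions `column_descriptor(0, 'id', ['a', 'b'])` keeps the name `'id'`, while `column_descriptor(1, 'a', ['a', 'b'])` falls back to `'__index_level_1__'` because `'a'` is already a data column.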

@@ -161,4 +179,8 @@ As an example of fully-formed metadata:
          'numpy_type': 'int64',
          'metadata': None}
     ],
-    'pandas_version': '0.20.0'}
+    'pandas_version': '0.20.0',
+    'creator': {
+      'library': 'pyarrow',
+      'version': '0.13.0'
+    }}
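
Reviewer sketch (not part of the commit): a reader of this metadata now has to handle two kinds of entries in `'index_columns'`, dictionaries for range indexes and strings for indexes stored as data columns. The function below is one hedged way to separate them. The name `resolve_index_columns` and the `example` dictionary are hypothetical, and Python's built-in `range` stands in for `pd.RangeIndex` so the snippet has no dependencies.

```python
def resolve_index_columns(metadata):
    """Split 'index_columns' into rebuilt ranges and data-column field names."""
    ranges = []
    field_names = []
    for descr in metadata['index_columns']:
        if isinstance(descr, dict):
            # Only the 'range' kind is described in this commit.
            if descr.get('kind') != 'range':
                raise ValueError('unsupported index descriptor: %r' % (descr,))
            rebuilt = range(descr['start'], descr['stop'], descr['step'])
            ranges.append((descr['name'], rebuilt))
        else:
            # A string refers to a serialized data column by field name.
            field_names.append(descr)
    return ranges, field_names


# Illustrative metadata only; values mirror the shapes shown in the diff.
example = {
    'index_columns': [
        {'kind': 'range', 'name': None, 'start': 0, 'stop': 10, 'step': 2},
        '__index_level_1__',
    ],
    'column_indexes': [],
    'columns': [],
    'pandas_version': '0.20.0',
    'creator': {'library': 'pyarrow', 'version': '0.13.0'},
}

ranges, field_names = resolve_index_columns(example)
```

Raising on an unknown `kind` is a design choice of this sketch; the commit itself does not say how readers should treat descriptor kinds other than `'range'`.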
