ENH: Draft metadata specification doc for Apache Parquet #16315

Merged
merged 10 commits into from
May 16, 2017
Changes from 7 commits
104 changes: 104 additions & 0 deletions doc/source/developer.rst
@@ -16,3 +16,107 @@ Developer
*********

This section will focus on downstream applications of pandas.

Contributor

add a ref-tag here.

Member Author

done

Storing pandas DataFrame objects in Apache Parquet format
---------------------------------------------------------

The `Apache Parquet <https://github.com/apache/parquet-format>`__ format
provides key-value metadata at the file and column level, stored in the footer
of the Parquet file:

.. code-block:: text

   5: optional list<KeyValue> key_value_metadata

where ``KeyValue`` is

.. code-block:: text

   struct KeyValue {
     1: required string key
     2: optional string value
   }

So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
``pandas`` metadata key in the ``FileMetaData`` with the value stored as:

.. code-block:: text

   {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
    'columns': [<c0>, <c1>, ...],
    'pandas_version': $VERSION}
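
As a minimal sketch of how a writer might produce this key-value pair, the
following assumes the value is serialized with ``json.dumps`` and stored under
the key ``'pandas'``. The helper names ``build_pandas_metadata`` and
``to_key_value`` are hypothetical and are not part of pandas or any Parquet
library.

```python
import json


def build_pandas_metadata(index_columns, columns, pandas_version):
    """Assemble the top-level dictionary with the three keys described above."""
    return {
        "index_columns": list(index_columns),
        "columns": list(columns),
        "pandas_version": pandas_version,
    }


def to_key_value(metadata):
    """Return a (key, value) pair of strings, mirroring the KeyValue struct.

    Assumption: the dictionary is JSON-encoded before being stored.
    """
    return "pandas", json.dumps(metadata)


# The column list is left empty here; per-column entries are dictionaries.
key, value = to_key_value(
    build_pandas_metadata(["__index_level_0__"], [], "0.20.0")
)
```

A reader would reverse the process with ``json.loads(value)`` after finding the
``'pandas'`` key in the file-level metadata.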

Here, ``<c0>`` and so forth are dictionaries containing the metadata for each
column. Each dictionary has the JSON form:

.. code-block:: text

   {'name': column_name,
    'pandas_type': pandas_type,
    'numpy_type': numpy_type,
Contributor

-> numpy_type (I think that's the spelling elsewhere)

    'metadata': type_metadata}
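
A per-column entry can be sketched as a plain dictionary builder. The function
name ``column_metadata`` is hypothetical, and the sketch uses the
``numpy_type`` spelling that the rest of this section uses for the physical
type key.

```python
def column_metadata(name, pandas_type, numpy_type, metadata=None):
    """Build one per-column dictionary with the four keys shown above.

    ``metadata`` defaults to None, which is the value for most types.
    """
    return {
        "name": name,
        "pandas_type": pandas_type,
        "numpy_type": numpy_type,
        "metadata": metadata,
    }


# A plain integer column carries no type metadata.
c0 = column_metadata("c0", "int8", "int8")

# A timezone-aware column records its zone in the type metadata.
c3 = column_metadata(
    "c3", "datetimetz", "datetime64[ns]",
    {"timezone": "America/Los_Angeles"},
)
```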

``pandas_type`` is the logical type of the column, and is one of:

* Boolean: ``'bool'``
* Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
* Floats: ``'float16', 'float32', 'float64'``
* Datetime: ``'datetime', 'datetimetz'``
* String: ``'unicode', 'bytes'``
Contributor

assume utf-8? (if its not, then would be object?), or is it possible to provide a string encoding?

Member Author

added optional encoding metadata

* Categorical: ``'categorical'``
* Other Python objects: ``'object'``
Contributor

do you store the categorical types as a nested specification? (e.g. ints, string, etc).?

Member Author

good catch, will do


The ``numpy_type`` is the physical storage type of the column, which is the
Contributor

add timedelta type

Contributor

this one?

Member Author

added a timedelta type with optional metadata indicating the unit

result of ``str(dtype)`` for the underlying NumPy array that holds the data. So
for ``datetimetz`` this is ``datetime64[ns]``, and for ``categorical`` it may be
any of the supported integer types used to store the category codes.

The ``type_metadata`` is ``None`` except for:

* ``datetimetz``: ``{'timezone': zone}``, e.g. ``{'timezone': 'America/New_York'}``
Contributor

maybe unit on the datetime for future compat?

* ``categorical``: ``{'num_categories': K, 'ordered': is_ordered}``
* ``object``: ``{'encoding': encoding}``

Objects can be serialized and stored in ``BYTE_ARRAY`` Parquet columns. The
encoding can be one of:

* ``'pickle'``
* ``'msgpack'``
* ``'bson'``
* ``'json'``

For types other than these, the ``'metadata'`` key can be
omitted. Implementations can assume ``None`` if the key is not present.
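
The rules for ``pandas_type`` and ``type_metadata`` can be sketched as a small
checker. This is one strict reading of the draft, not a reference
implementation: the name ``check_column`` and the choice to raise
``ValueError`` are assumptions, and the allowed values are copied from the
lists in this section.

```python
PANDAS_TYPES = {
    "bool",
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64",
    "datetime", "datetimetz",
    "unicode", "bytes",
    "categorical",
    "object",
}
OBJECT_ENCODINGS = {"pickle", "msgpack", "bson", "json"}

# Keys that the type metadata must contain for the three special types.
REQUIRED_KEYS = {
    "datetimetz": {"timezone"},
    "categorical": {"num_categories", "ordered"},
    "object": {"encoding"},
}


def check_column(column):
    """Validate one column dictionary and return its type metadata."""
    pandas_type = column["pandas_type"]
    if pandas_type not in PANDAS_TYPES:
        raise ValueError("unknown pandas_type: %r" % (pandas_type,))

    # A missing 'metadata' key is treated the same as None.
    metadata = column.get("metadata")
    required = REQUIRED_KEYS.get(pandas_type)

    if required is None:
        if metadata is not None:
            raise ValueError("%s takes no type metadata" % pandas_type)
        return None

    if metadata is None or not required <= set(metadata):
        raise ValueError("%s requires keys %s" % (pandas_type, sorted(required)))
    if pandas_type == "object" and metadata["encoding"] not in OBJECT_ENCODINGS:
        raise ValueError("unknown encoding: %r" % (metadata["encoding"],))
    return metadata
```

A more lenient reader could ignore unexpected metadata instead of raising; the
draft does not say which behavior is intended.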

As an example of fully-formed metadata:

.. code-block:: text

   {'index_columns': ['__index_level_0__'],
    'columns': [
        {'name': 'c0',
         'pandas_type': 'int8',
         'numpy_type': 'int8',
         'metadata': None},
        {'name': 'c1',
         'pandas_type': 'bytes',
         'numpy_type': 'object',
         'metadata': None},
        {'name': 'c2',
         'pandas_type': 'categorical',
         'numpy_type': 'int16',
         'metadata': {'num_categories': 1000, 'ordered': False}},
        {'name': 'c3',
         'pandas_type': 'datetimetz',
         'numpy_type': 'datetime64[ns]',
         'metadata': {'timezone': 'America/Los_Angeles'}},
        {'name': 'c4',
         'pandas_type': 'object',
         'numpy_type': 'object',
         'metadata': {'encoding': 'pickle'}},
        {'name': '__index_level_0__',
         'pandas_type': 'int64',
         'numpy_type': 'int64',
         'metadata': None}
    ],
    'pandas_version': '0.20.0'}
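
When reading such metadata back, a consumer needs to separate the index
columns from the data columns. The sketch below assumes, as in the example
above, that index columns also appear in the ``'columns'`` list. The function
name ``split_columns`` is hypothetical.

```python
def split_columns(metadata):
    """Return (index_columns, data_columns) as lists of column dictionaries."""
    index_names = metadata["index_columns"]
    by_name = {col["name"]: col for col in metadata["columns"]}

    # Preserve the order given in 'index_columns' for multi-level indexes.
    index = [by_name[name] for name in index_names]
    index_set = set(index_names)
    data = [col for col in metadata["columns"] if col["name"] not in index_set]
    return index, data


# A shortened version of the example metadata, with one data column.
example = {
    "index_columns": ["__index_level_0__"],
    "columns": [
        {"name": "c0", "pandas_type": "int8",
         "numpy_type": "int8", "metadata": None},
        {"name": "__index_level_0__", "pandas_type": "int64",
         "numpy_type": "int64", "metadata": None},
    ],
    "pandas_version": "0.20.0",
}

index, data = split_columns(example)
```
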
1 change: 1 addition & 0 deletions doc/source/index.rst.template
@@ -146,6 +146,7 @@ See the package overview for more detail about what's in the library.
comparison_with_r
comparison_with_sql
comparison_with_sas
metadata
Contributor

nix this

{% endif -%}
{% if api -%}
api