ENH: Draft metadata specification doc for Apache Parquet #16315
Changes from 9 commits
47caeb8
0c57d65
656acbe
17c6ba3
2155ea9
d2c66d8
e0a176e
67448be
a2a42c0
2d00f55
@@ -16,3 +16,120 @@ Developer
*********

This section will focus on downstream applications of pandas.

.. _apache.parquet:

Storing pandas DataFrame objects in Apache Parquet format
---------------------------------------------------------

The `Apache Parquet <https://github.com/apache/parquet-format>`__ format
provides key-value metadata at the file and column level, stored in the footer
of the Parquet file:

.. code-block:: shell

   5: optional list<KeyValue> key_value_metadata

where ``KeyValue`` is

.. code-block:: shell

   struct KeyValue {
     1: required string key
     2: optional string value
   }

So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a
``pandas`` metadata key in the ``FileMetaData`` with the value stored as:

.. code-block:: text

   {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
    'columns': [<c0>, <c1>, ...],
    'pandas_version': $VERSION}

Here, ``<c0>`` and so forth are dictionaries containing the metadata for each
column. Each has the JSON form:

.. code-block:: text

   {'name': column_name,
    'pandas_type': pandas_type,
    'numpy_type': numpy_type,
    'metadata': type_metadata}

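As a rough illustration of the layout above, the following sketch assembles
the per-column dictionaries and the top-level dictionary, then pairs the
result with the ``pandas`` key. The helper names ``column_entry`` and
``pandas_key_value`` are hypothetical, and serializing the value with
``json.dumps`` is an assumption based on the "JSON form" wording; this is not
the pandas or Parquet writer implementation.

```python
import json


def column_entry(name, pandas_type, numpy_type, metadata=None):
    """Build one per-column dictionary (<c0>, <c1>, ...). Hypothetical helper."""
    return {
        "name": name,
        "pandas_type": pandas_type,
        "numpy_type": numpy_type,
        "metadata": metadata,
    }


def pandas_key_value(index_columns, columns, pandas_version):
    """Return a (key, value) pair shaped like a Parquet KeyValue struct."""
    payload = {
        "index_columns": list(index_columns),
        "columns": list(columns),
        "pandas_version": pandas_version,
    }
    # KeyValue.value is a string, so the dictionary is serialized first.
    # Using JSON for this step is an assumption of this sketch.
    return "pandas", json.dumps(payload)


key, value = pandas_key_value(
    index_columns=["__index_level_0__"],
    columns=[
        column_entry("c0", "int8", "int8"),
        column_entry("__index_level_0__", "int64", "int64"),
    ],
    pandas_version="0.20.0",
)
```

Both members of the pair are plain strings, matching the ``KeyValue`` struct,
so a reader can recover the dictionary with ``json.loads(value)``.
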
``pandas_type`` is the logical type of the column, and is one of:

* Boolean: ``'bool'``
* Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
* Floats: ``'float16', 'float32', 'float64'``
* Date and Time Types: ``'datetime', 'datetimetz', 'timedelta'``
* String: ``'unicode', 'bytes'``
* Categorical: ``'categorical'``
* Other Python objects: ``'object'``

The ``numpy_type`` is the physical storage type of the column, which is the
result of ``str(dtype)`` for the underlying NumPy array that holds the data. So
for ``datetimetz`` this is ``datetime64[ns]`` and for categorical, it may be
any of the supported integer categorical types.

The ``type_metadata`` is ``None`` except for:

* ``datetimetz``: ``{'timezone': zone, 'unit': 'ns'}``, e.g.
  ``{'timezone': 'America/New_York', 'unit': 'ns'}``. The ``'unit'`` is
  optional, and if omitted it is assumed to be nanoseconds.
* ``categorical``: ``{'num_categories': K, 'ordered': is_ordered, 'type': $TYPE}``

  * Here ``'type'`` is optional, and can be a nested pandas type specification
    (but not categorical).

* ``unicode``: ``{'encoding': encoding}``

  * The encoding is optional, and if not present is UTF-8.

* ``object``: ``{'encoding': encoding}``. Objects can be serialized and stored
  in ``BYTE_ARRAY`` Parquet columns. The encoding can be one of:

  * ``'pickle'``
  * ``'msgpack'``
  * ``'bson'``
  * ``'json'``

* ``timedelta``: ``{'unit': 'ns'}``. The ``'unit'`` is optional, and if omitted
  it is assumed to be nanoseconds. This metadata is optional altogether.

For types other than these, the ``'metadata'`` key can be
omitted. Implementations can assume ``None`` if the key is not present.

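The defaulting rules above could be applied by a reader along the following
lines. This is a minimal sketch: the function name ``resolve_type_metadata``
is hypothetical, and it covers only the defaults stated in this section
(a missing ``'metadata'`` key means ``None``, a missing ``'unit'`` means
nanoseconds, a missing ``'encoding'`` on ``unicode`` means UTF-8).

```python
def resolve_type_metadata(column):
    """Return a column's type metadata with the documented defaults filled in.

    Hypothetical helper; `column` is one per-column dictionary.
    """
    pandas_type = column["pandas_type"]
    # A missing 'metadata' key is treated the same as an explicit None.
    metadata = dict(column.get("metadata") or {})

    if pandas_type in ("datetimetz", "timedelta"):
        # 'unit' is optional and assumed to be nanoseconds when omitted.
        metadata.setdefault("unit", "ns")
    elif pandas_type == "unicode":
        # 'encoding' is optional and assumed to be UTF-8 when omitted.
        metadata.setdefault("encoding", "UTF-8")

    # Types with no metadata at all keep the documented value of None.
    return metadata or None


tz_column = {
    "name": "c3",
    "pandas_type": "datetimetz",
    "numpy_type": "datetime64[ns]",
    "metadata": {"timezone": "America/Los_Angeles"},
}
resolved = resolve_type_metadata(tz_column)
```

The input dictionary is copied rather than modified, so the metadata as read
from the file stays available alongside the resolved form.
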
As an example of fully-formed metadata:

.. code-block:: text

   {'index_columns': ['__index_level_0__'],
    'columns': [
        {'name': 'c0',
         'pandas_type': 'int8',
         'numpy_type': 'int8',
         'metadata': None},
        {'name': 'c1',
         'pandas_type': 'bytes',
         'numpy_type': 'object',
         'metadata': None},
        {'name': 'c2',
         'pandas_type': 'categorical',
         'numpy_type': 'int16',
         'metadata': {'num_categories': 1000, 'ordered': False}},
        {'name': 'c3',
         'pandas_type': 'datetimetz',
         'numpy_type': 'datetime64[ns]',
         'metadata': {'timezone': 'America/Los_Angeles'}},
        {'name': 'c4',
         'pandas_type': 'object',
         'numpy_type': 'object',
         'metadata': {'encoding': 'pickle'}},
        {'name': '__index_level_0__',
         'pandas_type': 'int64',
         'numpy_type': 'int64',
         'metadata': None}
    ],
    'pandas_version': '0.20.0'}
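
To show how a reader might use such metadata, the sketch below parses a
shortened copy of the example and separates index columns from data columns
using ``index_columns``. The function name ``split_columns`` is hypothetical,
and the sketch assumes the value was serialized as JSON.

```python
import json

# A shortened copy of the example metadata, serialized as JSON (assumption).
raw_value = json.dumps({
    "index_columns": ["__index_level_0__"],
    "columns": [
        {"name": "c0", "pandas_type": "int8",
         "numpy_type": "int8", "metadata": None},
        {"name": "c3", "pandas_type": "datetimetz",
         "numpy_type": "datetime64[ns]",
         "metadata": {"timezone": "America/Los_Angeles"}},
        {"name": "__index_level_0__", "pandas_type": "int64",
         "numpy_type": "int64", "metadata": None},
    ],
    "pandas_version": "0.20.0",
})


def split_columns(raw):
    """Return (data_columns, index_columns) as lists of column dictionaries."""
    metadata = json.loads(raw)
    index_names = set(metadata["index_columns"])
    data, index = [], []
    for column in metadata["columns"]:
        # A column belongs to the index if its name is listed in index_columns.
        target = index if column["name"] in index_names else data
        target.append(column)
    return data, index


data_columns, index_columns = split_columns(raw_value)
data_names = [column["name"] for column in data_columns]
index_names = [column["name"] for column in index_columns]
```
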