.. currentmodule:: pandas
.. ipython:: python
   :suppress:

   import numpy as np
   np.random.seed(123456)
   np.set_printoptions(precision=4, suppress=True)
   import pandas as pd
   pd.options.display.max_rows = 15

This section will focus on downstream applications of pandas.
The Apache Parquet format provides key-value metadata at the file and column level, stored in the footer of the Parquet file::

    5: optional list<KeyValue> key_value_metadata

where ``KeyValue`` is::

    struct KeyValue {
      1: required string key
      2: optional string value
    }

So that a ``pandas.DataFrame`` can be faithfully reconstructed, we store a ``pandas`` metadata key in the ``FileMetaData``, with the value stored as::

    {'index_columns': ['__index_level_0__', '__index_level_1__', ...],
     'columns': [<c0>, <c1>, ...],
     'pandas_version': $VERSION}

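As a rough sketch of how this structure might be attached to a file, the dictionary can be serialized to a string and paired with the ``pandas`` key, mirroring the ``KeyValue`` struct. The ``KeyValue`` named tuple and the ``make_pandas_key_value`` helper are hypothetical stand-ins, and encoding the value with the standard-library ``json`` module is an assumption rather than a description of any particular writer.

```python
import json
from typing import NamedTuple, Optional


class KeyValue(NamedTuple):
    """Hypothetical stand-in for the Parquet KeyValue struct."""
    key: str
    value: Optional[str] = None


def make_pandas_key_value(index_columns, columns, pandas_version):
    """Bundle the metadata into one key-value entry (illustrative only)."""
    metadata = {
        'index_columns': list(index_columns),
        'columns': list(columns),
        'pandas_version': pandas_version,
    }
    # Assumption: the value is stored as JSON text.
    return KeyValue(key='pandas', value=json.dumps(metadata))


entry = make_pandas_key_value(['__index_level_0__'], [], '0.20.0')
restored = json.loads(entry.value)  # round-trips back to the dictionary
```

Keeping the whole description under a single key means a reader only has to look up one entry and parse one string.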
Here, ``<c0>`` and so forth are dictionaries containing the metadata for each column. Each has the JSON form::

    {'name': column_name,
     'pandas_type': pandas_type,
     'numpy_type': numpy_type,
     'metadata': type_metadata}

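A per-column entry could be assembled with a small helper such as the one below. This is a minimal sketch: ``column_metadata`` is a hypothetical name, not part of pandas, and it uses the ``numpy_type`` key as it appears in the fully-formed example.

```python
def column_metadata(name, pandas_type, numpy_type, metadata=None):
    """Return one per-column entry (hypothetical helper, not a pandas API)."""
    return {
        'name': name,
        'pandas_type': pandas_type,
        'numpy_type': numpy_type,
        'metadata': metadata,
    }


# A plain integer column needs no type-specific metadata.
c0 = column_metadata('c0', 'int8', 'int8')
# A time-zone-aware column carries its zone in 'metadata'.
c3 = column_metadata('c3', 'datetimetz', 'datetime64[ns]',
                     {'timezone': 'America/Los_Angeles'})
```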
``pandas_type`` is the logical type of the column, and is one of:

- Boolean: ``'bool'``
- Integers: ``'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'``
- Floats: ``'float16', 'float32', 'float64'``
- Datetime: ``'datetime', 'datetimetz'``
- String: ``'unicode', 'bytes'``
- Categorical: ``'categorical'``
- Other Python objects: ``'object'``

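A reader of the metadata may want to reject logical type names outside this list. The following is a sketch under that assumption; ``PANDAS_TYPES`` and ``check_pandas_type`` are hypothetical names introduced only for illustration.

```python
# The logical type names listed above, grouped as in the text.
PANDAS_TYPES = frozenset([
    'bool',
    'int8', 'int16', 'int32', 'int64',
    'uint8', 'uint16', 'uint32', 'uint64',
    'float16', 'float32', 'float64',
    'datetime', 'datetimetz',
    'unicode', 'bytes',
    'categorical',
    'object',
])


def check_pandas_type(pandas_type):
    """Return the name unchanged, or raise ValueError if it is not listed."""
    if pandas_type not in PANDAS_TYPES:
        raise ValueError('unknown pandas_type: {!r}'.format(pandas_type))
    return pandas_type
```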
The ``numpy_type`` is the physical storage type of the column, which is the result of ``str(dtype)`` for the underlying NumPy array that holds the data. So for ``datetimetz`` this is ``datetime64[ns]``, and for ``categorical`` it may be any of the supported integer categorical types.
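To illustrate what ``str(dtype)`` yields, the snippet below builds a few NumPy arrays directly. The arrays are made-up stand-ins for the storage behind a column; in particular, the ``int16`` codes array is only an assumption about what a categorical column's integer storage might look like.

```python
import numpy as np

# numpy_type is str(dtype) of the array that physically holds the values.
ints = np.array([1, 2, 3], dtype='int8')
# The dtype carries no time zone, only the nanosecond unit.
stamps = np.array(['2017-01-01T00:00:00'], dtype='datetime64[ns]')
# Hypothetical integer codes standing in for a categorical column's storage.
codes = np.array([0, 2, 1, 0], dtype='int16')

numpy_types = [str(a.dtype) for a in (ints, stamps, codes)]
```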
The ``type_metadata`` is ``None`` except for:

- ``datetimetz``: ``{'timezone': zone}``, e.g. ``{'timezone': 'America/New_York'}``
- ``categorical``: ``{'num_categories': K, 'ordered': is_ordered}``
- ``object``: ``{'encoding': encoding}``

  Objects can be serialized and stored in ``BYTE_ARRAY`` Parquet columns. The encoding can be one of:

  - ``'pickle'``
  - ``'msgpack'``
  - ``'bson'``
  - ``'json'``

For types other than these, the ``'metadata'`` key can be omitted. Implementations can assume ``None`` if the key is not present.
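The two rules above, defaulting to ``None`` when the key is missing and choosing a decoder from the ``encoding`` entry, might be sketched as follows. The helper names are hypothetical. Only the two encodings available in the standard library are handled, and treating JSON payloads as UTF-8 bytes is an assumption.

```python
import json
import pickle


def type_metadata(column):
    """Return a column's 'metadata' entry, or None when the key is absent."""
    return column.get('metadata')


def decode_object(raw, column):
    """Decode one serialized value from an 'object' column (illustrative).

    'msgpack' and 'bson' would need third-party packages and are not
    handled in this sketch.
    """
    meta = type_metadata(column) or {}
    encoding = meta.get('encoding')
    if encoding == 'pickle':
        # Only unpickle data that comes from a trusted source.
        return pickle.loads(raw)
    if encoding == 'json':
        # Assumption: JSON payloads are stored as UTF-8 bytes.
        return json.loads(raw.decode('utf-8'))
    raise NotImplementedError('unsupported encoding: {!r}'.format(encoding))


plain = {'name': 'c0', 'pandas_type': 'int8', 'numpy_type': 'int8'}
pickled = {'name': 'c4', 'pandas_type': 'object', 'numpy_type': 'object',
           'metadata': {'encoding': 'pickle'}}
value = decode_object(pickle.dumps({'a': 1}), pickled)
```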
As an example of fully-formed metadata::

    {'index_columns': ['__index_level_0__'],
     'columns': [
         {'name': 'c0',
          'pandas_type': 'int8',
          'numpy_type': 'int8',
          'metadata': None},
         {'name': 'c1',
          'pandas_type': 'bytes',
          'numpy_type': 'object',
          'metadata': None},
         {'name': 'c2',
          'pandas_type': 'categorical',
          'numpy_type': 'int16',
          'metadata': {'num_categories': 1000, 'ordered': False}},
         {'name': 'c3',
          'pandas_type': 'datetimetz',
          'numpy_type': 'datetime64[ns]',
          'metadata': {'timezone': 'America/Los_Angeles'}},
         {'name': 'c4',
          'pandas_type': 'object',
          'numpy_type': 'object',
          'metadata': {'encoding': 'pickle'}},
         {'name': '__index_level_0__',
          'pandas_type': 'int64',
          'numpy_type': 'int64',
          'metadata': None}
     ],
     'pandas_version': '0.20.0'}

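When rebuilding a frame from metadata like this, a reader would presumably need to tell index levels apart from ordinary columns. The sketch below does so by matching names against ``index_columns``. ``split_columns`` is a hypothetical helper, matching by name is an assumption, and the dictionary is an abbreviated copy of the example with ``c2`` through ``c4`` left out.

```python
def split_columns(metadata):
    """Separate index-level entries from data-column entries (illustrative)."""
    index_names = set(metadata['index_columns'])
    index = [c for c in metadata['columns'] if c['name'] in index_names]
    data = [c for c in metadata['columns'] if c['name'] not in index_names]
    return index, data


# Abbreviated copy of the example metadata (c2 to c4 omitted).
example = {
    'index_columns': ['__index_level_0__'],
    'columns': [
        {'name': 'c0', 'pandas_type': 'int8',
         'numpy_type': 'int8', 'metadata': None},
        {'name': 'c1', 'pandas_type': 'bytes',
         'numpy_type': 'object', 'metadata': None},
        {'name': '__index_level_0__', 'pandas_type': 'int64',
         'numpy_type': 'int64', 'metadata': None},
    ],
    'pandas_version': '0.20.0',
}
index, data = split_columns(example)
```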