.. currentmodule:: pandas

.. ipython:: python
   :suppress:

   import numpy as np
   np.random.seed(123456)
   np.set_printoptions(precision=4, suppress=True)
   import pandas as pd
   pd.options.display.max_rows = 15

Developer

This section will focus on downstream applications of pandas.

Storing pandas DataFrame objects in Apache Parquet format

The Apache Parquet format provides key-value metadata at the file and column level, stored in the footer of the Parquet file:

5: optional list<KeyValue> key_value_metadata

where KeyValue is

struct KeyValue {
  1: required string key
  2: optional string value
}

So that a pandas.DataFrame can be faithfully reconstructed, we store a pandas metadata key in the FileMetaData, with the value stored as:

{'index_columns': ['__index_level_0__', '__index_level_1__', ...],
 'columns': [<c0>, <c1>, ...],
 'pandas_version': $VERSION}

Here, <c0> and so forth are dictionaries containing the metadata for each column. Each has the JSON form:

{'name': column_name,
 'pandas_type': pandas_type,
 'numpy_type': numpy_type,
 'metadata': type_metadata}
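
The structure above can be assembled and serialized with the standard library alone. The following is a minimal sketch, not the pandas implementation: the helper names make_column_metadata and make_pandas_metadata are hypothetical, and the column key is spelled 'numpy_type' as in the fully-formed example later in this section.

```python
import json


def make_column_metadata(name, pandas_type, numpy_type, metadata=None):
    """Build one per-column dictionary (hypothetical helper, not a pandas API)."""
    return {
        "name": name,
        "pandas_type": pandas_type,
        "numpy_type": numpy_type,
        "metadata": metadata,
    }


def make_pandas_metadata(columns, index_columns, pandas_version):
    """Build the top-level dictionary stored under the pandas key."""
    return {
        "index_columns": list(index_columns),
        "columns": list(columns),
        "pandas_version": pandas_version,
    }


columns = [
    make_column_metadata("c0", "int8", "int8"),
    make_column_metadata("__index_level_0__", "int64", "int64"),
]
meta = make_pandas_metadata(columns, ["__index_level_0__"], "0.20.0")

# KeyValue.value is a string, so the dictionary is serialized to JSON text.
encoded = json.dumps(meta)
decoded = json.loads(encoded)
```

Because KeyValue values are strings, a Python None in the dictionary becomes a JSON null in the stored text and comes back as None when parsed.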

pandas_type is the logical type of the column, and is one of:

  • Boolean: 'bool'
  • Integers: 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64'
  • Floats: 'float16', 'float32', 'float64'
  • Datetime: 'datetime', 'datetimetz'
  • String: 'unicode', 'bytes'
  • Categorical: 'categorical'
  • Other Python objects: 'object'

The numpy_type is the physical storage type of the column, which is the result of str(dtype) for the underlying NumPy array that holds the data. For datetimetz this is datetime64[ns]; for categorical, it may be any of the integer types supported for the category codes.

The type_metadata is None except for:

  • datetimetz: {'timezone': zone}, e.g. {'timezone': 'America/New_York'}
  • categorical: {'num_categories': K, 'ordered': is_ordered}
  • object: {'encoding': encoding}

Objects can be serialized and stored in BYTE_ARRAY Parquet columns. The encoding can be one of:

  • 'pickle'
  • 'msgpack'
  • 'bson'
  • 'json'
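
As an illustration of what an encoding value implies for a writer and a reader, here is a minimal sketch covering only the two encodings available in the Python standard library. The functions encode_object and decode_object are hypothetical names for this example; 'msgpack' and 'bson' need third-party packages and are deliberately left unimplemented.

```python
import json
import pickle


def encode_object(obj, encoding):
    """Serialize obj to bytes for a BYTE_ARRAY value (hypothetical helper)."""
    if encoding == "pickle":
        return pickle.dumps(obj)
    if encoding == "json":
        return json.dumps(obj).encode("utf-8")
    # 'msgpack' and 'bson' are outside the scope of this sketch.
    raise NotImplementedError(f"encoding {encoding!r} is not covered by this sketch")


def decode_object(data, encoding):
    """Invert encode_object using the encoding recorded in the column metadata."""
    if encoding == "pickle":
        return pickle.loads(data)
    if encoding == "json":
        return json.loads(data.decode("utf-8"))
    raise NotImplementedError(f"encoding {encoding!r} is not covered by this sketch")


value = {"a": [1, 2, 3]}
payload = encode_object(value, "json")
restored = decode_object(payload, "json")
```

A reader should only unpickle data from a trusted source, since pickle.loads can execute arbitrary code.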

For types other than datetimetz, categorical, and object, the 'metadata' key can be omitted. Implementations can assume None if the key is not present.
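
A reader can apply this rule with a dictionary lookup that defaults to None. The sketch below is an assumption about how a consumer might do so, not part of pandas: the name get_type_metadata and the REQUIRED_METADATA_KEYS table are hypothetical, and raising ValueError on missing keys is one possible policy among several.

```python
# Keys expected in 'metadata' for the three types that carry type metadata.
REQUIRED_METADATA_KEYS = {
    "datetimetz": {"timezone"},
    "categorical": {"num_categories", "ordered"},
    "object": {"encoding"},
}


def get_type_metadata(column):
    """Return the type metadata of one column dictionary, or None."""
    metadata = column.get("metadata")  # an absent key is treated as None
    required = REQUIRED_METADATA_KEYS.get(column["pandas_type"])
    if required is None:
        return metadata
    missing = required - set(metadata or {})
    if missing:
        raise ValueError(
            f"column {column['name']!r} is missing metadata keys: {sorted(missing)}"
        )
    return metadata


plain = get_type_metadata(
    {"name": "c0", "pandas_type": "int8", "numpy_type": "int8"}
)
tz = get_type_metadata(
    {
        "name": "c3",
        "pandas_type": "datetimetz",
        "numpy_type": "datetime64[ns]",
        "metadata": {"timezone": "America/New_York"},
    }
)
```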

As an example of fully-formed metadata:

{'index_columns': ['__index_level_0__'],
 'columns': [
     {'name': 'c0',
      'pandas_type': 'int8',
      'numpy_type': 'int8',
      'metadata': None},
     {'name': 'c1',
      'pandas_type': 'bytes',
      'numpy_type': 'object',
      'metadata': None},
     {'name': 'c2',
      'pandas_type': 'categorical',
      'numpy_type': 'int16',
      'metadata': {'num_categories': 1000, 'ordered': False}},
     {'name': 'c3',
      'pandas_type': 'datetimetz',
      'numpy_type': 'datetime64[ns]',
      'metadata': {'timezone': 'America/Los_Angeles'}},
     {'name': 'c4',
      'pandas_type': 'object',
      'numpy_type': 'object',
      'metadata': {'encoding': 'pickle'}},
     {'name': '__index_level_0__',
      'pandas_type': 'int64',
      'numpy_type': 'int64',
      'metadata': None}
 ],
 'pandas_version': '0.20.0'}
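
On the reading side, a consumer parses the stored JSON text and uses index_columns to tell index levels apart from ordinary columns. The following is a minimal sketch using a shortened version of the example above; the variable names are illustrative only, and how the text is obtained from the Parquet footer depends on the Parquet library in use.

```python
import json

# Stand-in for the string stored under the pandas key in the file metadata.
footer_value = json.dumps(
    {
        "index_columns": ["__index_level_0__"],
        "columns": [
            {"name": "c0", "pandas_type": "int8",
             "numpy_type": "int8", "metadata": None},
            {"name": "c3", "pandas_type": "datetimetz",
             "numpy_type": "datetime64[ns]",
             "metadata": {"timezone": "America/Los_Angeles"}},
            {"name": "__index_level_0__", "pandas_type": "int64",
             "numpy_type": "int64", "metadata": None},
        ],
        "pandas_version": "0.20.0",
    }
)

meta = json.loads(footer_value)
index_names = set(meta["index_columns"])

# Split the column entries into data columns and index levels.
data_columns = [c["name"] for c in meta["columns"] if c["name"] not in index_names]
index_columns = [c["name"] for c in meta["columns"] if c["name"] in index_names]

# Look up a single column's entry by name.
by_name = {c["name"]: c for c in meta["columns"]}
c3_timezone = by_name["c3"]["metadata"]["timezone"]
```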