ENH: support for msgpack serialization/deserialization #3831

Merged
merged 4 commits into from
Oct 1, 2013
13 changes: 13 additions & 0 deletions LICENSES/MSGPACK_LICENSE
@@ -0,0 +1,13 @@
Copyright (C) 2008-2011 INADA Naoki <[email protected]>

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
33 changes: 33 additions & 0 deletions LICENSES/MSGPACK_NUMPY_LICENSE
@@ -0,0 +1,33 @@
.. -*- rst -*-

License
=======

Copyright (c) 2013, Lev Givon.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided
with the distribution.
* Neither the name of Lev Givon nor the names of any
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
68 changes: 68 additions & 0 deletions doc/source/io.rst
@@ -36,6 +36,7 @@ object.
* ``read_hdf``
* ``read_sql``
* ``read_json``
* ``read_msgpack`` (experimental)
* ``read_html``
* ``read_stata``
* ``read_clipboard``
@@ -48,6 +49,7 @@ The corresponding ``writer`` functions are object methods that are accessed like
* ``to_hdf``
* ``to_sql``
* ``to_json``
* ``to_msgpack`` (experimental)
* ``to_html``
* ``to_stata``
* ``to_clipboard``
@@ -1732,6 +1734,72 @@ module is installed you can use it as a xlsx writer engine as follows:

.. _io.hdf5:

Serialization
-------------

.. _io.msgpack:

msgpack (experimental)
~~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 0.13.0

Starting in 0.13.0, pandas supports the ``msgpack`` format for
object serialization. This is a lightweight, portable binary format, similar
to binary JSON, that is highly space efficient and provides good performance
for both writing (serialization) and reading (deserialization).
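
To make the "similar to binary JSON" and "space efficient" claims concrete, here is a minimal sketch of a hand-written encoder for a small subset of the MessagePack type tags (nil, booleans, small non-negative integers, 64-bit floats, short strings, short lists and dicts). The function name ``pack`` is hypothetical and this is not how pandas or ``msgpack-python`` implement packing; it only shows why small values take fewer bytes than their JSON text.

```python
import json
import struct


def pack(obj):
    """Encode a small subset of Python values with MessagePack type tags."""
    if obj is None:
        return b"\xc0"  # nil
    if obj is True:
        return b"\xc3"
    if obj is False:
        return b"\xc2"
    if isinstance(obj, int) and 0 <= obj <= 0x7F:
        return bytes([obj])  # positive fixint: the value is its own tag
    if isinstance(obj, float):
        return b"\xcb" + struct.pack(">d", obj)  # float 64, big-endian
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) > 31:
            raise ValueError("sketch only handles fixstr (up to 31 bytes)")
        return bytes([0xA0 | len(raw)]) + raw  # fixstr: length lives in the tag
    if isinstance(obj, list):
        if len(obj) > 15:
            raise ValueError("sketch only handles fixarray (up to 15 items)")
        return bytes([0x90 | len(obj)]) + b"".join(pack(x) for x in obj)
    if isinstance(obj, dict):
        if len(obj) > 15:
            raise ValueError("sketch only handles fixmap (up to 15 pairs)")
        body = b"".join(pack(k) + pack(v) for k, v in obj.items())
        return bytes([0x80 | len(obj)]) + body
    raise TypeError("type not covered by this sketch: %r" % type(obj))


record = {"a": 1, "b": [1, 2, 3]}
packed = pack(record)
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
print(len(packed), len(as_json))  # 10 bytes packed vs. 19 bytes of compact JSON
```

The saving comes from folding the type and the length (or the value itself) into a single tag byte instead of spelling out quotes, brackets and separators.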

.. warning::

   This is a very new feature of pandas. We intend to provide certain
   optimizations in the I/O of ``msgpack`` data. Because this is marked
   as an EXPERIMENTAL LIBRARY, the storage format may not be stable until a future release.

.. ipython:: python

   df = DataFrame(np.random.rand(5,2),columns=list('AB'))
   df.to_msgpack('foo.msg')
   pd.read_msgpack('foo.msg')
   s = Series(np.random.rand(5),index=date_range('20130101',periods=5))

You can pass a list of objects and you will receive them back on deserialization.

.. ipython:: python

   pd.to_msgpack('foo.msg', df, 'foo', np.array([1,2,3]), s)
   pd.read_msgpack('foo.msg')

You can pass ``iterator=True`` to iterate over the unpacked results.

.. ipython:: python

   for o in pd.read_msgpack('foo.msg',iterator=True):
       print(o)

You can pass ``append=True`` to the writer to append to an existing pack.

.. ipython:: python

   df.to_msgpack('foo.msg',append=True)
   pd.read_msgpack('foo.msg')
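
The multi-object, ``iterator=True`` and ``append=True`` behaviours all rest on one idea: each serialized object is self-delimiting, so several can sit back to back in one file and be read one at a time. The sketch below illustrates that idea with the standard-library ``pickle`` module as a stand-in (it is not the msgpack code path). The helper names ``write_objects`` and ``iter_objects`` are hypothetical.

```python
import os
import pickle
import tempfile


def write_objects(path, *objs, append=False):
    """Write each object as its own record; optionally append to the file."""
    mode = "ab" if append else "wb"
    with open(path, mode) as fh:
        for obj in objs:
            pickle.dump(obj, fh)


def iter_objects(path):
    """Lazily yield the stored objects in the order they were written."""
    with open(path, "rb") as fh:
        while True:
            try:
                yield pickle.load(fh)
            except EOFError:  # no more records in the file
                return


path = os.path.join(tempfile.mkdtemp(), "foo.bin")
write_objects(path, {"A": [1, 2]}, "foo", [1, 2, 3])
write_objects(path, 4.5, append=True)  # adds a fourth record

loaded = list(iter_objects(path))
print(loaded)
```

Reading everything at once is then just ``list(...)`` over the same generator, which mirrors the difference between the default read and the iterator form.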

Unlike other io methods, ``to_msgpack`` is available both on a per-object basis,
``df.to_msgpack()``, and as the top-level ``pd.to_msgpack(...)``, where you
can pack arbitrary collections of Python lists, dicts, and scalars while intermixing
pandas objects.

.. ipython:: python

   pd.to_msgpack('foo2.msg', { 'dict' : [ { 'df' : df }, { 'string' : 'foo' }, { 'scalar' : 1. }, { 's' : s } ] })
   pd.read_msgpack('foo2.msg')
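
Packing nested containers that mix plain Python values with library objects is commonly done with a pair of hooks: one that turns an unknown object into a tagged dict on the way out, and one that rebuilds it on the way in. The sketch below shows that pattern with the standard-library ``json`` module as a stand-in. The ``Frame`` class, the ``__type__`` tag and the ``encode``/``decode`` helpers are all hypothetical, and nothing here is taken from the pandas implementation.

```python
import json


class Frame:
    """Hypothetical stand-in for a pandas object; not a real pandas class."""

    def __init__(self, columns, rows):
        self.columns = columns
        self.rows = rows

    def __eq__(self, other):
        return (isinstance(other, Frame)
                and (self.columns, self.rows) == (other.columns, other.rows))


def encode(obj):
    """Called only for objects the serializer does not know natively."""
    if isinstance(obj, Frame):
        return {"__type__": "Frame", "columns": obj.columns, "rows": obj.rows}
    raise TypeError("cannot serialize %r" % type(obj))


def decode(d):
    """Called for every decoded dict; rebuild tagged ones, pass others through."""
    if d.get("__type__") == "Frame":
        return Frame(d["columns"], d["rows"])
    return d


payload = {"dict": [{"df": Frame(["A", "B"], [[1, 2], [3, 4]])},
                    {"string": "foo"},
                    {"scalar": 1.0}]}

text = json.dumps(payload, default=encode)
restored = json.loads(text, object_hook=decode)
print(type(restored["dict"][0]["df"]).__name__)
```

Because the hooks are applied recursively by the serializer, the custom object can appear at any depth inside lists and dicts without special handling.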

.. ipython:: python
   :suppress:
   :okexcept:

   os.remove('foo.msg')
   os.remove('foo2.msg')

HDF5 (PyTables)
---------------

24 changes: 13 additions & 11 deletions doc/source/release.rst
@@ -64,17 +64,19 @@ New features
Experimental Features
~~~~~~~~~~~~~~~~~~~~~

- The new :func:`~pandas.eval` function implements expression evaluation using
  ``numexpr`` behind the scenes. This results in large speedups for complicated
  expressions involving large DataFrames/Series.
- :class:`~pandas.DataFrame` has a new :meth:`~pandas.DataFrame.eval` that
  evaluates an expression in the context of the ``DataFrame``.
- A :meth:`~pandas.DataFrame.query` method has been added that allows
  you to select elements of a ``DataFrame`` using a natural query syntax nearly
  identical to Python syntax.
- ``pd.eval`` and friends now evaluate operations involving ``datetime64``
  objects in Python space because ``numexpr`` cannot handle ``NaT`` values
  (:issue:`4897`).
- Add msgpack support via ``pd.read_msgpack()`` and ``pd.to_msgpack()/df.to_msgpack()`` for serialization
  of arbitrary pandas (and Python) objects in a lightweight portable binary format (:issue:`686`)

Improvements to existing features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
41 changes: 32 additions & 9 deletions doc/source/v0.13.0.txt
@@ -464,6 +464,15 @@ Enhancements
t = Timestamp('20130101 09:01:02')
t + pd.datetools.Nano(123)

- The ``isin`` method plays nicely with boolean indexing. To get the rows where each condition is met:

  .. ipython:: python

     mask = df.isin({'A': [1, 2], 'B': ['e', 'f']})
     df[mask.all(1)]

  See the :ref:`documentation<indexing.basics.indexing_isin>` for more.

.. _whatsnew_0130.experimental:

Experimental
@@ -553,21 +562,35 @@ Experimental
For more details see the :ref:`indexing documentation on query
<indexing.query>`.

- DataFrame now has an ``isin`` method that can be used to easily check whether the DataFrame's values are contained in an iterable. Use a dictionary if you'd like to check specific iterables for specific columns or rows.

  .. ipython:: python

     df = pd.DataFrame({'A': [1, 2, 3], 'B': ['d', 'e', 'f']})
     df.isin({'A': [1, 2], 'B': ['e', 'f']})

  The ``isin`` method plays nicely with boolean indexing. To get the rows where each condition is met:

  .. ipython:: python

     mask = df.isin({'A': [1, 2], 'B': ['e', 'f']})
     df[mask.all(1)]

  See the :ref:`documentation<indexing.basics.indexing_isin>` for more.

- ``pd.read_msgpack()`` and ``pd.to_msgpack()`` are now a supported method of serialization
  of arbitrary pandas (and Python) objects in a lightweight portable binary format. :ref:`See the docs<io.msgpack>`

  .. warning::

     Since this is an EXPERIMENTAL LIBRARY, the storage format may not be stable until a future release.

  .. ipython:: python

     df = DataFrame(np.random.rand(5,2),columns=list('AB'))
     df.to_msgpack('foo.msg')
     pd.read_msgpack('foo.msg')

     s = Series(np.random.rand(5),index=date_range('20130101',periods=5))
     pd.to_msgpack('foo.msg', df, s)
     pd.read_msgpack('foo.msg')

  You can pass ``iterator=True`` to iterate over the unpacked results.

  .. ipython:: python

     for o in pd.read_msgpack('foo.msg',iterator=True):
         print(o)

  .. ipython:: python
     :suppress:
     :okexcept:

     os.remove('foo.msg')

.. _whatsnew_0130.refactoring:

19 changes: 19 additions & 0 deletions pandas/core/generic.py
@@ -805,6 +805,25 @@ def to_hdf(self, path_or_buf, key, **kwargs):
from pandas.io import pytables
return pytables.to_hdf(path_or_buf, key, self, **kwargs)

    def to_msgpack(self, path_or_buf, **kwargs):
        """
        msgpack (serialize) object to input file path

        THIS IS AN EXPERIMENTAL LIBRARY and the storage format
        may not be stable until a future release.

        Parameters
        ----------
        path_or_buf : string File path
        append : boolean whether to append to an existing msgpack
                 (default is False)
        compress : type of compressor (zlib or blosc), default to None (no compression)
        """

        # The object being serialized is ``self``; extra options pass through.
        from pandas.io import packers
        return packers.to_msgpack(path_or_buf, self, **kwargs)
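
The method above is a thin wrapper: it hands ``self`` and any keyword options to a module-level writer. A minimal sketch of that delegation shape, together with an optional ``compress`` switch, might look like the following. Everything here is hypothetical (``to_packed``, ``from_packed`` and the ``Table`` class); ``json`` stands in for the serializer, and only the ``zlib`` option is shown because ``blosc`` is not in the standard library.

```python
import json
import zlib


def to_packed(*objs, compress=None):
    """Module-level writer: serialize any number of objects to bytes."""
    data = json.dumps(list(objs)).encode("utf-8")
    if compress is None:
        return data
    if compress == "zlib":
        return zlib.compress(data)
    raise ValueError("unsupported compressor: %r" % (compress,))


def from_packed(data, compress=None):
    """Inverse of to_packed; the caller says which compressor was used."""
    if compress == "zlib":
        data = zlib.decompress(data)
    return json.loads(data.decode("utf-8"))


class Table:
    """Hypothetical container whose method forwards to the module function."""

    def __init__(self, values):
        self.values = values

    def to_packed(self, **kwargs):
        # Same shape as the method above: supply this object's own data as the
        # thing to write, and pass every other option straight through.
        return to_packed(self.values, **kwargs)


values = list(range(1000))
raw = Table(values).to_packed()
small = Table(values).to_packed(compress="zlib")
print(len(raw), len(small))
```

Keeping the options in ``**kwargs`` means the method never has to change when the module-level writer gains a new keyword.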

def to_pickle(self, path):
"""
Pickle (serialize) object to input file path
1 change: 1 addition & 0 deletions pandas/io/api.py
@@ -11,3 +11,4 @@
from pandas.io.sql import read_sql
from pandas.io.stata import read_stata
from pandas.io.pickle import read_pickle, to_pickle
from pandas.io.packers import read_msgpack, to_msgpack