
ENH: support for msgpack serialization/deserialization #3831


Merged: 4 commits, Oct 1, 2013

Conversation

jreback
Contributor

@jreback jreback commented Jun 10, 2013

extension of #3828

ToDo

  • remove use of pytest in test_msgpack
  • PERF!
msgpack serialization/deserialization

support all pandas objects: Timestamp, Period, all index types, Series, DataFrame, Panel, Sparse suite
docs included (in io.rst)
iterator support
top-level api support

compression and direct calls to in-line msgpack (enabled via #3828) will wait for 0.13+

closes #686
Benchmarking: 50k rows of 2x columns of floats with a datetime index

In [4]: %timeit df.to_msgpack('foo')
100 loops, best of 3: 16.5 ms per loop

In [2]: %timeit df.to_pickle('foo')
10 loops, best of 3: 12.9 ms per loop

In [6]: %timeit df.to_csv('foo')
1 loops, best of 3: 470 ms per loop

In [11]: %timeit df.to_hdf('foo2','df',mode='w')
10 loops, best of 3: 20.1 ms per loop

In [13]: %timeit df.to_hdf('foo2','df',mode='w',table=True)
10 loops, best of 3: 81.4 ms per loop

In [5]: %timeit pd.read_msgpack('foo')
100 loops, best of 3: 16.3 ms per loop

In [3]: %timeit pd.read_pickle('foo')
1000 loops, best of 3: 1.28 ms per loop

In [7]: %timeit pd.read_csv('foo')
10 loops, best of 3: 46.1 ms per loop

In [12]: %timeit pd.read_hdf('foo2','df')
100 loops, best of 3: 5.64 ms per loop

In [14]: %timeit pd.read_hdf('foo2','df')
100 loops, best of 3: 9.39 ms per loop

In [1]: df = DataFrame(randn(10,2),
   ...:                      columns=list('AB'),
   ...:                      index=date_range('20130101',periods=10))

In [2]: pd.to_msgpack('foo.msg',df)

In [3]: pd.read_msgpack('foo.msg')
Out[3]: 
                   A         B
2013-01-01  0.676700 -1.702599
2013-01-02 -0.070164 -1.368716
2013-01-03 -0.877145 -1.427964
2013-01-04 -0.295715 -0.176954
2013-01-05  0.566986  0.588918
2013-01-06 -0.307070  1.541773
2013-01-07  1.302388  0.689701
2013-01-08  0.165292  0.273496
2013-01-09 -3.492113 -1.178075
2013-01-10 -1.069521  0.848614

@jreback
Contributor Author

jreback commented Jun 10, 2013

latest commit cdc870d basically puts float64/int64 directly in the msgpack pack loop, so getting better
I think to do much better we need to actually write out the bytes of an array all at once (rather than 1 func call per element)
hence we basically need to write: msgpack_pack_double_array

packers_write_msgpack                        |  25.4264 |  41.4937 |   0.6128 |
packers_write_hdf_store                      |  11.0367 |  11.3930 |   0.9687 |
packers_write_pickle                         |   6.8083 |   6.8933 |   0.9877 |
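
The idea of writing an array's bytes in one call, rather than one call per element, can be illustrated with the standard library alone. This is only a sketch of the principle, not the Cython code from this PR: `pack_per_element` and `pack_bulk` are hypothetical names, and `msgpack_pack_double_array` in the comment above is a function that still had to be written at that point.

```python
import struct
from array import array

# Hypothetical sample data standing in for a float64 column.
values = [0.5 * i for i in range(1000)]


def pack_per_element(vals):
    """One struct.pack call per float, analogous to one pack call per element."""
    # "=d" means native byte order, standard 8-byte double.
    return b"".join(struct.pack("=d", v) for v in vals)


def pack_bulk(vals):
    """Copy the whole buffer out in a single call."""
    # array('d') stores C doubles contiguously; tobytes() dumps them at once.
    return array("d", vals).tobytes()


# Both approaches produce the same bytes; only the number of calls differs.
per_element = pack_per_element(values)
bulk = pack_bulk(values)
```

Timing the two functions (for example with `timeit`) should show the bulk version doing far less per-call work, which is the effect the comment is after.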

@jreback
Contributor Author

jreback commented Jun 10, 2013

getting closer.......was dumbly serializing indices....

so now float64 and int64 ndarrays are serialized pretty fast...and that's the basis for most data and indices.....

reading is still somewhat slow....next step is to see if we can create a custom type

packers_write_msgpack                        |   16.0396 |   41.7344 |   0.3843 |
packers_write_csv                            | 1516.5110 | 1541.2530 |   0.9839 |
packers_write_hdf_store                      |   11.0253 |   11.1123 |   0.9922 |
packers_write_pickle                         |    6.5717 |    6.5117 |   1.0092 |
packers_write_hdf_table                      |  424.1536 |  410.7641 |   1.0326 |

@wesm
Member

wesm commented Jun 10, 2013

I think this needs to wait til 0.12 and see some more work (esp on performance)

@jreback
Contributor Author

jreback commented Jun 10, 2013

that's fine

next step is prob a custom type otherwise reading is pretty slow

@jreback
Contributor Author

jreback commented Jul 24, 2013

@wesm
this is also ready to go....but when you have a chance, if you want to take a stab at some perf improvements...latest figures are at the top

@wesm
Member

wesm commented Jul 25, 2013

Do we know what's contributing to the slow read performance on msgpack'd Series?

@jreback
Contributor Author

jreback commented Jul 25, 2013

it's processing it one structure at a time, e.g. read next structure, then decode

the convert/unconvert functions are in python so there is room to put them in cython (I tried that and it didn't really change much...but maybe I didn't do it correctly)

ndarrays are basically stored as lists, so calling tolist.

@jreback
Contributor Author

jreback commented Sep 10, 2013

@wesm have you had a chance to look at this? any real perf gain will have to come from defining a numpy type in msgpack I think....add to 0.13 to get tires kicked? (could mark as experimental)

@jreback
Contributor Author

jreback commented Sep 10, 2013

@y-p @hayd @cpcloud @jtratner thoughts?

@wesm
Member

wesm commented Sep 10, 2013

well, the concern would be stability of the format, do you have a sense of that? I haven't had a chance to look closely yet

@jreback
Contributor Author

jreback commented Sep 10, 2013

I believe the format itself is pretty stable. What needs to be done to get some really good performance is extend the msgpack spec via the extension types: https://github.com/msgpack/msgpack/blob/master/spec.md#types-extension-type, prob just for numpy arrays. So that these types become more efficient to read.

Not sure about the time scale for 0.13. But even if the format is subsequently changed (e.g. more/better extension types are added), I don't think back compat is that hard to maintain.
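
To make the extension-type idea concrete, here is a minimal sketch of framing a float64 buffer as a msgpack ext value using only the standard library. The ext 8/16/32 markers (0xC7, 0xC8, 0xC9) come from the msgpack spec linked above; everything else is an assumption for illustration: the type code `FLOAT64_ARRAY_EXT`, the helper names, and the choice to store the payload in native byte order are hypothetical and are not the format this PR ships.

```python
import struct
from array import array

# Hypothetical application-defined ext type code (the spec allows 0..127).
FLOAT64_ARRAY_EXT = 1


def pack_ext(type_code, payload):
    """Frame payload as msgpack ext 8 / ext 16 / ext 32 (fixext skipped for brevity)."""
    n = len(payload)
    if n < 2 ** 8:
        header = struct.pack(">BB", 0xC7, n)
    elif n < 2 ** 16:
        header = struct.pack(">BH", 0xC8, n)
    elif n < 2 ** 32:
        header = struct.pack(">BI", 0xC9, n)
    else:
        raise ValueError("payload too large for a msgpack ext value")
    return header + struct.pack("b", type_code) + payload


def unpack_ext(buf):
    """Return (type_code, payload) from a single ext 8/16/32 value."""
    sizes = {0xC7: (">B", 2), 0xC8: (">H", 3), 0xC9: (">I", 5)}
    if buf[0] not in sizes:
        raise ValueError("not an ext 8/16/32 value")
    fmt, offset = sizes[buf[0]]
    (n,) = struct.unpack_from(fmt, buf, 1)
    (type_code,) = struct.unpack_from("b", buf, offset)
    return type_code, bytes(buf[offset + 1:offset + 1 + n])


def pack_float64_array(values):
    # Assumption: native byte order; a real format would have to record it.
    return pack_ext(FLOAT64_ARRAY_EXT, array("d", values).tobytes())


def unpack_float64_array(buf):
    type_code, payload = unpack_ext(buf)
    if type_code != FLOAT64_ARRAY_EXT:
        raise ValueError("unexpected ext type code: %d" % type_code)
    out = array("d")
    out.frombytes(payload)  # one bulk copy instead of decoding item by item
    return out.tolist()
```

With a scheme like this the reader sees one opaque blob per array and can rebuild it in a single copy, instead of decoding a msgpack list element by element.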

@drasch
Contributor

drasch commented Sep 18, 2013

see msgpack_numpy https://github.com/lebedov/msgpack_numpy

@jreback
Contributor Author

jreback commented Sep 18, 2013

@drasch already incorporated that in here. What I am talking about is cythonizing the numpy array itself.

@jreback
Contributor Author

jreback commented Sep 30, 2013

@wesm @jtratner @cpcloud clamor for this in 0.13? defer?

@wesm
Member

wesm commented Sep 30, 2013

I'd like to dig through the serialization performance issue. Can we merge/ship this and simply note in the docstrings / docs that the binary format should not be expected to be stable for a while?

@wesm
Member

wesm commented Sep 30, 2013

Not sure when I'll get to the digging though, but I want to.

@jreback
Contributor Author

jreback commented Sep 30, 2013

sure....mark as experimental (not that stops anyone from complaining when it breaks/changes......)

@jreback
Contributor Author

jreback commented Oct 1, 2013

@jtratner @cpcloud this is big (but straightforward), prob needs some more tests, but anything glaring?

@jtratner
Contributor

jtratner commented Oct 1, 2013

I'll take a look soon. Can you run something like flake8 over the Python parts?

@jtratner
Contributor

jtratner commented Oct 1, 2013

Is the kwarg really supposed to be unicode_erros like it says in the docstring?

@jreback
Contributor Author

jreback commented Oct 1, 2013

that looks like a typo from the original msgpack code (@wesm incorporated it here), but mostly just copy/paste
(should be unicode_errors)

@jreback
Contributor Author

jreback commented Oct 1, 2013

pep8d the python files...

@jtratner
Contributor

jtratner commented Oct 1, 2013

Pep8 is good, but pyflakes or pylint also check for undefined variables and redefinition of variables, which can be helpful

@jtratner
Contributor

jtratner commented Oct 1, 2013

Would it make sense to make the baseclass for UnpackException IOError or something like that?

These are definitely trivial notes. Still looking at it for more substantive things.

@jreback
Contributor Author

jreback commented Oct 1, 2013

yep....IOError sounds good
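
The practical effect of that base-class suggestion can be sketched as follows. This is an illustration only: the `UnpackException` defined here is a stand-in, `read_packed` is a hypothetical helper, and the thread does not show whether the merged code adopted the change.

```python
class UnpackException(IOError):
    """Stand-in for the msgpack exception, rebased on IOError as suggested."""


def read_packed(data):
    # Hypothetical reader: fail the way a truncated or empty stream might.
    if not data:
        raise UnpackException("no data to unpack")
    return data


# Callers that already guard file reads with `except IOError` would now
# also catch unpacking failures without knowing about msgpack at all.
try:
    read_packed(b"")
except IOError as exc:
    caught = type(exc).__name__
```

On Python 3, `IOError` is an alias of `OSError`, so the same handler also works there.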

.. ipython:: python
.. warning::

Since this is EXPERIMENTAL LIBRARY, the storage format may not be stable until a future release.
Member


need an "an" after "this is"


    cdef inline pack_pair(self, object k, object v, int nest_limit):
        ret = self._pack(k, nest_limit-1)
        if ret != 0: raise Exception("cannot pack : %s" % k)
Contributor


can we change this to UnpackException or something? I know I'm very anti-Exception.

wesm and others added 4 commits October 1, 2013 09:13
DOC: install.rst mention

DOC: added license from msgpack_numpy

PERF: changed Timestamp and DatetimeIndex serialization for speedups

      add vb_suite benchmarks

ENH: added to_msgpack method in generic.py, and default import into pandas

TST: all packers to always be imported, fail on usage with no msgpack installed

DOC: added mentions in release notes, v0.11.1, basics

ENH: provide automatic list if multiple args passed to to_msgpack

DOC: changed docs to 0.12

ENH: iterator support for stream unpacking

Conflicts:

	RELEASE.rst

ENH: added support for Panel,SparseSeries,SparseDataFrame,SparsePanel,IntIndex,BlockIndex

ENH: handle np.datetime64,np.timedelta64,date,timedelta types

TST: added compression (zlib/blosc) via big hack

DOC: moved back to 0.11.1 docs

BLD: integrated with built-in msgpack

DOC: io.rst fixes

PERF: update vb_suite for packers

TST: fix for test_list_float_complex test?

PERF: prototype for packing faster

PERF: was still using tolist on indices

DOC: v0.13.0.txt and release notes

DOC: release notes

PERF: revamped packers vbench to use packers,csv,pickle,hdf_store,hdf_table

TST: better test comparisons for numpy types

BLD: py3k compat
TST: removed pytest in favor of nosetest for tests/test_msgpack
@jreback
Contributor Author

jreback commented Oct 1, 2013

bombs away

jreback added a commit that referenced this pull request Oct 1, 2013
ENH: support for msgpack serialization/deserialization
@jreback jreback merged commit 16d03b7 into pandas-dev:master Oct 1, 2013
5 participants