ENH: support for msgpack serialization/deserialization #3831

jreback · 2013-06-10T14:10:57Z

extension of #3828

ToDo

remove use of pytest in test_msgpack
PERF!

msgpack serialization/deserialization

support all pandas objects: Timestamp, Period, all index types, Series, DataFrame, Panel, Sparse suite
docs included (in io.rst)
iterator support
top-level api support
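As a rough illustration of what "iterator support" means for a stream of back-to-back msgpack objects, here is a minimal stdlib-only sketch. It is not the pandas implementation: it handles only msgpack `float 64` values (marker `0xcb` plus 8 big-endian bytes), and the helper names `write_floats` and `iter_floats` are hypothetical.

```python
import io
import struct


def write_floats(stream, values):
    """Append each value to the stream as a standalone msgpack float 64."""
    for v in values:
        stream.write(b"\xcb" + struct.pack(">d", v))


def iter_floats(stream):
    """Yield one decoded object at a time instead of loading the whole stream."""
    while True:
        chunk = stream.read(9)
        if not chunk:
            return  # clean end of stream
        if len(chunk) < 9 or chunk[0] != 0xCB:
            raise ValueError("truncated or unsupported object in stream")
        yield struct.unpack(">d", chunk[1:])[0]


buf = io.BytesIO()
write_floats(buf, [1.0, 2.5, -3.0])
buf.seek(0)
for value in iter_floats(buf):
    print(value)
```

The point of the generator is that a consumer can stop early or process objects one by one without materializing the full file.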

compression and direct calls to in-line msgpack (enabled via #3828) will wait for 0.13+

closes #686

Benchmarking: 50k rows of 2x columns of floats with a datetime index

In [4]: %timeit df.to_msgpack('foo')
100 loops, best of 3: 16.5 ms per loop

In [2]: %timeit df.to_pickle('foo')
10 loops, best of 3: 12.9 ms per loop

In [6]: %timeit df.to_csv('foo')
1 loops, best of 3: 470 ms per loop

In [11]: %timeit df.to_hdf('foo2','df',mode='w')
10 loops, best of 3: 20.1 ms per loop

In [13]: %timeit df.to_hdf('foo2','df',mode='w',table=True)
10 loops, best of 3: 81.4 ms per loop

In [5]: %timeit pd.read_msgpack('foo')
100 loops, best of 3: 16.3 ms per loop

In [3]: %timeit pd.read_pickle('foo')
1000 loops, best of 3: 1.28 ms per loop

In [7]: %timeit pd.read_csv('foo')
10 loops, best of 3: 46.1 ms per loop

In [12]: %timeit pd.read_hdf('foo2','df')
100 loops, best of 3: 5.64 ms per loop

In [14]: %timeit pd.read_hdf('foo2','df')
100 loops, best of 3: 9.39 ms per loop
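For readers who want to reproduce this kind of "best of 3" comparison outside IPython, the following is a minimal sketch using only the standard library. It is an assumption-laden stand-in: it times `pickle` and `csv` writes of plain Python rows rather than a DataFrame, and the helpers `make_rows`, `best_of` and `run` are hypothetical names, so the absolute numbers will not match the ones above.

```python
import csv
import os
import pickle
import random
import tempfile
import timeit
from datetime import datetime, timedelta


def make_rows(n=50_000):
    """n rows of (timestamp, float, float), loosely mirroring the benchmark frame."""
    start = datetime(2013, 1, 1)
    rng = random.Random(0)
    return [(start + timedelta(minutes=i), rng.gauss(0, 1), rng.gauss(0, 1))
            for i in range(n)]


def write_pickle(rows, path):
    with open(path, "wb") as fh:
        pickle.dump(rows, fh, protocol=pickle.HIGHEST_PROTOCOL)


def write_csv(rows, path):
    with open(path, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)


def best_of(func, *args, repeat=3, number=1):
    """Best per-call time in milliseconds over `repeat` runs."""
    times = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(times) / number * 1000.0


def run(n=50_000):
    rows = make_rows(n)
    with tempfile.TemporaryDirectory() as tmp:
        return {
            "pickle": best_of(write_pickle, rows, os.path.join(tmp, "foo.pkl")),
            "csv": best_of(write_csv, rows, os.path.join(tmp, "foo.csv")),
        }


if __name__ == "__main__":
    for name, ms in run().items():
        print("%-8s %.2f ms" % (name, ms))
```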

In [1]: df = DataFrame(randn(10,2),
   ...:                      columns=list('AB'),
   ...:                      index=date_range('20130101',periods=10))

In [2]: pd.to_msgpack('foo.msg',df)

In [3]: pd.read_msgpack('foo.msg')
Out[3]: 
                   A         B
2013-01-01  0.676700 -1.702599
2013-01-02 -0.070164 -1.368716
2013-01-03 -0.877145 -1.427964
2013-01-04 -0.295715 -0.176954
2013-01-05  0.566986  0.588918
2013-01-06 -0.307070  1.541773
2013-01-07  1.302388  0.689701
2013-01-08  0.165292  0.273496
2013-01-09 -3.492113 -1.178075
2013-01-10 -1.069521  0.848614

jreback · 2013-06-10T20:16:56Z

latest commit cdc870d basically puts float64/int64 directly in the msgpack pack loop, so it's getting better
I think to do much better we need to actually write out the bytes of an array all at once (rather than 1 func call per element)
hence we basically need to write: msgpack_pack_double_array

packers_write_msgpack                        |  25.4264 |  41.4937 |   0.6128 |
packers_write_hdf_store                      |  11.0367 |  11.3930 |   0.9687 |
packers_write_pickle                         |   6.8083 |   6.8933 |   0.9877 |
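The difference between one call per element and writing the array's bytes all at once can be sketched in pure Python. This is only an illustration of the idea behind `msgpack_pack_double_array`, not the Cython code in this PR; the function names are hypothetical, and the bulk layout (a `bin 32` header followed by the raw native-endian buffer) is an assumed encoding chosen for the example.

```python
import struct
from array import array


def pack_doubles_one_by_one(values):
    """msgpack array 32 of float 64 items: one struct.pack call per element."""
    out = [b"\xdd" + struct.pack(">I", len(values))]  # array 32 header
    for v in values:
        out.append(b"\xcb" + struct.pack(">d", v))    # marker + 8 big-endian bytes
    return b"".join(out)


def pack_doubles_bulk(values):
    """Assumed bulk layout: bin 32 header, then the whole buffer in one write."""
    raw = array("d", values).tobytes()                # native byte order
    return b"\xc6" + struct.pack(">I", len(raw)) + raw


def unpack_doubles_bulk(payload):
    """Inverse of pack_doubles_bulk (same machine, same byte order)."""
    if payload[:1] != b"\xc6":
        raise ValueError("expected a bin 32 payload")
    (size,) = struct.unpack(">I", payload[1:5])
    result = array("d")
    result.frombytes(payload[5:5 + size])
    return result.tolist()


values = [0.5, -1.25, 3.0]
print(len(pack_doubles_one_by_one(values)), len(pack_doubles_bulk(values)))
```

Besides saving one marker byte per element, the bulk form replaces N function calls with a single buffer copy, which is where most of the speedup would be expected to come from.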

jreback · 2013-06-10T23:23:41Z

getting closer.......was dumbly serializing indices....

so now float64 and int64 ndarrays are serialized pretty fast...and that's the basis for most data
and indices.....

reading is still somewhat slow....next step is to see if we can create a custom type

packers_write_msgpack                        |   16.0396 |   41.7344 |   0.3843 |
packers_write_csv                            | 1516.5110 | 1541.2530 |   0.9839 |
packers_write_hdf_store                      |   11.0253 |   11.1123 |   0.9922 |
packers_write_pickle                         |    6.5717 |    6.5117 |   1.0092 |
packers_write_hdf_table                      |  424.1536 |  410.7641 |   1.0326 |

wesm · 2013-06-10T23:34:36Z

I think this needs to wait til 0.12 and see some more work (esp on performance)

jreback · 2013-06-10T23:47:47Z

that's fine

next step is prob a custom type otherwise reading is pretty slow

jreback · 2013-07-24T23:18:45Z

@wesm
this is also ready to go....but when you have a chance, if you want to take a stab at some perf improvements...latest figures are at the top

wesm · 2013-07-25T17:43:47Z

Do we know what's contributing to the slow read performance on msgpack'd Series?

jreback · 2013-07-25T17:47:52Z

it's processing it one structure at a time, e.g. read next structure, then decode

the convert/unconvert functions are in Python so there is room to put them in Cython (I tried that and it didn't really change much...but maybe I didn't do it correctly)

ndarrays are basically stored as lists, so calling tolist.
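To make the read-side cost concrete, here is a small stdlib sketch contrasting the two decode paths: walking a list-style payload one structure at a time versus loading a contiguous buffer in one call. It is an assumption-based illustration (hypothetical function names, `array` standing in for an ndarray), not the reader in this PR.

```python
import struct
import timeit
from array import array

values = [i * 0.5 for i in range(10_000)]

# List-style layout: every element is its own msgpack float 64 (0xcb + 8 bytes).
elementwise = b"".join(b"\xcb" + struct.pack(">d", v) for v in values)
# Contiguous layout: the raw native-endian buffer of the same values.
contiguous = array("d", values).tobytes()


def decode_elementwise(buf):
    """Read next structure, then decode, repeated once per element."""
    out = []
    for offset in range(0, len(buf), 9):
        # skip the one-byte marker, then decode a single double
        out.append(struct.unpack_from(">d", buf, offset + 1)[0])
    return out


def decode_contiguous(buf):
    """Load the whole buffer with a single call."""
    arr = array("d")
    arr.frombytes(buf)
    return arr


if __name__ == "__main__":
    t_elem = min(timeit.repeat(lambda: decode_elementwise(elementwise), repeat=3, number=5))
    t_bulk = min(timeit.repeat(lambda: decode_contiguous(contiguous), repeat=3, number=5))
    print("elementwise: %.4fs  contiguous: %.4fs" % (t_elem, t_bulk))
```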

jreback · 2013-09-10T14:40:05Z

@wesm have you had a chance to look at this? any real perf gain will have to come from defining a numpy type in msgpack I think....add to 0.13 to get tires kicked? (could mark as experimental)

jreback · 2013-09-10T14:40:47Z

@y-p @hayd @cpcloud @jtratner thoughts?

wesm · 2013-09-10T19:31:38Z

well, the concern would be stability of the format, do you have a sense of that? I haven't had a chance to look closely yet

jreback · 2013-09-10T19:40:16Z

I believe the format itself is pretty stable. What needs to be done to get some really good performance is extend the msgpack spec via the extension types: https://github.com/msgpack/msgpack/blob/master/spec.md#types-extension-type, prob just for numpy arrays. So that these types become more efficient to read.

Not sure about the time scale for 0.13. But even if the format is subsequently changed (e.g. more/better extension types are added), I don't think back compat is that hard to maintain.
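For reference, the extension-type framing in the msgpack spec is simple enough to sketch by hand: an `ext 8`/`ext 16`/`ext 32` header (markers `0xc7`/`0xc8`/`0xc9`), a length, a one-byte signed type code, then the payload. The sketch below is illustrative only. The type code `FLOAT64_ARRAY = 1` and the little-endian payload layout are assumptions made for the example, not something defined by this PR, and the `fixext` forms are skipped for brevity.

```python
import struct

FLOAT64_ARRAY = 1  # hypothetical application-defined type code (0..127)


def pack_ext(type_code, data):
    """Wrap raw bytes in a msgpack ext 8/16/32 frame."""
    n = len(data)
    if n < 2 ** 8:
        header = struct.pack(">BB", 0xC7, n)   # ext 8
    elif n < 2 ** 16:
        header = struct.pack(">BH", 0xC8, n)   # ext 16
    else:
        header = struct.pack(">BI", 0xC9, n)   # ext 32
    return header + struct.pack("b", type_code) + data


def unpack_ext(payload):
    """Return (type_code, data) from an ext 8/16/32 frame."""
    marker = payload[0]
    if marker == 0xC7:
        (n,), start = struct.unpack_from(">B", payload, 1), 2
    elif marker == 0xC8:
        (n,), start = struct.unpack_from(">H", payload, 1), 3
    elif marker == 0xC9:
        (n,), start = struct.unpack_from(">I", payload, 1), 5
    else:
        raise ValueError("not an ext 8/16/32 frame")
    (type_code,) = struct.unpack_from("b", payload, start)
    return type_code, payload[start + 1:start + 1 + n]


def pack_float64_array(values):
    # Assumed payload layout: little-endian doubles, back to back.
    return pack_ext(FLOAT64_ARRAY, struct.pack("<%dd" % len(values), *values))


def unpack_float64_array(payload):
    type_code, data = unpack_ext(payload)
    if type_code != FLOAT64_ARRAY:
        raise ValueError("unexpected ext type %d" % type_code)
    return list(struct.unpack("<%dd" % (len(data) // 8), data))


print(unpack_float64_array(pack_float64_array([1.0, 2.5, -3.0])))
```

A reader that does not know the type code can still skip the frame using the length, which is part of why adding extension types later need not break older files.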

drasch · 2013-09-18T20:28:32Z

see msgpack_numpy https://github.com/lebedov/msgpack_numpy

jreback · 2013-09-18T20:29:55Z

@drasch already incorporated that in here. What I am talking about is cythonizing the numpy array itself.

jreback · 2013-09-30T23:17:19Z

@wesm @jtratner @cpcloud clamor for this in 0.13? defer?

wesm · 2013-09-30T23:20:42Z

I'd like to dig through the serialization performance issue. Can we merge/ship this and simply note in the docstrings / docs that the binary format should not be expected to be stable for a while?

wesm · 2013-09-30T23:20:56Z

Not sure when I'll get to the digging though, but I want to.

jreback · 2013-09-30T23:22:47Z

sure....mark as experimental (not that stops anyone from complaining when it breaks/changes......)

jreback · 2013-10-01T00:30:00Z

@jtratner @cpcloud this is big (but straightforward), prob needs some more tests, but anything glaring?

jtratner · 2013-10-01T00:31:24Z

I'll take a look soon. Can you run something like flake8 over the Python
parts?

jtratner · 2013-10-01T00:33:29Z

Is the kwarg really supposed to be unicode_erros like it says in the docstring?

jreback · 2013-10-01T00:40:28Z

that looks like a typo from the original msgpack code (@wesm incorporated it here), but mostly just copy/paste
(should be unicode_errors)

jreback · 2013-10-01T00:43:38Z

pep8d the python files...

jtratner · 2013-10-01T01:05:34Z

Pep8 is good, but pyflakes or pylint check for undefined variables,
redefinition of variables, which can be helpful

jtratner · 2013-10-01T01:08:22Z

Would it make sense to make the baseclass for UnpackException IOError or something like that?

These are definitely trivial notes. Still looking at it for more substantive things.

jreback · 2013-10-01T01:15:06Z

yep....IOError sounds good
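A minimal sketch of what the IOError base class buys callers, assuming a hierarchy along these lines (the subclass names and the `read_float64` helper are illustrative, not the code in this PR):

```python
import struct


class UnpackException(IOError):
    """Base class for unpacking failures; catching IOError also catches these."""


class OutOfData(UnpackException):
    """The buffer ended before a complete object was read."""


class ExtraData(UnpackException):
    """Bytes were left over after a complete object was read."""


def read_float64(buf):
    """Decode exactly one msgpack float 64 (0xcb + 8 big-endian bytes)."""
    if len(buf) < 9:
        raise OutOfData("need 9 bytes, got %d" % len(buf))
    if len(buf) > 9:
        raise ExtraData("%d trailing bytes" % (len(buf) - 9))
    if buf[0] != 0xCB:
        raise UnpackException("unexpected marker 0x%02x" % buf[0])
    return struct.unpack(">d", buf[1:])[0]


try:
    read_float64(b"\xcb\x00")
except IOError as exc:  # generic I/O handling code keeps working
    print("read failed:", exc)
```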

cpcloud · 2013-10-01T01:24:16Z

doc/source/v0.13.0.txt

-    .. ipython:: python
+  .. warning::
+
+     Since this is EXPERIMENTAL LIBRARY, the storage format may not be stable until a future release.


need an "an" after "this is"

jtratner · 2013-10-01T02:29:56Z

pandas/msgpack.pyx

+
+    cdef inline pack_pair(self, object k, object v, int nest_limit):
+        ret = self._pack(k, nest_limit-1)
+        if ret != 0: raise Exception("cannot pack : %s" % k)


can we change this to UnpackException or something? I know I'm very anti-Exception.

…das use cases, per pandas-dev#3814 and others

DOC: install.rst mention
DOC: added license from msgpack_numpy
PERF: changed Timestamp and DatetimeIndex serialization for speedups
add vb_suite benchmarks
ENH: added to_msgpack method in generic.py, and default import into pandas
TST: all packers to always be imported, fail on usage with no msgpack installed
DOC: added mentions in release notes, v0.11.1, basics
ENH: provide automatic list if multiple args passed to to_msgpack
DOC: changed docs to 0.12
ENH: iterator support for stream unpacking
Conflicts: RELEASE.rst
ENH: added support for Panel,SparseSeries,SparseDataFrame,SparsePanel,IntIndex,BlockIndex
ENH: handle np.datetime64,np.timedelta64,date,timedelta types
TST: added compression (zlib/blosc) via big hack
DOC: moved back to 0.11.1 docs
BLD: integrated with built-in msgpack
DOC: io.rst fixes
PERF: update vb_suite for packers
TST: fix for test_list_float_complex test?
PERF: prototype for packing faster
PERF: was still using tolist on indicies
DOC: v0.13.0.txt and release notes
DOC: release notes
PERF: revamples packers vbench to use packers,csv,pickle,hdf_store,hdf_table
TST: better test comparisons for numpy types
BLD: py3k compat

TST: removed pytest in favor of nosetest for tests/test_msgpack

jreback · 2013-10-01T13:54:06Z

bombs away

ENH: support for msgpack serialization/deserialization

This was referenced Jun 10, 2013

ENH: support for msgpack serialization/deserialization #3525

Closed

ENH: add compression / direct calls to inline-msgpack #3832

Closed

jreback mentioned this pull request Sep 22, 2013

Add msgpack as submodule for pandas #3828

Closed

cpcloud reviewed Oct 1, 2013

jtratner reviewed Oct 1, 2013

wesm and others added 4 commits October 1, 2013 09:13

ENH: ship msgpack as pandas submodule to enable customization for pan…

1501356

…das use cases, per pandas-dev#3814 and others

BLD: py3 compat

bac7817

TST: removed pytest in favor of nosetest for tests/test_msgpack

CLN: autopep8 packers.py/test_packers.py

80651ca

jreback added a commit that referenced this pull request Oct 1, 2013

Merge pull request #3831 from jreback/msgpack3

16d03b7

ENH: support for msgpack serialization/deserialization

jreback merged commit 16d03b7 into pandas-dev:master Oct 1, 2013
