PERF: msgpack encoding changes to use to/from string for speed boosts #5498

Merged 2 commits into pandas-dev:master on Nov 13, 2013

jreback commented Nov 12, 2013

API: disable sparse structure encodings and unicode indexes

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
packers_read_pack                            |   3.5113 |  17.8316 |   0.1969 |
packers_read_pickle                          |   0.6769 |   0.6770 |   0.9999 |
packers_write_pack                           |   1.8814 |   3.5230 |   0.5340 |
packers_write_pickle                         |   1.5387 |   1.4664 |   1.0493 |

packers_write_hdf_store                      |  12.1900 |  12.2033 |   0.9989 |
packers_read_csv                             |  52.3347 |  52.2310 |   1.0020 |
packers_write_csv                            | 536.2056 | 526.5187 |   1.0184 |
packers_write_hdf_table                      |  33.3436 |  32.4137 |   1.0287 |
packers_read_hdf_store                       |   8.3120 |   8.0493 |   1.0326 |
packers_read_hdf_table                       |  13.9607 |  12.9707 |   1.0763 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [381e86b] : PERF: msgpack encoding changes to use to/from string for speed boosts
API: disable sparse structure encodings and unicode indexes
Base   [46008ec] : DOC: add cookbook entry

In [1]: from pandas.io.packers import pack

In [2]: import cPickle as pkl

In [3]: df = pd.DataFrame(np.random.rand(1000, 100))

In [6]: %timeit buf = pack(df)
1000 loops, best of 3: 492 µs per loop

In [7]: %timeit buf = pkl.dumps(df,pkl.HIGHEST_PROTOCOL)
1000 loops, best of 3: 681 µs per loop

In [8]: df = pd.DataFrame(np.random.rand(100000, 100))

In [11]:  %timeit buf = pack(df)
1 loops, best of 3: 184 ms per loop

In [12]: %timeit buf = pkl.dumps(df,pkl.HIGHEST_PROTOCOL)
10 loops, best of 3: 111 ms per loop

now pretty competitive with pickle

note on bigger frames, writing an in-memory hdf file is quite fast

In [3]: def f(x):
   ...:     store = pd.HDFStore('test.h5',mode='w',driver='H5FD_CORE',driver_core_backing_store=0)
   ...:     store['df'] = x
   ...:     store.close()
   ...:     

In [11]: df = pd.DataFrame(np.random.rand(100000, 100))

In [13]: %timeit -n 5 buf = pack(df)
5 loops, best of 3: 202 ms per loop

In [14]: %timeit -n 5 buf = pkl.dumps(df,pkl.HIGHEST_PROTOCOL)
5 loops, best of 3: 115 ms per loop

In [15]: %timeit -n 5 f(df)
5 loops, best of 3: 53.9 ms per loop

jreback commented Nov 12, 2013

pretty competitive now with pickle (and better for smaller cases)

@wesm going to throw this in 0.13

dragoljub commented

Very excited about this. I'm using msgpackRPC for some remote Analytics. Looks perfect.

jreback commented Nov 12, 2013

@dragoljub gr8!

should df.to_msgpack() just return a string? (like to_json does)

right now it will complain as it needs a file

jreback added a commit that referenced this pull request Nov 13, 2013
PERF: msgpack encoding changes to use to/from string for speed boosts
jreback merged commit 3239b29 into pandas-dev:master Nov 13, 2013
jreback commented Nov 13, 2013

I think I can allow that. The issue is that we allow this too:

pd.to_msgpack(path_or_buf, *args)

so I have to have a None for path_or_buf here to return a string

but df.to_msgpack() is easy
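
The "None means return a string" convention discussed here can be sketched with the standard library. This is only an illustration of the pattern, not the pandas implementation: `to_bytes_or_file` is a hypothetical name, and `pickle` stands in for the msgpack encoder.

```python
import io
import pickle


def to_bytes_or_file(obj, path_or_buf=None):
    """Hypothetical helper: return bytes when path_or_buf is None,
    otherwise write to the given buffer or path and return None."""
    # pickle is a stand-in serializer for this sketch
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    if path_or_buf is None:
        # no target given: hand the encoded bytes back to the caller
        return payload
    if hasattr(path_or_buf, "write"):
        # file-like object: write into it directly
        path_or_buf.write(payload)
        return None
    # otherwise treat the argument as a filesystem path
    with open(path_or_buf, "wb") as fh:
        fh.write(payload)
    return None
```

With this shape, a call without arguments returns the encoded bytes, while passing an `io.BytesIO()` or a path writes the same bytes to that target.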

wesm commented Nov 14, 2013

Holy moly no wonder it was slow before.

jreback commented Nov 14, 2013

yep... writing a string is the fastest path in msgpack: it's a length followed by a single write. Writing an array is quite time-consuming because it's a separate call for each number, even though it's exactly the same amount of data, just encoded differently. numpy must have a pretty fast tostring
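
A rough sketch of that difference, using only the standard library. It is not the msgpack or pandas code: `encode_per_element`, `encode_as_blob`, and `decode_blob` are hypothetical names, the tag bytes are illustrative, and `array.tobytes` stands in for numpy's `tostring`.

```python
import struct
from array import array


def encode_per_element(values):
    """Encode each float on its own: one tag byte plus 8 bytes per number."""
    out = bytearray()
    out += struct.pack(">BI", 0xDD, len(values))  # illustrative header: tag + element count
    for v in values:
        out += struct.pack(">Bd", 0xCB, v)        # one pack call per number
    return bytes(out)


def encode_as_blob(values):
    """Encode the whole buffer as one string: a length prefix, then a single write."""
    raw = array("d", values).tobytes()            # contiguous native-endian doubles
    return struct.pack(">BI", 0xDB, len(raw)) + raw


def decode_blob(buf):
    """Reverse encode_as_blob: read the length, then load the doubles in one step."""
    _tag, nbytes = struct.unpack_from(">BI", buf, 0)
    out = array("d")
    out.frombytes(buf[5:5 + nbytes])
    return list(out)
```

The blob form does one length write and one data write no matter how many numbers there are, while the per-element form makes a call for every value and carries an extra tag byte each time.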

androane commented Oct 26, 2021

How do you unpack?

from pandas.io.packers import pack, unpack

unpack(pack(df))

doesn't work: AttributeError: 'bytes' object has no attribute 'read'
