ENH: always make ndarrays from msgpack writable #12359
Conversation
```python
raise ValueError("compress must be one of 'zlib' or 'blosc'")

values = decompress(values)
if sys.getrefcount(values) == 2:
```
where did you get this beauty?
I was just playing around with frombuffer and thinking of ways to avoid the copy here.
Why can't we always just use `np.frombuffer(buffer(values), dtype=...)` and then just set the writeable flag?
The issue is that we will be working with memory allocated for a `bytes` object. For example:
```python
In [1]: buf = bytes(range(4))

In [2]: np.frombuffer(buf, dtype='uint8')
Out[2]: array([0, 1, 2, 3], dtype=uint8)

In [3]: _2.flags.writeable = True

In [4]: _2[0] = 5

In [5]: buf
Out[5]: b'\x05\x01\x02\x03'
```
The only case where we can just use the buffer is when we are the sole owner of values and no one can reference values
again. In the other cases we need to just copy the data so we can mutate while preserving normal python semantics of immutable strings.
I see. Is this really a bottleneck, e.g. the copying? Does the compression help?
The copying might be restrictive if you are unpacking a large dataframe because your high watermark is now 2n. We are unpacking data on a shared box so I would like to keep the memory usage low if possible. I also imagine that users who are using compression are receiving a large amount of data.
I actually think the ideal fix would be for blosc and zlib to have unpack functions that give back bytearrays so that we can do this in a safer way. I will investigate if this is already available.
it looks like there is no good way to get the compressors to return bytearrays in their current state.
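The constraint can be seen with the stdlib `zlib` module; this sketch shows why returning `bytes` forces a copy if a writable array is wanted:

```python
import zlib

import numpy as np

raw = bytes(range(4))
decompressed = zlib.decompress(zlib.compress(raw))  # an immutable bytes object
arr = np.frombuffer(decompressed, dtype='uint8')
print(arr.flags.writeable)   # False: backed by read-only memory

# Copying into a bytearray first yields a mutable buffer, at the cost of a copy.
arr2 = np.frombuffer(bytearray(decompressed), dtype='uint8')
print(arr2.flags.writeable)  # True
arr2[0] = 5                  # safe: mutates only the private copy
```

If the decompressors could hand back a `bytearray` directly, the second `np.frombuffer` call would work without the extra copy.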
Force-pushed from c01925a to 2a34d40.
cc @kawochen
```
@@ -57,6 +59,20 @@ def check_arbitrary(a, b):
    assert(a == b)


@contextmanager
```
isn't this like a `unittest.mock.Mock` object? (well, what you are doing below)
I thought that was only added to the stdlib in py3
python magic! there might be performance warnings during debugging when the debugger holds more references
I think that warning while debugging is unfortunate but correct. Do you think that it would be worth amending the warning to add "or you are in a debugger"?
I'm not sure what the best thing to do is. In general I believe relying on the ref count is sketchy (e.g. the count could be incremented between function calls), but then I'm only minimally familiar with the C-API. Is it possible that, although the ref count is 2, the string is interned during the query?
@kawochen I am updating the PR with a new implementation that I think is both safer and faster (or at least the same speed). With this new implementation:

```python
In [34]: pd.read_msgpack(df.to_msgpack(compress='blosc'))._data.blocks[0].values.base.base
Out[34]: <memory at 0x7f71f8e81750>
```

However, with the old implementation the base was the original `bytes` object.

Here are profiling runs to show the differences. These cells are used for all three profiling runs:

```python
In [1]: df = pd.DataFrame({'a': np.arange(1000000), 'b': np.arange(1000000)})

In [2]: packed = df.to_msgpack(compress='blosc')
```

Using `np.fromstring` to copy the data (this is just shown to make it clear why we cannot do this):

```python
In [3]: %timeit pd.read_msgpack(packed)
10 loops, best of 3: 168 ms per loop
```

Old implementation:

```python
In [3]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 3.06 ms per loop
```

Updated implementation:

```python
In [5]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 2.68 ms per loop
```
```
Notes
-----
If you want to use this function you are probably wrong.
```
gr8 comment!
Ugh, I forgot that
This is on master (not your branch)
blosc >> zlib
Force-pushed from e513912 to 9c4a29b.
This is on 0.17.1
How did you get the above?
I am amazed on how fast
sorry, the really slow run was what would happen if we copied the data after doing
Force-pushed from 9c4a29b to 8a23b22.
ping, how do you feel about this?
no trace of
hype
what triggers the error on master?
```python
values = blosc.decompress(values)
return np.frombuffer(values, dtype=dtype)

if compress:
    if compress == u'zlib':
```
I would try to import these at the top of the file and just set variables (`_ZLIB_INSTALLED`, `_BLOSC_INSTALLED`); and you can do whatever imports you need there.
done, added `_check_*` functions to raise a ValueError here if you try to use a compressor that is not installed
Force-pushed from 8a23b22 to 7be12a1.
Updated based on feedback. I also added standalone tests for
IIRC this DID work on windows before.
That is saying that my definition for
Force-pushed from 7be12a1 to 7d6d545.
Addresses the case where 'compress' was not None. The old implementation would decompress the data and then call np.frombuffer on a bytes object. Because a bytes object is not a mutable buffer, the resulting ndarray had writeable=False. The new implementation ensures that pandas is the only owner of this new buffer and then sets it to mutable without copying it. This means that we don't need to do a copy of the data coming in AND we can mutate it later. If we are not the only owner of this data then we just copy it with np.fromstring.
In many C-API functions for Python, the empty string and single characters are memoized to return preallocated bytes objects. This could be a potential problem with the implementation of pandas.io.packers.unconvert so we are adding a test for this explicitly. Currently neither zlib nor blosc hit this case because they use PyBytes_FromStringAndSize(NULL, n) which does not pull from the shared pool of bytes objects; however, this test will guard us against changes to this implementation detail in the future.
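The caching behavior that motivates this test can be observed directly in CPython (a small sketch):

```python
import sys

# CPython keeps a single shared object for the empty bytes (and caches
# 1-byte bytes objects), so a refcount-based "sole owner" check can never
# succeed for them and the code must fall back to copying.
empty = bytes()
print(empty is b'')                # True: the same shared object
print(sys.getrefcount(empty) > 2)  # True: many live references exist
```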
Pulls the logic that does the buffer management into a new private function named `_move_into_mutable_buffer`. The function acts like a move constructor from `bytes` to `memoryview`. We now only call `np.frombuffer` on the resulting `memoryview` object. This accomplishes the original task of getting a mutable array from a `bytes` object without copying with the added safety of making the `base` field of the array that is returned to the user a mutable `memoryview`. This eliminates the risk of users obtaining the original `bytes` object which is now pointing to a mutable buffer and violating the immutability contract of `bytes` objects.
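The effect on the array's `base` can be sketched with a plain `bytearray` standing in for the moved buffer (illustrative only; the real function moves a `bytes` object's storage rather than copying into a `bytearray`):

```python
import numpy as np

buf = bytearray(b'\x00\x01\x02\x03')    # stand-in for the moved buffer
mv = memoryview(buf)
arr = np.frombuffer(mv, dtype='uint8')  # arr's base keeps the memoryview alive

print(arr.flags.writeable)  # True: the underlying buffer is mutable

arr[0] = 5                  # mutations go straight to the underlying buffer
print(buf[0])               # 5
```

Since the user only ever sees the `memoryview`, there is no way to reach an object whose immutability contract the mutation would violate.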
Force-pushed from 7d6d545 to e896603.
By writing our move function in C we can hide the original bytes object from the user while still ensuring that the lifetime is managed correctly. This implementation is designed to make it impossible to get access to the invalid bytes object from pure python.
This function temporarily replaces an attribute on an object for the duration of some context manager. This is useful for patching methods to inject assertions or mock out expensive operations.
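Such a helper might look like this (the name `patch_attr` is hypothetical; the PR's version lives in the test module):

```python
from contextlib import contextmanager


@contextmanager
def patch_attr(obj, name, new_value):
    """Temporarily replace ``obj.name`` with ``new_value``.

    The original attribute is restored even if the body raises.
    """
    old = getattr(obj, name)
    setattr(obj, name, new_value)
    try:
        yield
    finally:
        setattr(obj, name, old)
```

Typical usage in a test would be `with patch_attr(module, 'expensive_call', fake): ...`, so the expensive operation is mocked out only inside the `with` block.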
Read some docs about windows dev and pushed a potential fix. Things still behave normally for me with gcc on gnu+linux, hopefully this builds on windows. Do you have an appveyor build I can check?
you could set this up with appveyor (just create an account); all of the setup will build on your PRs, but it's VERY slow. I do it with a local VM.
```python
class TestMove(tm.TestCase):
    def test_more_than_one_ref(self):
        """Test case for when we try to use ``move_into_mutable_buffer`` when
        the object being moved has other references.
```
it's funny, when you use a doc-string it calls the test by this. I guess that is right.
Should I just make this a comment instead or do you not mind?
ok this passed for me on windows! merging
side issue, should prob add some more perf tests with msgpack to cover the bases.
Should I do that in this pr or open another pr with perf tests? Also are those in
This is amazing how fast blosc makes this. no we are using
@llllllllll thanks!!
had to make a trivial modification: jreback@6bba06c we don't have blosc installed (on purpose) on some of the windows builds. I guess this actually failed travis (but I merged before) :< |
only fails on 2.7 which we were testing with blosc installed. fixed up. |
oh, sorry about that. I guess import error is probably more indicative of the problem |