ENH: always make ndarrays from msgpack writable #12359
Conversation
```python
raise ValueError("compress must be one of 'zlib' or 'blosc'")

values = decompress(values)
if sys.getrefcount(values) == 2:
```
where did you get this beauty?
I was just playing around with frombuffer and thinking of ways to avoid the copy here.
Why can't we always just use `np.frombuffer(buffer(values), dtype=...)` and then just set the writeable flag?
The issue is that we will be working with memory allocated for a `bytes` object. For example:
```python
In [1]: buf = bytes(range(4))

In [2]: np.frombuffer(buf, dtype='uint8')
Out[2]: array([0, 1, 2, 3], dtype=uint8)

In [3]: _2.flags.writeable = True

In [4]: _2[0] = 5

In [5]: buf
Out[5]: b'\x05\x01\x02\x03'
```
The only case where we can just use the buffer is when we are the sole owner of values and no one can reference values
again. In the other cases we need to just copy the data so we can mutate while preserving normal python semantics of immutable strings.
I see. Is this really a bottleneck, e.g. the copying? Does the compression help?
The copying might be restrictive if you are unpacking a large dataframe because your high watermark is now 2n. We are unpacking data on a shared box so I would like to keep the memory usage low if possible. I also imagine that users who are using compression are receiving a large amount of data.
I actually think the ideal fix would be for blosc and zlib to have unpack functions that give back bytearrays so that we can do this in a safer way. I will investigate if this is already available.
it looks like there is no good way to get the compressors to return bytearrays in their current state.
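The constraint can be seen with the stdlib `zlib` module; this sketch shows why returning `bytes` forces a copy if a writable array is wanted:

```python
import zlib

import numpy as np

raw = bytes(range(4))
decompressed = zlib.decompress(zlib.compress(raw))  # an immutable bytes object
arr = np.frombuffer(decompressed, dtype='uint8')
print(arr.flags.writeable)   # False: backed by read-only memory

# Copying into a bytearray first yields a mutable buffer, at the cost of a copy.
arr2 = np.frombuffer(bytearray(decompressed), dtype='uint8')
print(arr2.flags.writeable)  # True
arr2[0] = 5                  # safe: mutates only the private copy
```

If the decompressors could hand back a `bytearray` directly, the second `np.frombuffer` call would work without the extra copy.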
Force-pushed from c01925a to 2a34d40.
cc @kawochen
```
@@ -57,6 +59,20 @@ def check_arbitrary(a, b):
    assert(a == b)


@contextmanager
```
isn't this like a `unittest.mock.Mock` object? (well, what you are doing below)
I thought that was only added to the stdlib in py3
python magic! there might be performance warnings during debugging when the debugger holds more references
I think that warning while debugging is unfortunate but correct. Do you think that it would be worth amending the warning to add "or you are in a debugger"?
I'm not sure what the best thing to do is. In general I believe relying on the ref count is sketchy (e.g. the count could be incremented between function calls), but then I'm only minimally familiar with the C-API. Is it possible that, although the ref count is 2, the string is interned during the query?
@kawochen I am updating the PR with a new implementation that I think is both safer and faster (or at least the same speed). With this new implementation:

```python
In [34]: pd.read_msgpack(df.to_msgpack(compress='blosc'))._data.blocks[0].values.base.base
Out[34]: <memory at 0x7f71f8e81750>
```

However, with the old implementation the base was the original `bytes` object.

Here are profiling runs to show the differences. These cells are used for all three profiling runs:

```python
In [1]: df = pd.DataFrame({'a': np.arange(1000000), 'b': np.arange(1000000)})

In [2]: packed = df.to_msgpack(compress='blosc')
```

Using `np.fromstring` to copy the data (this is just shown to make it clear why we cannot do this):

```python
In [3]: %timeit pd.read_msgpack(packed)
10 loops, best of 3: 168 ms per loop
```

Old implementation:

```python
In [3]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 3.06 ms per loop
```

Updated implementation:

```python
In [5]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 2.68 ms per loop
```
```
Notes
-----
If you want to use this function you are probably wrong.
```
gr8 comment!
Ugh, I forgot that
This is on master (not your branch)
blosc >> zlib
Force-pushed from e513912 to 9c4a29b.
This is on 0.17.1
How did you get the above?
I am amazed on how fast
sorry, the really slow run was what would happen if we copied the data after doing
Force-pushed from 9c4a29b to 8a23b22.
ping, how do you feel about this?
no trace of
hype
what triggers the error on master?
```python
values = blosc.decompress(values)
return np.frombuffer(values, dtype=dtype)

if compress:
    if compress == u'zlib':
```
I would try to import these at the top of the file and just set variables (`_ZLIB_INSTALLED`, `_BLOSC_INSTALLED`); and you can do whatever imports you need there.
done, added `_check_*` functions to raise a ValueError here if you try to use a compressor that is not installed
Force-pushed from 8a23b22 to 7be12a1.
Updated based on feedback. I also added standalone tests for
IIRC this DID work on windows before.
That is saying that my definition for
Force-pushed from 7be12a1 to 7d6d545.
Addresses the case where 'compress' was not None. The old implementation would decompress the data and then call np.frombuffer on a bytes object. Because a bytes object is not a mutable buffer, the resulting ndarray had writeable=False. The new implementation ensures that pandas is the only owner of this new buffer and then sets it to mutable without copying it. This means that we don't need to do a copy of the data coming in AND we can mutate it later. If we are not the only owner of this data then we just copy it with np.fromstring.
In many C-API functions for Python, the empty string and single characters are memoized to return preallocated bytes objects. This could be a potential problem with the implementation of pandas.io.packers.unconvert so we are adding a test for this explicitly. Currently neither zlib nor blosc hit this case because they use PyBytes_FromStringAndSize(NULL, n) which does not pull from the shared pool of bytes objects; however, this test will guard us against changes to this implementation detail in the future.
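The caching behavior that motivates this test can be observed directly in CPython (a small sketch):

```python
import sys

# CPython keeps a single shared object for the empty bytes (and caches
# 1-byte bytes objects), so a refcount-based "sole owner" check can never
# succeed for them and the code must fall back to copying.
empty = bytes()
print(empty is b'')                # True: the same shared object
print(sys.getrefcount(empty) > 2)  # True: many live references exist
```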
Pulls the logic that does the buffer management into a new private function named `_move_into_mutable_buffer`. The function acts like a move constructor from `bytes` to `memoryview`. We now only call `np.frombuffer` on the resulting `memoryview` object. This accomplishes the original task of getting a mutable array from a `bytes` object without copying with the added safety of making the `base` field of the array that is returned to the user a mutable `memoryview`. This eliminates the risk of users obtaining the original `bytes` object which is now pointing to a mutable buffer and violating the immutability contract of `bytes` objects.
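The effect on the array's `base` can be sketched with a plain `bytearray` standing in for the moved buffer (illustrative only; the real function moves a `bytes` object's storage rather than copying into a `bytearray`):

```python
import numpy as np

buf = bytearray(b'\x00\x01\x02\x03')    # stand-in for the moved buffer
mv = memoryview(buf)
arr = np.frombuffer(mv, dtype='uint8')  # arr's base keeps the memoryview alive

print(arr.flags.writeable)  # True: the underlying buffer is mutable

arr[0] = 5                  # mutations go straight to the underlying buffer
print(buf[0])               # 5
```

Since the user only ever sees the `memoryview`, there is no way to reach an object whose immutability contract the mutation would violate.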
Force-pushed from 7d6d545 to e896603.
By writing our move function in C we can hide the original bytes object from the user while still ensuring that the lifetime is managed correctly. This implementation is designed to make it impossible to get access to the invalid bytes object from pure python.
This function temporarily replaces an attribute on an object for the duration of some context manager. This is useful for patching methods to inject assertions or mock out expensive operations.
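Such a helper might look like this (the name `patch_attr` is hypothetical; the PR's version lives in the test module):

```python
from contextlib import contextmanager


@contextmanager
def patch_attr(obj, name, new_value):
    """Temporarily replace ``obj.name`` with ``new_value``.

    The original attribute is restored even if the body raises.
    """
    old = getattr(obj, name)
    setattr(obj, name, new_value)
    try:
        yield
    finally:
        setattr(obj, name, old)
```

Typical usage in a test would be `with patch_attr(module, 'expensive_call', fake): ...`, so the expensive operation is mocked out only inside the `with` block.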
Read some docs about windows dev and pushed a potential fix. Things still behave normally for me with gcc on gnu+linux, hopefully this builds on windows. Do you have an appveyor build I can check?
you could set this up with appveyor (just create an account); all of the setup will build on your PRs, but it's VERY slow. I do it with a local VM.
```python
class TestMove(tm.TestCase):
    def test_more_than_one_ref(self):
        """Test case for when we try to use ``move_into_mutable_buffer`` when
        the object being moved has other references.
```
it's funny, when you use a doc-string it calls the test by this. I guess that is right.
Should I just make this a comment instead or do you not mind?
ok this passed for me on windows! merging
side issue, should prob add some more perf tests with msgpack to cover the bases.
Should I do that in this pr or open another pr with perf tests? Also are those in
This is amazing how fast blosc makes this. no we are using
@llllllllll thanks!!
had to make a trivial modification: jreback@6bba06c we don't have blosc installed (on purpose) on some of the windows builds. I guess this actually failed travis (but I merged before) :< |
only fails on 2.7 which we were testing with blosc installed. fixed up. |
oh, sorry about that. I guess import error is probably more indicative of the problem |