Is your feature request related to a problem?
Currently to_pickle has code to work around issue ( #39002 ) (see pandas/io/pickle.py, lines 104 to 109 at c90294d). The result is an extra in-memory copy with pickle protocol 5 for some formats.
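The shape of that workaround is roughly the following sketch (not the exact pandas source; the helper name is hypothetical): serialize the whole object in memory with pickle.dumps, then hand the resulting bytes to the file handle, which costs one extra full copy of the pickled data.

```python
import io
import pickle


def to_pickle_sketch(obj, fileobj, protocol=5):
    # Hypothetical illustration of the workaround described above:
    # materialize the entire pickle stream in memory first, then pass
    # plain bytes to the (possibly compressing) file object. The
    # intermediate bytes object is the extra in-memory copy.
    data = pickle.dumps(obj, protocol=protocol)
    fileobj.write(data)


bio = io.BytesIO()
to_pickle_sketch({"a": 1}, bio)
assert pickle.loads(bio.getvalue()) == {"a": 1}
```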
Describe the solution you'd like
The reason the file compressors fail is that they often assume something bytes-like (though not as general as a memoryview). In particular, they assume the data is 1-D contiguous and of uint8 type. With PickleBuffer's raw method, it is straightforward to construct a memoryview in this format. If the buffer is non-contiguous (it is unclear how often this would come up here), raw will raise a BufferError, though this can be mitigated by falling back to an in-memory copy when that exception occurs.
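To make the raw() behavior concrete, here is a small standard-library sketch showing both the zero-copy path and the BufferError fallback:

```python
import pickle

# Contiguous source: raw() returns a 1-D, uint8 ("B"), C-contiguous
# memoryview without copying the underlying data.
contiguous = pickle.PickleBuffer(b"hello")
view = contiguous.raw()
assert view.ndim == 1 and view.format == "B" and view.c_contiguous

# Non-contiguous source (every other byte of a buffer): raw() raises
# BufferError, so the caller falls back to an in-memory copy.
strided = memoryview(bytes(range(10)))[::2]
noncontiguous = pickle.PickleBuffer(strided)
try:
    data = noncontiguous.raw()
except BufferError:
    data = bytes(noncontiguous)  # fallback copy, same logical contents
assert bytes(data) == bytes(strided)
```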
Given this, one option would be to wrap the write method of the compressor file objects that need this handling. For example, the following would work on Python 3.8+.
from bz2 import BZ2File as _BZ2File
from pickle import PickleBuffer


class BZ2File(_BZ2File):
    def write(self, b):
        if isinstance(b, PickleBuffer):
            try:
                # coerce to 1-D `uint8` C-contiguous `memoryview` zero-copy
                b = b.raw()
            except BufferError:
                # perform in-memory copy if buffer is not contiguous
                b = bytes(b)
        return super(BZ2File, self).write(b)
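As a self-contained demonstration (the class name here is made up, and the wrapper body repeats the sketch above so this example runs on its own), such a wrapped file object can round-trip a protocol-5 pickle through bz2 compression:

```python
import bz2
import io
import pickle


class _PickleBufferBZ2File(bz2.BZ2File):
    # Repeats the wrapper idea above so this example stands alone.
    def write(self, b):
        if isinstance(b, pickle.PickleBuffer):
            try:
                b = b.raw()    # zero-copy 1-D uint8 C-contiguous memoryview
            except BufferError:
                b = bytes(b)   # in-memory copy for non-contiguous buffers
        return super().write(b)


payload = b"x" * (1 << 20)  # large enough to exercise protocol-5 buffer writes
bio = io.BytesIO()
with _PickleBufferBZ2File(bio, "wb") as f:
    pickle.dump(pickle.PickleBuffer(payload), f, protocol=5)

# In-band readonly buffers deserialize back to bytes.
restored = pickle.loads(bz2.decompress(bio.getvalue()))
assert restored == payload
```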
Potentially this could live alongside other custom file objects like this one in pandas.io.common (albeit as private objects), though maybe there are other places that would make sense.
API breaking implications
NA
Already, when protocol 5 is set, data with that protocol is written out in memory before writing to the file; this change would only save the memcpy before writing to the file. Data written before and after this change should remain readable just the same.
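Both the zero-copy path and the fallback copy expose the same bytes, which is why the written stream (and thus readability) is unaffected; a quick standard-library check:

```python
import pickle

pb = pickle.PickleBuffer(b"payload")
# raw() (zero-copy view) and bytes(pb) (fallback copy) yield identical
# contents, so the compressed output is the same either way.
assert bytes(pb.raw()) == bytes(pb) == b"payload"
```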
Describe alternatives you've considered
Ultimately it would be preferable to have this fixed upstream, and to some extent that has already happened ( python/cpython#88605 ). However, Python 3.8, which pandas still supports, is not covered, and users stuck on an earlier patch release of 3.9 lack the fix as well (only 3.9.6+ has it). In these cases, it may make sense to have this workaround to provide the improved efficiency while protecting against this issue.
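One way to apply the workaround only where it is needed is a version gate; the helper name below is hypothetical, and the cutoff follows the 3.9.6 fix mentioned above:

```python
import sys


def needs_picklebuffer_workaround():
    # Hypothetical version gate: CPython shipped the compressor-side fix
    # in 3.9.6 (python/cpython#88605), so earlier 3.9.x releases and all
    # of 3.8 would still need the wrapped write method.
    return sys.version_info < (3, 9, 6)


assert isinstance(needs_picklebuffer_workaround(), bool)
```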
Additional context
NA (included above)