ENH: Always write directly to output in `to_pickle` #46747

jakirkham · 2022-04-12T08:49:22Z

Is your feature request related to a problem?

Currently, to_pickle has the following code to work around an issue ( #39002 ). The result is an extra in-memory copy with pickle protocol 5 for some compression formats.

pandas/pandas/io/pickle.py

Lines 104 to 109 in c90294d

    
    if handles.compression["method"] in ("bz2", "xz") and protocol >= 5:
        # some weird TypeError GH#39002 with pickle 5: fallback to letting
        # pickle create the entire object and then write it to the buffer.
        # "zip" would also be here if pandas.io.common._BytesZipFile
        # wouldn't buffer write calls
        handles.handle.write(pickle.dumps(obj, protocol=protocol))

Describe the solution you'd like

The file compressors fail because they assume their input is bytes-like in a narrow sense (less general than a memoryview). In particular, they assume the data is 1-D, contiguous, and of uint8 type. With PickleBuffer's raw method, it is straightforward to construct a memoryview in this format. If the buffer is non-contiguous (not sure how often this would come up here), raw will raise a BufferError, though this can be mitigated by falling back to an in-memory copy when that exception occurs.
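To make the buffer requirements above concrete, here is a minimal sketch using only the standard library. The helper name `as_flat_bytes` is hypothetical (it is not part of pandas), and the sketch assumes Python 3.8+ so that `pickle.PickleBuffer` is available.

```python
from array import array
from pickle import PickleBuffer


def as_flat_bytes(buf):
    """Return a 1-D unsigned-byte view of `buf`, copying only if needed.

    Hypothetical helper for illustration; not part of pandas.
    """
    try:
        # Zero-copy: a 1-D memoryview with format "B" over the same memory.
        return PickleBuffer(buf).raw()
    except BufferError:
        # Non-contiguous buffers cannot be exposed as raw memory, so copy.
        return bytes(buf)


# Contiguous case: an array of doubles becomes a flat view of its bytes.
values = array("d", [1.0, 2.0, 3.0, 4.0])
flat = as_flat_bytes(values)

# Non-contiguous case: a strided view falls back to an in-memory copy.
strided = memoryview(bytes(range(10)))[::2]
copied = as_flat_bytes(strided)
```

In the contiguous case `len(flat)` counts bytes rather than items, which is the shape of input the compressor file objects expect.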

Given this, one option would be to wrap the write method of the compressor file objects that need this handling. For example, the following would work on Python 3.8+.

from bz2 import BZ2File as _BZ2File
from pickle import PickleBuffer


class BZ2File(_BZ2File):
    def write(self, b):
        if isinstance(b, PickleBuffer):
            try:
                b = b.raw()  # coerce to 1-D `uint8` C-contiguous `memoryview` zero-copy
            except BufferError:
                b = bytes(b)  # perform in-memory copy if buffer is not contiguous
        return super().write(b)

Potentially this could live alongside other custom file objects like this one in pandas.io.common (albeit as private objects). Though maybe there are other places that could make sense.
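Since the workaround code also covers "xz", the same wrapping could presumably be applied to `lzma.LZMAFile`. The sketch below is an assumption along those lines rather than the final pandas implementation: it defines an analogous subclass and round-trips a `PickleBuffer` through it with protocol 5. The `roundtrip` helper and the payload are made up for illustration.

```python
import io
import pickle
from lzma import LZMAFile as _LZMAFile
from pickle import PickleBuffer


class LZMAFile(_LZMAFile):
    def write(self, b):
        if isinstance(b, PickleBuffer):
            try:
                b = b.raw()  # zero-copy 1-D `uint8` view
            except BufferError:
                b = bytes(b)  # copy if the buffer is not contiguous
        return super().write(b)


def roundtrip(obj, protocol=5):
    """Pickle `obj` into an xz-compressed in-memory file and load it back."""
    sink = io.BytesIO()
    with LZMAFile(sink, mode="wb") as f:
        pickle.dump(obj, f, protocol=protocol)
    sink.seek(0)
    with LZMAFile(sink, mode="rb") as f:
        return pickle.load(f)


# A 256 KiB writable payload; pickled in-band, it loads back as a bytearray.
payload = bytearray(bytes(range(256)) * 1024)
restored = roundtrip(PickleBuffer(payload))
```

Pickling goes straight to the file object here, without an intermediate `pickle.dumps` call.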

API breaking implications

NA

With protocol 5, the pickled data is already written out in memory before being written to the file. This change would only save that memcpy. Data written before and after this change should still be readable in the same way.
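As a rough check of the compatibility claim, the sketch below writes the same object through both paths (via `pickle.dumps` and then `write`, and directly via `pickle.dump`) and reads each back. The function names and the sample object are illustrative assumptions; the object deliberately avoids `PickleBuffer` so that the direct path also works on Python versions without the upstream fix.

```python
import bz2
import io
import pickle


def write_via_dumps(obj, protocol=5):
    """Current fallback: build the whole pickle in memory, then write it."""
    sink = io.BytesIO()
    with bz2.BZ2File(sink, mode="wb") as f:
        f.write(pickle.dumps(obj, protocol=protocol))
    return sink.getvalue()


def write_via_dump(obj, protocol=5):
    """Proposed path: let pickle write directly to the compressed file."""
    sink = io.BytesIO()
    with bz2.BZ2File(sink, mode="wb") as f:
        pickle.dump(obj, f, protocol=protocol)
    return sink.getvalue()


def read_back(compressed):
    return pickle.loads(bz2.decompress(compressed))


obj = {"name": "demo", "values": list(range(1000)), "blob": bytearray(100_000)}
old_result = read_back(write_via_dumps(obj))
new_result = read_back(write_via_dump(obj))
```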

Describe alternatives you've considered

Ultimately, it would be preferable to have this fixed upstream. In fact, to some extent that has already happened ( python/cpython#88605 ). However, Python 3.8, which pandas still supports, is not covered. Also, users stuck on an earlier patch version of 3.9 will not have the fix (only 3.9.6+ has it). In these cases, it may make sense to have this workaround to provide the improved efficiency while protecting against this issue.
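If the workaround were only wanted on affected interpreters, one possibility would be a version check. The predicate below is a hypothetical sketch, not something proposed in pandas; the 3.9.6 threshold is taken from the statement above and is not independently verified here.

```python
import sys


def needs_write_workaround(version=None):
    """Return True if the interpreter predates the upstream fix.

    Hypothetical helper: the (3, 9, 6) cutoff is assumed from the issue text.
    """
    if version is None:
        version = sys.version_info[:3]
    return tuple(version) < (3, 9, 6)
```

Applying the wrapper unconditionally is simpler and should be harmless on fixed versions, since `raw()` is zero-copy for contiguous buffers.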

Additional context

NA (included above)

jakirkham added the Enhancement and Needs Triage labels Apr 12, 2022

jakirkham mentioned this issue Apr 12, 2022

BUG:to_pickle() raises TypeError when compressing large dataframe #39002

Closed

3 tasks

simonjayhawkins added the IO Pickle label Apr 12, 2022

mroeschke added the Python 3.8 label and removed the Needs Triage label Jul 6, 2022

mroeschke added the Performance label Aug 11, 2022

jakirkham mentioned this issue Oct 13, 2022

PERF: Improve pickle support with BZ2 & LZMA #49068

Merged

5 tasks

mroeschke closed this as completed in #49068 Oct 21, 2022
