Skip to content

ENH: Always write directly to output in to_pickle #46747

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jakirkham opened this issue Apr 12, 2022 · 0 comments · Fixed by #49068
Closed

ENH: Always write directly to output in to_pickle #46747

jakirkham opened this issue Apr 12, 2022 · 0 comments · Fixed by #49068
Labels
Enhancement IO Pickle read_pickle, to_pickle Performance Memory or execution speed performance

Comments

@jakirkham
Copy link
Contributor

jakirkham commented Apr 12, 2022

Is your feature request related to a problem?

Currently to_pickle has the following code to workaround issue ( #39002 ). The result is an extra in-memory copy with pickle protocol 5 for some formats.

pandas/pandas/io/pickle.py

Lines 104 to 109 in c90294d

if handles.compression["method"] in ("bz2", "xz") and protocol >= 5:
# some weird TypeError GH#39002 with pickle 5: fallback to letting
# pickle create the entire object and then write it to the buffer.
# "zip" would also be here if pandas.io.common._BytesZipFile
# wouldn't buffer write calls
handles.handle.write(pickle.dumps(obj, protocol=protocol))

Describe the solution you'd like

The reason the file compressors fail is they often assume something that is bytes-like (though not as general as a memoryview). In particular they assume the data is 1-D contiguous and of uint8 type. With PickleBuffer's raw method, it is pretty straightforward to construct a memoryview with this format. If the buffer is non-contiguous (not sure how often this would come up here), raw will raise a BufferError. Though this can be mitigated by falling back to an in-memory copy if that exception occurs.

Given this, one option would be to wrap the write method of the compressor file objects that need this handling. For example the following would work on Python 3.8+.

from bz2 import BZ2File as _BZ2File
from pickle import PickleBuffer


class BZ2File(_BZ2File):
    def write(self, b):
        if isinstance(b, PickleBuffer):
            try:
                b = b.raw()  # coerce to 1-D `uint8` C-contiguous `memoryview` zero-copy
            except BufferError:
                b = bytes(b)  # perform in-memory copy if buffer is not contiguous
        return super(BZ2File, self).write(b)

Potentially this could live alongside other custom file objects like this one in pandas.io.common (albeit as private objects). Though maybe there are other places that could make sense.

API breaking implications

NA

Already when protocol 5 is set data with that protocol is written out in memory before writing to the file. This would only save the memcpy before writing to the file. Data written before and after this change should still be readable just the same.

Describe alternatives you've considered

Ultimately it would be preferable to have this fixed upstream. In fact to some extent that has already happened ( python/cpython#88605 ). However Python 3.8 is not covered, which is still supported by Pandas. Also if users are stuck on an earlier patch version of 3.9 (only 3.9.6+ has the fix), they may not have the fix. In these cases, it may make sense to have this workaround to provide the improved efficiency while protecting against this issue.

Additional context

NA (included above)

@jakirkham jakirkham added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 12, 2022
@simonjayhawkins simonjayhawkins added the IO Pickle read_pickle, to_pickle label Apr 12, 2022
@mroeschke mroeschke added Python 3.8 and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 6, 2022
@mroeschke mroeschke added the Performance Memory or execution speed performance label Aug 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Pickle read_pickle, to_pickle Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants