
add compression support for 'read_pickle' and 'to_pickle' #13317


Closed
wants to merge 11 commits
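For context, the change under review adds a compression keyword to to_pickle and read_pickle, with compression inferred from the file extension when the keyword is omitted. Based only on the tests in the diff below (a sketch, not final API documentation), usage would look roughly like this:

import pandas as pd

df = pd.DataFrame({'a': range(5)})

# explicit compression on write and read
df.to_pickle('data.pkl.gz', compression='gzip')
df2 = pd.read_pickle('data.pkl.gz', compression='gzip')

# compression inferred from the extension (.gz, .bz2, .xz)
df.to_pickle('data.pkl.xz')
df3 = pd.read_pickle('data.pkl.xz')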
74 changes: 33 additions & 41 deletions pandas/tests/io/test_pickle.py
@@ -304,47 +304,39 @@ def test_pickle_v0_15_2():
tm.assert_categorical_equal(cat, pd.read_pickle(pickle_path))


class TestPickleCompression(object):

    def setup_class(self):
        self.path = u'__%s__.pickle' % tm.rands(10)

    def compression_explicit(self, compression):
        # issue 11666
        if compression == 'xz':
            tm._skip_if_no_lzma()
        with tm.ensure_clean(self.path) as path:
# ---------------------
# test pickle compression
# ---------------------
def get_random_path():
    return u'__%s__.pickle' % tm.rands(10)


@pytest.mark.parametrize('compression', [None, 'gzip', 'bz2', 'xz'])
def test_compression_explicit(compression):
    # issue 11666
    if compression == 'xz':
        tm._skip_if_no_lzma()
    with tm.ensure_clean(get_random_path()) as path:
        df = tm.makeDataFrame()
        df.to_pickle(path, compression=compression)
        df2 = pd.read_pickle(path, compression=compression)
        tm.assert_frame_equal(df, df2)


@pytest.mark.parametrize('compression', ['', 'None', 'bad', '7z'])
def test_compression_explicit_bad(compression):
    with tm.assertRaisesRegexp(ValueError,
                               "Unrecognized compression type"):
        with tm.ensure_clean(get_random_path()) as path:
            df = tm.makeDataFrame()
            df.to_pickle(path, compression=compression)
            df2 = pd.read_pickle(path, compression=compression)
            tm.assert_frame_equal(df, df2)

    def test_compression_explicit(self):
        compressions = [None, 'gzip', 'bz2', 'xz']
        for c in compressions:
            yield self.compression_explicit, c

    def compression_explicit_bad(self, compression):
        with tm.assertRaisesRegexp(ValueError,
                                   "Unrecognized compression type"):
            with tm.ensure_clean(self.path) as path:
                df = tm.makeDataFrame()
                df.to_pickle(path, compression=compression)

    def test_compression_explicit_bad(self):
        compressions = ['', 'None', 'bad', '7z']
        for c in compressions:
            yield self.compression_explicit_bad, c

    def compression_infer(self, ext):
        if ext == '.xz':
            tm._skip_if_no_lzma()
        with tm.ensure_clean(self.path + ext) as path:
            df = tm.makeDataFrame()
            df.to_pickle(path)
            tm.assert_frame_equal(df, pd.read_pickle(path))

    def test_compression_infer(self):
        extensions = ['', '.gz', '.bz2', '.xz', '.no_compress']
        for ext in extensions:
            yield self.compression_infer, ext

@pytest.mark.parametrize('ext', ['', '.gz', '.bz2', '.xz', '.no_compress'])
Contributor

Would it make sense to test a .pkl extension, as this will be the most common (and shouldn't be compressed)?

Contributor

Is there anything official that says .pkl is a pickle extension? (Though by definition .pkl would NOT be compressed, while .pkl.gz would be, for example.) Also, let's have .gzip the same as .gz; I think this is common.

Contributor

The extension for pickled files is not standardized. The common ones are .pkl, .p, and .pickle.

> Also let's have .gzip the same as .gz

Currently, we don't infer the .gzip extension as gzip compression. I think that's out of scope for this PR, but something to consider for the future. Perhaps we could outsource inference to mimetypes.guess_type's encoding detection.

Contributor

@dhimmel good point. Can you create an issue?

Contributor

@jreback, I'm thinking this isn't something that I would advocate for. Here is what mimetypes does:

import mimetypes
print(mimetypes.guess_type('file-name.tsv.gz'))
print(mimetypes.guess_type('file-name.tsv.gzip'))

outputs:

('text/tab-separated-values', 'gzip')
(None, None)

So if we use mimetypes, the .gzip extension won't be recognized. We could code our compression inference to recognize both .gzip and .gz, but I don't think that would be the right decision, since inference would then let you write gzip-compressed files that end in .gzip. I'd rather stick with the extensions recognized by mimetypes.

If you'd still like to discuss this further via a dedicated issue, just let me know and I'll post it... with the caveat I'm not advocating for it 😼.
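For illustration only, here is a minimal sketch of the "specific extensions" approach being discussed; the helper name and mapping are hypothetical and not the PR's actual implementation:

def _infer_compression_from_path(path):
    # hypothetical mapping of specific extensions to compression methods
    extension_to_compression = {'.gz': 'gzip', '.bz2': 'bz2', '.xz': 'xz'}
    for ext, compression in extension_to_compression.items():
        if path.endswith(ext):
            return compression
    return None  # e.g. '.pkl', '.pickle', or '.gzip' would stay uncompressed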

Contributor

sure, specific extensions are prob fine.

def test_compression_infer(ext):
    if ext == '.xz':
        tm._skip_if_no_lzma()
    with tm.ensure_clean(get_random_path() + ext) as path:
        df = tm.makeDataFrame()
        df.to_pickle(path)
        tm.assert_frame_equal(df, pd.read_pickle(path))
Contributor

These tests just test that the round trip works, but not that inference works.

Here's some pseudo-code to more explicitly test inference.

# ext_to_compression is a hypothetical mapping from the extension under test
# to the matching explicit method, e.g. {'.gz': 'gzip', '.bz2': 'bz2', '.xz': 'xz'}
compression = ext_to_compression.get(ext)
with tm.ensure_clean(get_random_path() + ext) as infer_path, \
        tm.ensure_clean(get_random_path() + ext) as explicit_path:
    df = tm.makeDataFrame()
    df.to_pickle(infer_path)                              # compression inferred from ext
    df.to_pickle(explicit_path, compression=compression)  # compression given explicitly
    # if inference worked, the two files should be byte-for-byte identical
    with open(infer_path, 'rb') as f1, open(explicit_path, 'rb') as f2:
        assert f1.read() == f2.read()
    tm.assert_frame_equal(df, pd.read_pickle(explicit_path))

Contributor

Maybe we should add a .pkl.gz file to the test suite itself and test with it on Travis?

Contributor Author

Yes, actually, all write-then-read tests should be split into three steps:

  1. write to file1, compressed or uncompressed
  2. compress or decompress file1 into file2 using an external utility or a standalone piece of code
  3. read file2

Then compare the content read from file2 with what was written to file1. It seems I need to rewrite all the tests.
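As a rough sketch of that three-step approach (illustrative only, not the final test code; the function name is hypothetical and the gzip module plays the role of the "external" decompressor):

import gzip
import shutil

import pandas as pd
import pandas.util.testing as tm


def roundtrip_via_external_gzip():
    with tm.ensure_clean('__step__.pkl.gz') as compressed_path, \
            tm.ensure_clean('__step__.pkl') as plain_path:
        df = tm.makeDataFrame()

        # step 1: write a gzip-compressed pickle with pandas
        df.to_pickle(compressed_path, compression='gzip')

        # step 2: decompress it outside of pandas
        with gzip.open(compressed_path, 'rb') as fin, \
                open(plain_path, 'wb') as fout:
            shutil.copyfileobj(fin, fout)

        # step 3: read the decompressed file back and compare
        df2 = pd.read_pickle(plain_path, compression=None)
        tm.assert_frame_equal(df, df2)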