Commit 8800e84

linebp authored and jreback committed
GH15943 Fixed defaults for compression in HDF5 (#16355)
1 parent 3caf858 commit 8800e84

File tree

5 files changed

+115
-25
lines changed

doc/source/io.rst

+51-13
@@ -4067,26 +4067,64 @@ Compression
40674067
+++++++++++
40684068

40694069
``PyTables`` allows the stored data to be compressed. This applies to
4070-
all kinds of stores, not just tables.
4070+
all kinds of stores, not just tables. Two parameters are used to
4071+
control compression: ``complevel`` and ``complib``.
4072+
4073+
``complevel`` specifies if and how hard data is to be compressed.
4074+
``complevel=0`` and ``complevel=None`` disable
4075+
compression and ``0<complevel<10`` enables compression.
4076+
4077+
``complib`` specifies which compression library to use. If nothing is
4078+
specified the default library ``zlib`` is used. A
4079+
compression library usually optimizes for either good
4080+
compression rates or speed and the results will depend on
4081+
the type of data. Which type of
4082+
compression to choose depends on your specific needs and
4083+
data. The list of supported compression libraries:
4084+
4085+
- `zlib <http://zlib.net/>`_: The default compression library. A classic in terms of compression, achieves good compression rates but is somewhat slow.
4086+
- `lzo <http://www.oberhumer.com/opensource/lzo/>`_: Fast compression and decompression.
4087+
- `bzip2 <http://bzip.org/>`_: Good compression rates.
4088+
- `blosc <http://www.blosc.org/>`_: Fast compression and decompression.
4089+
4090+
.. versionadded:: 0.20.2
4091+
4092+
Support for alternative blosc compressors:
4093+
4094+
- `blosc:blosclz <http://www.blosc.org/>`_: This is the
4095+
default compressor for ``blosc``
4096+
- `blosc:lz4
4097+
<https://fastcompression.blogspot.dk/p/lz4.html>`_:
4098+
A compact, very popular and fast compressor.
4099+
- `blosc:lz4hc
4100+
<https://fastcompression.blogspot.dk/p/lz4.html>`_:
4101+
A tweaked version of LZ4, produces better
4102+
compression ratios at the expense of speed.
4103+
- `blosc:snappy <https://google.github.io/snappy/>`_:
4104+
A popular compressor used in many places.
4105+
- `blosc:zlib <http://zlib.net/>`_: A classic;
4106+
somewhat slower than the previous ones, but
4107+
achieving better compression ratios.
4108+
- `blosc:zstd <https://facebook.github.io/zstd/>`_: An
4109+
extremely well balanced codec; it provides the best
4110+
compression ratios among the others above, and at
4111+
reasonably fast speed.
4112+
4113+
If ``complib`` is defined as something other than the
4114+
listed libraries, a ``ValueError`` is raised.
40714115

4072-
- Pass ``complevel=int`` for a compression level (1-9, with 0 being no
4073-
compression, and the default)
4074-
- Pass ``complib=lib`` where lib is any of ``zlib, bzip2, lzo, blosc`` for
4075-
whichever compression library you prefer.
4116+
.. note::
40764117

4077-
``HDFStore`` will use the file based compression scheme if no overriding
4078-
``complib`` or ``complevel`` options are provided. ``blosc`` offers very
4079-
fast compression, and is my most used. Note that ``lzo`` and ``bzip2``
4080-
may not be installed (by Python) by default.
4118+
If the library specified with the ``complib`` option is missing on your platform,
4119+
compression silently falls back to ``zlib``.
40814120

4082-
Compression for all objects within the file
4121+
Enable compression for all objects within the file:
40834122

40844123
.. code-block:: python
40854124
4086-
store_compressed = pd.HDFStore('store_compressed.h5', complevel=9, complib='blosc')
4125+
store_compressed = pd.HDFStore('store_compressed.h5', complevel=9, complib='blosc:blosclz')
40874126
4088-
Or on-the-fly compression (this only applies to tables). You can turn
4089-
off file compression for a specific table by passing ``complevel=0``
4127+
Or use on-the-fly compression (this only applies to tables) in stores where compression is not enabled:
40904128

40914129
.. code-block:: python
40924130

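To make the ``complevel`` description in the documentation hunk above more concrete, here is a minimal sketch that calls Python's standard ``zlib`` and ``bz2`` modules directly. It does not go through PyTables or ``HDFStore``. The payload and variable names are illustrative assumptions, and real ratios depend heavily on the data being stored.

```python
import bz2
import zlib

# Illustrative payload (an assumption for this demo): repetitive bytes compress well.
payload = b"pandas HDFStore compression demo " * 2000

# Level 0 stores the data uncompressed plus a small header, similar in spirit to complevel=0.
stored = zlib.compress(payload, 0)
# Levels 1 through 9 trade speed for size, mirroring 0 < complevel < 10.
fast = zlib.compress(payload, 1)
small = zlib.compress(payload, 9)
# bz2 is a different library with its own speed/ratio trade-off.
bz = bz2.compress(payload, 9)

for name, blob in [("raw", payload), ("zlib-0", stored), ("zlib-1", fast),
                   ("zlib-9", small), ("bz2-9", bz)]:
    print(f"{name:7s} {len(blob):7d} bytes")

# Round trip: compression is lossless regardless of level or library.
assert zlib.decompress(small) == payload
assert bz2.decompress(bz) == payload
```

The printed sizes show why level 0 is treated as "no compression": the output is slightly larger than the input, because only framing is added.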
doc/source/whatsnew/v0.21.0.txt

+1-2
@@ -47,13 +47,12 @@ Backwards incompatible API changes
4747

4848
- Support has been dropped for Python 3.4 (:issue:`15251`)
4949
- The Categorical constructor no longer accepts a scalar for the ``categories`` keyword. (:issue:`16022`)
50-
5150
- Accessing a non-existent attribute on a closed :class:`HDFStore` will now
5251
raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`)
5352
- :func:`read_csv` now treats ``'null'`` strings as missing values by default (:issue:`16471`)
5453
- :func:`read_csv` now treats ``'n/a'`` strings as missing values by default (:issue:`16078`)
55-
5654
- :class:`pandas.HDFStore`'s string representation is now faster and less detailed. For the previous behavior, use ``pandas.HDFStore.info()``. (:issue:`16503`).
55+
- Compression defaults in HDF stores now follow PyTables standards. The default is no compression; if ``complib`` is missing and ``complevel`` > 0, ``zlib`` is used (:issue:`15943`)
5756

5857
.. _whatsnew_0210.api:
5958

pandas/core/generic.py

+2-2
@@ -1284,10 +1284,10 @@ def to_hdf(self, path_or_buf, key, **kwargs):
12841284
<http://pandas.pydata.org/pandas-docs/stable/io.html#query-via-data-columns>`__.
12851285
12861286
Applicable only to format='table'.
1287-
complevel : int, 0-9, default 0
1287+
complevel : int, 0-9, default None
12881288
Specifies a compression level for data.
12891289
A value of 0 disables compression.
1290-
complib : {'zlib', 'lzo', 'bzip2', 'blosc', None}, default None
1290+
complib : {'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib'
12911291
Specifies the compression library to be used.
12921292
As of v0.20.2 these additional compressors for Blosc are supported
12931293
(default if no compressor specified: 'blosc:blosclz'):

pandas/io/pytables.py

+8-8
@@ -411,10 +411,10 @@ class HDFStore(StringMixin):
411411
and if the file does not exist it is created.
412412
``'r+'``
413413
It is similar to ``'a'``, but the file must already exist.
414-
complevel : int, 0-9, default 0
414+
complevel : int, 0-9, default None
415415
Specifies a compression level for data.
416416
A value of 0 disables compression.
417-
complib : {'zlib', 'lzo', 'bzip2', 'blosc', None}, default None
417+
complib : {'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib'
418418
Specifies the compression library to be used.
419419
As of v0.20.2 these additional compressors for Blosc are supported
420420
(default if no compressor specified: 'blosc:blosclz'):
@@ -449,12 +449,15 @@ def __init__(self, path, mode=None, complevel=None, complib=None,
449449
"complib only supports {libs} compression.".format(
450450
libs=tables.filters.all_complibs))
451451

452+
if complib is None and complevel is not None:
453+
complib = tables.filters.default_complib
454+
452455
self._path = _stringify_path(path)
453456
if mode is None:
454457
mode = 'a'
455458
self._mode = mode
456459
self._handle = None
457-
self._complevel = complevel
460+
self._complevel = complevel if complevel else 0
458461
self._complib = complib
459462
self._fletcher32 = fletcher32
460463
self._filters = None
@@ -566,11 +569,8 @@ def open(self, mode='a', **kwargs):
566569
if self.is_open:
567570
self.close()
568571

569-
if self._complib is not None:
570-
if self._complevel is None:
571-
self._complevel = 9
572-
self._filters = _tables().Filters(self._complevel,
573-
self._complib,
572+
if self._complevel and self._complevel > 0:
573+
self._filters = _tables().Filters(self._complevel, self._complib,
574574
fletcher32=self._fletcher32)
575575

576576
try:
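The default-resolution logic that the ``pandas/io/pytables.py`` hunks above introduce can be sketched as a small standalone function. This is an illustration under stated assumptions, not the pandas implementation: ``resolve_compression``, ``SUPPORTED_COMPLIBS`` and ``DEFAULT_COMPLIB`` are hypothetical names. The tuple stands in for ``tables.filters.all_complibs``, using the names listed in the documentation hunk. ``"zlib"`` stands in for ``tables.filters.default_complib``, as the new test expects. The exact validation condition is also an assumption, since the diff shows only the error message.

```python
# Hypothetical stand-ins for tables.filters.all_complibs and
# tables.filters.default_complib (assumptions for this sketch).
SUPPORTED_COMPLIBS = (
    "zlib", "lzo", "bzip2", "blosc",
    "blosc:blosclz", "blosc:lz4", "blosc:lz4hc",
    "blosc:snappy", "blosc:zlib", "blosc:zstd",
)
DEFAULT_COMPLIB = "zlib"


def resolve_compression(complevel=None, complib=None):
    """Return (complevel, complib, use_filters) following the patched defaults."""
    # Reject unknown libraries first, as the surrounding __init__ context does.
    if complib is not None and complib not in SUPPORTED_COMPLIBS:
        raise ValueError(
            "complib only supports {libs} compression.".format(
                libs=SUPPORTED_COMPLIBS))

    # New in this commit: a level without a library picks the default library.
    if complib is None and complevel is not None:
        complib = DEFAULT_COMPLIB

    # New in this commit: a missing level is normalised to 0.
    complevel = complevel if complevel else 0

    # Filters are only built when the level is positive (see open()).
    use_filters = complevel > 0
    return complevel, complib, use_filters


print(resolve_compression(complevel=9))      # level only
print(resolve_compression(complib="zlib"))   # library only
print(resolve_compression())                 # neither
```

Setting only ``complib`` leaves ``use_filters`` false, which matches the new test's expectation that a library without a level leaves compression disabled.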

pandas/tests/io/test_pytables.py

+53
@@ -736,6 +736,59 @@ def test_put_compression_blosc(self):
736736
store.put('c', df, format='table', complib='blosc')
737737
tm.assert_frame_equal(store['c'], df)
738738

739+
def test_complibs_default_settings(self):
740+
# GH15943
741+
df = tm.makeDataFrame()
742+
743+
# Set complevel and check if complib is automatically set to
744+
# default value
745+
with ensure_clean_path(self.path) as tmpfile:
746+
df.to_hdf(tmpfile, 'df', complevel=9)
747+
result = pd.read_hdf(tmpfile, 'df')
748+
tm.assert_frame_equal(result, df)
749+
750+
with tables.open_file(tmpfile, mode='r') as h5file:
751+
for node in h5file.walk_nodes(where='/df', classname='Leaf'):
752+
assert node.filters.complevel == 9
753+
assert node.filters.complib == 'zlib'
754+
755+
# Set complib and check to see if compression is disabled
756+
with ensure_clean_path(self.path) as tmpfile:
757+
df.to_hdf(tmpfile, 'df', complib='zlib')
758+
result = pd.read_hdf(tmpfile, 'df')
759+
tm.assert_frame_equal(result, df)
760+
761+
with tables.open_file(tmpfile, mode='r') as h5file:
762+
for node in h5file.walk_nodes(where='/df', classname='Leaf'):
763+
assert node.filters.complevel == 0
764+
assert node.filters.complib is None
765+
766+
# Check if not setting complib or complevel results in no compression
767+
with ensure_clean_path(self.path) as tmpfile:
768+
df.to_hdf(tmpfile, 'df')
769+
result = pd.read_hdf(tmpfile, 'df')
770+
tm.assert_frame_equal(result, df)
771+
772+
with tables.open_file(tmpfile, mode='r') as h5file:
773+
for node in h5file.walk_nodes(where='/df', classname='Leaf'):
774+
assert node.filters.complevel == 0
775+
assert node.filters.complib is None
776+
777+
# Check if file-defaults can be overridden on a per table basis
778+
with ensure_clean_path(self.path) as tmpfile:
779+
store = pd.HDFStore(tmpfile)
780+
store.append('dfc', df, complevel=9, complib='blosc')
781+
store.append('df', df)
782+
store.close()
783+
784+
with tables.open_file(tmpfile, mode='r') as h5file:
785+
for node in h5file.walk_nodes(where='/df', classname='Leaf'):
786+
assert node.filters.complevel == 0
787+
assert node.filters.complib is None
788+
for node in h5file.walk_nodes(where='/dfc', classname='Leaf'):
789+
assert node.filters.complevel == 9
790+
assert node.filters.complib == 'blosc'
791+
739792
def test_complibs(self):
740793
# GH14478
741794
df = tm.makeDataFrame()
