df.to_hdf() blocks some supported pytables compression types ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’ and ‘blosc:zstd’ #14478

dragoljub · 2016-10-23T18:10:06Z

df.to_hdf() blocks access to the following compressors offered in pytables 3.3.0: ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’ and ‘blosc:zstd’.

I would like to try blosc:lz4 compression for some of the bigger data I have to compare size and speed to LZO.

df.to_hdf(path, 'df', complib='blosc:lz4')
D:\Python27\lib\site-packages\pandas\io\pytables.pyc in __init__(self, path, mode, complevel, complib, fletcher32, **kwargs)
    434 
    435         if complib not in (None, 'blosc', 'bzip2', 'lzo', 'zlib'):
--> 436             raise ValueError("complib only supports 'blosc', 'bzip2', lzo' "
    437                              "or 'zlib' compression.")
    438 

ValueError: complib only supports 'blosc', 'bzip2', lzo' or 'zlib' compression.

jreback · 2016-10-23T18:17:59Z

done originally here: https://github.com/pandas-dev/pandas/pull/10341/files

should be easy enough to expand this list if you would like to do a PR

the check should directly introspect pytables for this validation I think
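A minimal sketch of what such an introspecting check might look like, assuming the installed PyTables exposes its compressor list as `tables.filters.all_complibs`. The helper names `supported_complibs` and `validate_complib` are hypothetical and are not pandas functions; the fallback tuple is just the list from the error message above, used when PyTables is not importable.

```python
def supported_complibs():
    """Return the compression libraries PyTables reports as available.

    Assumption: PyTables exposes this list as tables.filters.all_complibs.
    If PyTables is missing, fall back to the four names hard-coded in the
    check shown in the traceback (illustration only).
    """
    try:
        import tables  # PyTables
        return tuple(tables.filters.all_complibs)
    except (ImportError, AttributeError):
        return ("blosc", "bzip2", "lzo", "zlib")


def validate_complib(complib, supported=None):
    """Hypothetical replacement for the hard-coded membership test."""
    if complib is None:
        return None
    if supported is None:
        supported = supported_complibs()
    if complib not in supported:
        names = ", ".join(repr(name) for name in supported)
        raise ValueError("complib only supports {} compression.".format(names))
    return complib
```

With a check of this shape, the accepted names would follow whatever the installed PyTables build supports rather than a list maintained separately in pandas.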

dragoljub · 2016-10-23T18:41:14Z

@jreback thanks for the link with details. I'll take a look at some local testing and let you know if the blosc compressors work.

bashtage · 2016-10-24T13:27:06Z

@dragoljub When this patch was submitted pandas did not work with the multi-compression filters. Things might have changed.

bashtage · 2016-10-24T13:28:19Z

There is a verbal description of a case that produced incorrect results here:

#8874

dragoljub · 2016-10-24T18:08:31Z

I made this simple change in \pandas\io\pytables.py line 435 and now the different compression libraries seem to work with PyTables 3.3.0 and Pandas 0.19.0. I'll need some time to get a PR prepared with tests etc.

if complib not in (None, 'blosc', 'bzip2', 'lzo', 'zlib', 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd'):

Blosc:LZ4 reads my data about 30% faster than LZO with about the same compression ratio. I'll have to play more with different combinations of strings and floats, but so far it seems to be a nice option to have.

Unfortunately nothing comes close to the compression ratio I get with gzipped pickle files. I bet HDF5 CArray chunkshape being row-major in float32 blocks removes some of the benefits we may see with pure columnar chunked compression for columns with repeated values.

Some Benchmarks:

In [64]: %time df.to_hdf(r'df_none.h5', 'df', mode='w')
Wall time: 1.12 s

In [67]: %time df.to_hdf(r'df_lzo.h5', 'df', mode='w', complib='lzo', complevel=9)
Wall time: 378 ms

In [68]: %time df.to_hdf(r'df_lz4.h5', 'df', mode='w', complib='blosc:lz4', complevel=9)
Wall time: 357 ms

In [69]: %time df.to_hdf(r'df_zstd.h5', 'df', mode='w', complib='blosc:zstd', complevel=9)
Wall time: 28.4 s

In [70]: %time df.to_hdf(r'df_lz4hc.h5', 'df', mode='w', complib='blosc:lz4hc', complevel=9)
Wall time: 33.2 s

In [71]: %timeit  pd.read_hdf(r'df_none.h5', mode='r')
10 loops, best of 3: 134 ms per loop

In [72]: %timeit  pd.read_hdf(r'df_lzo.h5', mode='r')
1 loop, best of 3: 389 ms per loop

In [73]: %timeit  pd.read_hdf(r'df_lz4.h5', mode='r')
1 loop, best of 3: 277 ms per loop

In [74]: %timeit  pd.read_hdf(r'df_zstd.h5', mode='r')
1 loop, best of 3: 471 ms per loop

In [75]: %timeit  pd.read_hdf(r'df_lz4hc.h5', mode='r')
1 loop, best of 3: 260 ms per loop

In [76]: %time df.to_pickle(r'df.pkl')
Wall time: 1.25 s

In [77]: %timeit pd.read_pickle(r'df.pkl')
1 loop, best of 3: 228 ms per loop
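The timings above do not record file sizes, so the compression-ratio comparison with gzipped pickle cannot be read off them. A rough sketch of a helper that captures both wall time and on-disk size for any writer follows; `measure` and `write_gzipped_pickle` are hypothetical names, and the demo uses only the standard library with placeholder data rather than the DataFrame from this thread.

```python
import gzip
import os
import pickle
import tempfile
import time


def measure(write, path):
    """Call write(path) once and return (elapsed seconds, bytes on disk)."""
    start = time.perf_counter()
    write(path)
    elapsed = time.perf_counter() - start
    return elapsed, os.path.getsize(path)


def write_gzipped_pickle(obj):
    """Build a writer that pickles obj into a gzip-compressed file."""
    def _write(path):
        with gzip.open(path, "wb") as fh:
            pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return _write


if __name__ == "__main__":
    # Placeholder data with many repeated values, for illustration only.
    data = {"col": [1.0, 2.0, 3.0] * 10000}
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "data.pkl.gz")
        seconds, size = measure(write_gzipped_pickle(data), target)
        print("gzipped pickle: {:.3f} s, {} bytes".format(seconds, size))
```

With pandas and PyTables installed, the same helper could time an HDF5 write by passing something like `lambda p: df.to_hdf(p, key='df', mode='w', complib='blosc:lz4', complevel=9)` as `write`.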

jreback added the IO HDF5, Error Reporting, and Difficulty Novice labels Oct 23, 2016

jreback added this to the Next Major Release milestone Oct 23, 2016

jreback mentioned this issue Apr 7, 2017: HDFStore - raising an exception when complevel > 0 and complib is None #15943 (Closed)

linebp mentioned this issue May 2, 2017: Unblock supported compression libs in pytables #16196 (Merged)

jreback modified the milestones: 0.20.1, Next Major Release, 0.20.2 May 2, 2017

jreback closed this as completed in #16196 May 11, 2017
