df.to_hdf() blocks some supported pytables compression types ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’ and ‘blosc:zstd’ #14478

Closed
dragoljub opened this issue Oct 23, 2016 · 5 comments · Fixed by #16196
Labels: Error Reporting (incorrect or improved errors from pandas), IO HDF5 (read_hdf, HDFStore)

@dragoljub

df.to_hdf() blocks access to the following compressors offered in pytables 3.3.0: ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’ and ‘blosc:zstd’.

I would like to try blosc:lz4 compression for some of the bigger data I have to compare size and speed to LZO.

df.to_hdf(path, 'df', complib='blosc:lz4')
D:\Python27\lib\site-packages\pandas\io\pytables.pyc in __init__(self, path, mode, complevel, complib, fletcher32, **kwargs)
    434 
    435         if complib not in (None, 'blosc', 'bzip2', 'lzo', 'zlib'):
--> 436             raise ValueError("complib only supports 'blosc', 'bzip2', lzo' "
    437                              "or 'zlib' compression.")
    438 

ValueError: complib only supports 'blosc', 'bzip2', lzo' or 'zlib' compression.
@jreback
Contributor

jreback commented Oct 23, 2016

This check was originally added here: https://github.com/pandas-dev/pandas/pull/10341/files

It should be easy enough to expand the list if you would like to do a PR.

I think the check should introspect pytables directly for this validation.
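One way the introspection suggested above could look is sketched here. It assumes PyTables exposes its known compression library names as `tables.filters.all_complibs`; the helper names `supported_complibs` and `validate_complib` are hypothetical and are not part of pandas, and the fallback tuple simply mirrors the four names in the current hard-coded check.

```python
def supported_complibs():
    """Return the complib names PyTables reports, or a fallback if it is missing."""
    try:
        import tables  # PyTables

        return tuple(tables.filters.all_complibs)
    except ImportError:
        # Assumption: fall back to the four names the current check hard-codes.
        return ("zlib", "lzo", "bzip2", "blosc")


def validate_complib(complib, supported=None):
    """Raise ValueError unless complib is None or one of the supported names."""
    if supported is None:
        supported = supported_complibs()
    if complib is not None and complib not in supported:
        raise ValueError(
            "complib only supports {} compression.".format(sorted(supported))
        )
    return complib
```

With a check like this, names such as 'blosc:lz4' would be accepted whenever the installed PyTables lists them, without pandas having to maintain its own list.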

jreback added the IO HDF5 (read_hdf, HDFStore), Error Reporting (incorrect or improved errors from pandas), and Difficulty Novice labels on Oct 23, 2016
jreback added this to the Next Major Release milestone on Oct 23, 2016
@dragoljub
Author

@jreback thanks for the link with details. I'll take a look at some local testing and let you know if the blosc compressors work.

@bashtage
Contributor

@dragoljub When this patch was submitted, pandas did not work with the multi-compression filters. Things might have changed since then.

@bashtage
Contributor

There is a verbal description of a case that produced incorrect results here:

#8874

@dragoljub
Author

I made this simple change in \pandas\io\pytables.py line 435 and now the different compression libraries seem to work with PyTables 3.3.0 and Pandas 0.19.0. I'll need some time to get a PR prepared with tests etc.

if complib not in (None, 'blosc', 'bzip2', 'lzo', 'zlib', 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd'):

Blosc:LZ4 reads my data about 30% faster than LZO with about the same compression ratio. I'll have to play more with different combinations of strings and floats, but so far it seems to be a nice option to have.

Unfortunately, nothing comes close to the compression ratio I get with gzipped pickle files. I suspect that the HDF5 CArray chunkshape, being row-major in float32 blocks, removes some of the benefit we would otherwise see from purely columnar chunked compression on columns with repeated values.
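For readers who want to reproduce the gzipped-pickle comparison, here is a minimal sketch using only the standard library. How the gzipped pickles in this comment were actually produced is not stated, so the helper names `dump_gzip_pickle` and `load_gzip_pickle` and the compression level are assumptions.

```python
import gzip
import os
import pickle


def dump_gzip_pickle(obj, path, compresslevel=9):
    """Pickle obj through a gzip stream and return the resulting file size in bytes."""
    with gzip.open(path, "wb", compresslevel=compresslevel) as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return os.path.getsize(path)


def load_gzip_pickle(path):
    """Read back an object written by dump_gzip_pickle."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)
```

Any picklable object works, including a DataFrame, so the returned size can be compared directly against the size of the .h5 files.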

Some Benchmarks:

In [64]: %time df.to_hdf(r'df_none.h5', 'df', mode='w')
Wall time: 1.12 s

In [67]: %time df.to_hdf(r'df_lzo.h5', 'df', mode='w', complib='lzo', complevel=9)
Wall time: 378 ms

In [68]: %time df.to_hdf(r'df_lz4.h5', 'df', mode='w', complib='blosc:lz4', complevel=9)
Wall time: 357 ms

In [69]: %time df.to_hdf(r'df_zstd.h5', 'df', mode='w', complib='blosc:zstd', complevel=9)
Wall time: 28.4 s

In [70]: %time df.to_hdf(r'df_lz4hc.h5', 'df', mode='w', complib='blosc:lz4hc', complevel=9)
Wall time: 33.2 s

In [71]: %timeit  pd.read_hdf(r'df_none.h5', mode='r')
10 loops, best of 3: 134 ms per loop

In [72]: %timeit  pd.read_hdf(r'df_lzo.h5', mode='r')
1 loop, best of 3: 389 ms per loop

In [73]: %timeit  pd.read_hdf(r'df_lz4.h5', mode='r')
1 loop, best of 3: 277 ms per loop

In [74]: %timeit  pd.read_hdf(r'df_zstd.h5', mode='r')
1 loop, best of 3: 471 ms per loop

In [75]: %timeit  pd.read_hdf(r'df_lz4hc.h5', mode='r')
1 loop, best of 3: 260 ms per loop

In [76]: %time df.to_pickle(r'df.pkl')
Wall time: 1.25 s

In [77]: %timeit pd.read_pickle(r'df.pkl')
1 loop, best of 3: 228 ms per loop
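The timings above do not include file sizes, which matter for the compression-ratio comparison. A rough sketch of a helper that records both write time and size per complib follows. The names `time_write` and `compare_complibs` are hypothetical, it assumes Python 3 (for `time.perf_counter`), and it assumes the installed pandas and PyTables accept every complib passed in.

```python
import os
import tempfile
import time


def time_write(write, path):
    """Call write(path) once and return (elapsed_seconds, file_size_bytes)."""
    start = time.perf_counter()
    write(path)
    elapsed = time.perf_counter() - start
    return elapsed, os.path.getsize(path)


def compare_complibs(df, complibs, complevel=9, directory=None):
    """Write df once per complib and collect {label: (seconds, bytes)}."""
    directory = directory or tempfile.mkdtemp()
    results = {}
    for complib in complibs:
        label = "none" if complib is None else complib
        # Colons are not valid in Windows file names, so replace them.
        path = os.path.join(directory, "df_{}.h5".format(label.replace(":", "_")))
        # Use level 0 for the uncompressed baseline.
        level = 0 if complib is None else complevel

        def write(p, complib=complib, level=level):
            df.to_hdf(p, key="df", mode="w", complib=complib, complevel=level)

        results[label] = time_write(write, path)
    return results
```

For example, compare_complibs(df, [None, 'lzo', 'blosc:lz4']) would return one (seconds, bytes) pair per library. A single timed write is noisy, so repeated runs would be needed for numbers comparable to %timeit.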
