df.to_hdf() blocks some supported pytables compression types ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’ and ‘blosc:zstd’ #14478
This was done originally here: https://github.com/pandas-dev/pandas/pull/10341/files. It should be easy enough to expand this list if you would like to do a PR. The check should directly introspect PyTables for this validation, I think.
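Introspecting PyTables instead of hard-coding the names could look like the sketch below. This is a minimal illustration, not pandas' actual code: `validate_complib` is a hypothetical helper, and `tables.filters.all_complibs` is the module-level list in PyTables naming every compressor the installed build supports (the fallback list here is only for when PyTables is absent).

```python
# Sketch: validate `complib` against whatever the installed PyTables build
# actually supports, instead of a hard-coded tuple in pandas.
def validate_complib(complib, supported):
    """Return `complib` if it is None or listed in `supported`, else raise."""
    if complib is not None and complib not in supported:
        raise ValueError(
            "complib only supports %s compression." % ", ".join(supported))
    return complib

try:
    import tables
    supported = tables.filters.all_complibs  # authoritative list from PyTables
except ImportError:
    # Illustrative fallback when PyTables is not installed.
    supported = ['zlib', 'lzo', 'bzip2', 'blosc', 'blosc:lz4', 'blosc:lz4hc',
                 'blosc:snappy', 'blosc:zlib', 'blosc:zstd']

# With an expanded list, the blosc sub-codecs pass validation:
print(validate_complib('blosc:lz4', ['zlib', 'blosc', 'blosc:lz4']))
```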
@jreback thanks for the link with details. I'll take a look at some local testing and let you know if the blosc compressors work.
@dragoljub When this patch was submitted, pandas did not work with the multi-compression filters. Things might have changed.
There is a verbal description of a case that produced incorrect results here:
I made this simple change to the check:

if complib not in (None, 'blosc', 'bzip2', 'lzo', 'zlib', 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd'):
Unfortunately nothing comes close to the compression ratio I get with gzipped pickle files. I bet the HDF5 CArray chunkshape being row-major over float32 blocks removes some of the benefit we would see from purely columnar chunked compression on columns with repeated values.

Some benchmarks:

In [64]: %time df.to_hdf(r'df_none.h5', 'df', mode='w')
Wall time: 1.12 s
In [67]: %time df.to_hdf(r'df_lzo.h5', 'df', mode='w', complib='lzo', complevel=9)
Wall time: 378 ms
In [68]: %time df.to_hdf(r'df_lz4.h5', 'df', mode='w', complib='blosc:lz4', complevel=9)
Wall time: 357 ms
In [69]: %time df.to_hdf(r'df_zstd.h5', 'df', mode='w', complib='blosc:zstd', complevel=9)
Wall time: 28.4 s
In [70]: %time df.to_hdf(r'df_lz4hc.h5', 'df', mode='w', complib='blosc:lz4hc', complevel=9)
Wall time: 33.2 s
In [71]: %timeit pd.read_hdf(r'df_none.h5', mode='r')
10 loops, best of 3: 134 ms per loop
In [72]: %timeit pd.read_hdf(r'df_lzo.h5', mode='r')
1 loop, best of 3: 389 ms per loop
In [73]: %timeit pd.read_hdf(r'df_lz4.h5', mode='r')
1 loop, best of 3: 277 ms per loop
In [74]: %timeit pd.read_hdf(r'df_zstd.h5', mode='r')
1 loop, best of 3: 471 ms per loop
In [75]: %timeit pd.read_hdf(r'df_lz4hc.h5', mode='r')
1 loop, best of 3: 260 ms per loop
In [76]: %time df.to_pickle(r'df.pkl')
Wall time: 1.25 s
In [77]: %timeit pd.read_pickle(r'df.pkl')
1 loop, best of 3: 228 ms per loop
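The row-major chunkshape point is easy to illustrate outside HDF5: compression filters run per chunk, so row-shaped chunks interleave all the columns while column-shaped chunks keep each column's repeated values together. Below is a pure NumPy/zlib sketch (not HDF5 itself; the array contents and chunk shapes are illustrative assumptions) showing how much the chunk orientation alone can matter:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
consts = rng.random(100).astype(np.float32)  # one distinct constant per column
arr = np.tile(consts, (1000, 1))             # 1000 rows x 100 cols, each column constant

# Compress per chunk, as HDF5 filters do: row-shaped vs column-shaped chunks.
row_chunks = sum(len(zlib.compress(arr[i].tobytes(), 9))
                 for i in range(arr.shape[0]))
col_chunks = sum(len(zlib.compress(np.ascontiguousarray(arr[:, j]).tobytes(), 9))
                 for j in range(arr.shape[1]))

# Each row chunk is 100 unrelated floats (nearly incompressible); each column
# chunk is 1000 repeats of one value (compresses to almost nothing).
print(row_chunks, col_chunks)
```

The same total data compresses to a fraction of the size when chunks follow the columns, which is consistent with row-major chunkshapes eating into the ratio for columns with repeated values.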
df.to_hdf() blocks access to the following compressors offered in PyTables 3.3.0: ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’ and ‘blosc:zstd’. I would like to try blosc:lz4 compression for some of the bigger data I have, to compare size and speed to LZO.
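For a size-and-speed comparison like the one described, a small helper that records on-disk size alongside wall time is handy. This is a hedged sketch with hypothetical names (`bench_write` is not a pandas or PyTables API); the commented usage assumes a DataFrame `df` and a PyTables build with the blosc:lz4 filter:

```python
import os
import time

def bench_write(write_fn, path):
    """Call write_fn(path), returning (seconds elapsed, bytes on disk)."""
    start = time.time()
    write_fn(path)
    return time.time() - start, os.path.getsize(path)

# Example usage (assumes a DataFrame `df` and PyTables built with blosc:lz4):
#   secs, size = bench_write(
#       lambda p: df.to_hdf(p, 'df', mode='w', complib='blosc:lz4',
#                           complevel=9),
#       'df_lz4.h5')
#   print('%.2f s, %d bytes' % (secs, size))
```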