HDF corrupts data when using complib='blosc:zlib' #8874

Closed
bashtage opened this issue Nov 21, 2014 · 8 comments
Labels: Bug · Compat (pandas objects compatability with Numpy or Python functions) · Error Reporting (Incorrect or improved errors from pandas) · IO HDF5 (read_hdf, HDFStore)

@bashtage
Contributor

I'm not sure if this is supported or not -- it isn't in the doc string for HDFStore, but it seems to be allowed by the HDFStore (nothing is raised).

Unfortunately, so far I can only get it to show the bad behavior on a proprietary dataset, which stores a pd.Panel containing items of mixed types.

Some of the float values are changed from small (|x|<1.0) to very large (3.xe+308).

Is the compressor just passed through to pytables? If so, this might be a pytables issue.

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.23.3.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.1-24-g54e237b
nose: 1.3.4
Cython: 0.21.1
numpy: 1.9.1
scipy: 0.14.0
statsmodels: 0.7.0.dev-c8e980d
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 1.5
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented Nov 21, 2014

hmm, I don't think you can pass in a compressor like that, it's either 'blosc' OR 'zlib'

@jreback
Contributor

jreback commented Nov 21, 2014

related to this issue: #4582

@jreback jreback added Error Reporting Incorrect or improved errors from pandas IO HDF5 read_hdf, HDFStore labels Nov 21, 2014
@bashtage
Contributor Author

You can (sometimes) in pytables - that doesn't mean it is OK for pandas though.

One example is here.

https://pytables.github.io/usersguide/utilities.html

@jreback
Contributor

jreback commented Nov 21, 2014

Interesting, I didn't know that.
Can you show a small sample of the frame and dtypes, and the code you are saving with (e.g. is it fixed or table format that you are writing)?

Sorry, you said Panel! In any event, it would be good if you could post an example.

But this is pretty much passed straight through to pytables.

@bashtage
Contributor Author

Description:

final
Out[46]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 7 (items) x 254 (major_axis) x 7592 (minor_axis)
Items axis: EXCHCD to VOL
Major_axis axis: 19960102 to 19961231
Minor_axis axis: 10001 to 93316

final.dtypes
Out[47]: 
EXCHCD      float64
PRC         float64
PRIMEXCH     object
RET         float64
SHRCLS       object
SHROUT      float64
VOL         float64
dtype: object

Code:

store = pd.HDFStore('clean_' + str(year) + '.h5', complib='blosc:zlib', complevel=7)
store['data'] = final
store.close()
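
A round-trip comparison along these lines may help isolate which values change after writing and reading back. This is only a sketch: `find_changed` is a hypothetical helper (not part of pandas or pytables), and it assumes the float values have already been flattened into plain sequences rather than kept in a Panel.

```python
import math


def find_changed(before, after, rel_tol=1e-12):
    """Return (index, before, after) for float pairs that differ after a round trip.

    Hypothetical helper for illustration; not a pandas or pytables function.
    """
    changed = []
    for i, (b, a) in enumerate(zip(before, after)):
        # NaN never compares equal to itself, so NaN on both sides counts as unchanged.
        if math.isnan(b) and math.isnan(a):
            continue
        if not math.isclose(b, a, rel_tol=rel_tol, abs_tol=0.0):
            changed.append((i, b, a))
    return changed


# Made-up values that mimic the symptom: a small float comes back very large.
written = [0.25, -0.5, 0.75, float("nan")]
read_back = [0.25, 1.5e308, 0.75, float("nan")]
changed = find_changed(written, read_back)
```

Running the same comparison with and without `complib='blosc:zlib'` would show whether the compressor setting is what alters the values.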

@jreback
Contributor

jreback commented Nov 21, 2014

In [1]: p = tm.makePanel()

In [3]: store = pd.HDFStore('test.h5',mode='w')

In [4]: store.put('p_no_comp',p)

In [5]: store.put('p',p,complib='blosc')
ValueError: Compression not supported on Fixed format stores

In [6]: store.put('p',p,complib='blosc:zlib')
In [7]: store.put('p',p,complib='blosc:zlib',format='table')

In [8]: store
Out[8]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/p                    wide_table   (typ->appendable,nrows->120,ncols->3,indexers->[major_axis,minor_axis])
/p_no_comp            wide         (shape->[3,30,4])                                                      

In [10]: store.select('p')
Out[10]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 30 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-02-11 00:00:00
Minor_axis axis: A to D

So this is something I don't really understand about HDF5.

You can open the file with compression and/or compress an individual node by using a compressed storage format (e.g. a CArray).

But this is not allowed for certain types (e.g. Panel) because of how they are stored. So you can only store them via 'table' format, not fixed (I don't remember specifically why).

So you can use what I put above to actually store it (via table format).

Separately, I think this is not reporting the errors correctly when you try to compress the entire store.
So I think the error reporting on a node is ok (and that's what #4582 is about). I just don't really understand what compressing an entire store is supposed to actually do (if anything; maybe it's not actually supported).

@jreback jreback added Bug Compat pandas objects compatability with Numpy or Python functions labels Nov 21, 2014
@jreback jreback added this to the 0.16.0 milestone Nov 21, 2014
@bashtage
Contributor Author

Thanks - hard to change old habits, but will use table in the future.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
bashtage pushed a commit to bashtage/pandas that referenced this issue Jun 12, 2015
Add check for complib when opening a HDFStore

closes pandas-dev#4582
closes pandas-dev#8874
@jreback jreback modified the milestones: 0.16.2, Next Major Release Jun 12, 2015
jreback pushed a commit that referenced this issue Jun 12, 2015
@jreback
Contributor

jreback commented Jun 12, 2015

closed by #10341
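
The kind of check described by the commit message above ("Add check for complib when opening a HDFStore") can be sketched roughly as follows. This is not the code from the pull request: `validate_complib` and `KNOWN_COMPLIBS` are hypothetical names, and the hard-coded tuple is an assumption standing in for whatever list the installed PyTables actually reports.

```python
# Assumption: only these base compressor names are accepted. A real check
# would more likely take the list from the installed PyTables.
KNOWN_COMPLIBS = ("zlib", "lzo", "bzip2", "blosc")


def validate_complib(complib):
    """Return complib if it is None or a known name, otherwise raise ValueError.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    if complib is not None and complib not in KNOWN_COMPLIBS:
        raise ValueError(
            f"complib only supports {KNOWN_COMPLIBS} compression, got {complib!r}"
        )
    return complib
```

With a check like this, opening a store with `complib='blosc:zlib'` would fail immediately with a clear error rather than silently accepting the value.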
