BUG: to_hdf and HDFStore raise KeyError for DataFrame subclasses #33748

sytham · 2020-04-23T15:30:32Z

[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.

Code Sample, a copy-pastable example

import pandas as pd

class SubDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return SubDataFrame

sdf = SubDataFrame({'a': [1, 2], 'b': [3, 4]})

# fails with KeyError
sdf.to_hdf('test.h5', 'test')

# also fails with KeyError
with pd.HDFStore('test.h5') as store:
    store.put('test', sdf)

Problem description

to_hdf() and HDFStore.put() fail for DataFrame subclasses.

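This happens because, in _create_storer in pandas/io/pytables.py (line 1578 or thereabouts), _TYPE_MAP is indexed by type(value), whereas the check on entry to _create_storer uses isinstance(). A subclass instance therefore passes the isinstance() check but has no entry of its own in _TYPE_MAP, so the lookup raises KeyError.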
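The mismatch can be reproduced without pandas. What follows is only a minimal sketch: `Frame`, `SubFrame`, `TYPE_MAP` and `create_storer` are hypothetical stand-ins that mimic the pattern described above, not the actual pandas internals.

```python
# Hypothetical stand-ins for DataFrame and a user-defined subclass.
class Frame:
    pass


class SubFrame(Frame):
    pass


# Stand-in for a mapping keyed by the exact class.
TYPE_MAP = {Frame: "frame"}


def create_storer(value):
    # Entry check: isinstance() accepts subclasses of Frame.
    if not isinstance(value, Frame):
        raise TypeError("value must be a Frame")
    # Lookup: type() returns the exact class, so a subclass is not a key.
    return TYPE_MAP[type(value)]


print(create_storer(Frame()))

try:
    create_storer(SubFrame())
except KeyError as exc:
    # The subclass passed the entry check but failed the lookup.
    print("KeyError:", exc)
```

The base class goes through both steps, while the subclass gets past the entry check and only fails at the dictionary lookup, which is why the user sees a KeyError rather than a TypeError.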

Expected Output

The check upon entry of _create_storer should at least be consistent with the way _TYPE_MAP is accessed. So if the choice is not to support writing DataFrame subclasses to HDF, instead of a KeyError, a TypeError("value must be None, Series, or DataFrame") should be raised.

But ideally, storing subclasses of DataFrame to HDF should be supported.
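One way such support could look, purely as a sketch, is to resolve the storage kind with isinstance() instead of an exact-type lookup. The names below (`BaseSeries`, `BaseFrame`, `SubFrame`, `TYPE_MAP`, `lookup_kind`) are hypothetical stand-ins, and this is not claimed to be how pandas does or should implement it.

```python
# Hypothetical stand-ins for Series, DataFrame and a user-defined subclass.
class BaseSeries:
    pass


class BaseFrame:
    pass


class SubFrame(BaseFrame):
    pass


# Stand-in mapping from supported base classes to a storage kind.
TYPE_MAP = {BaseSeries: "series", BaseFrame: "frame"}


def lookup_kind(value):
    """Return the storage kind for value, accepting subclasses."""
    for base, kind in TYPE_MAP.items():
        # isinstance() matches the base class and anything derived from it.
        if isinstance(value, base):
            return kind
    raise TypeError("value must be a BaseSeries or BaseFrame")


print(lookup_kind(SubFrame()))
```

With this approach the entry check and the lookup agree by construction, because both rely on isinstance().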

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.111-1.el7.centos.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3.post20200325
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.10.2
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fastparquet : 0.3.2
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : 0.4.0
scipy : 1.2.1
sqlalchemy : 1.3.11
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.46.0


jreback · 2020-04-24T03:08:44Z

this is possible but would need a community PR to support

sytham · 2020-04-24T06:29:58Z

Well, I'd be happy to change line 1528 of _create_storer from
if value is not None and not isinstance(value, (Series, DataFrame)):
to
if value is not None and type(value) not in (Series, DataFrame):
to correctly catch that subclasses of Series or DataFrame are currently not supported (this is why I initially labeled it a bug -- so maybe this ticket could be split in two then, one for this correction, one for implementing the actual support?).
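The effect of the stricter check can be sketched outside pandas. `BaseSeries`, `BaseFrame`, `SubFrame` and `check_value` below are hypothetical stand-ins used only for illustration; the condition and error message mirror the ones quoted in this thread.

```python
# Hypothetical stand-ins for Series, DataFrame and a user-defined subclass.
class BaseSeries:
    pass


class BaseFrame:
    pass


class SubFrame(BaseFrame):
    pass


def check_value(value):
    # Exact-type check: subclasses are rejected up front with a TypeError
    # instead of failing later with a KeyError.
    if value is not None and type(value) not in (BaseSeries, BaseFrame):
        raise TypeError("value must be None, Series, or DataFrame")
    return value


check_value(None)         # accepted
check_value(BaseFrame())  # accepted

try:
    check_value(SubFrame())
except TypeError as exc:
    print("TypeError:", exc)
```

This does not add subclass support; it only makes the rejection explicit and consistent with an exact-type lookup.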

yangyxt · 2020-08-01T16:23:35Z

I have run into a similar error by just appending pandas chunk dfs to an HDF5 file.
I have confirmed that the first chunk dtypes and the following chunk dtypes are exactly the same for all the columns. BTW, I have 130 columns and all of them are converted to object dtype before writing to hdf5 file.
Still, I got this error message:

`Traceback (most recent call last):
File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 3881, in create_axes
b, b_items = by_items.pop(items)
KeyError: ('1000g2015aug_all', 'AAChange.refGene', 'Alt', 'Alt.1', 'CADD_phred', 'CADD_raw', 'CADD_raw_rankscore', 'CLNALLELEID', 'CLNDISDB', 'CLNDN', 'CLNREVSTAT', 'CLNSIG', 'Chr', 'DANN_rankscore', 'DANN_score', 'Eigen-PC-raw', 'Eigen-r

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/paedwy/disk1/yangyxt/ngs_scripts/Reformatting_hg19_multianno_txt.py", line 243, in reformat
if len(new_chunk) > 0: new_chunk.to_hdf(tmp_hdf5_path, key='table', format='table', append=True)
File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 2530, in to_hdf
pytables.to_hdf(path_or_buf, key, self, **kwargs)
File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 278, in to_hdf
f(store)
File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 269, in <lambda>
f = lambda store: store.append(key, value, **kwargs)
File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1059, in append
self._write_to_group(key, value, append=append, dropna=dropna, **kwargs)
File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1525, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4615, in write
return super().write(obj=obj, data_columns=data_columns, **kwargs)
File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4194, in write
axes=axes, obj=obj, validate=append, min_itemsize=min_itemsize, **kwargs
File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 3888, in create_axes
items=(",".join(pprint_thing(item) for item in items))
ValueError: cannot match existing table structure for [1000g2015aug_all,AAChange.refGene,Alt,Alt.1,CADD_phred,CADD_raw,CADD_raw_rankscore,CLNALLELEID,CLNDISDB,CLNDN,CLNREVSTAT,CLNSIG,Chr,DANN_rankscore,DANN_score,Eigen-PC-raw,Eigen-raw,`

I didn't find any solutions in the issues. This issue post is the most relevant one. Pls help debug it. Thanks!

sytham · 2020-08-02T13:03:15Z

That doesn't look like it's the same issue. This thread is specifically about storing subclasses of DataFrame to HDF.

sytham added the Bug and Needs Triage labels Apr 23, 2020

jreback added the Enhancement and IO HDF5 labels and removed the Bug and Needs Triage labels Apr 24, 2020

jreback added this to the Contributions Welcome milestone Apr 24, 2020

ivanovmg mentioned this issue Dec 3, 2020: BUG: to_hdf and HDFStore for subclasses #38262 (Merged)

jreback modified the milestones: Contributions Welcome, 1.2 Dec 4, 2020

jreback modified the milestones: Contributions Welcome, 1.3 Dec 19, 2020

jreback closed this as completed in #38262 Dec 19, 2020
