BUG: to_hdf and HDFStore raise KeyError for DataFrame subclasses #33748

Closed
sytham opened this issue Apr 23, 2020 · 4 comments · Fixed by #38262
Labels
Enhancement, IO HDF5 (read_hdf, HDFStore)
Milestone

Comments


sytham commented Apr 23, 2020

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.


Code Sample, a copy-pastable example

import pandas as pd


class SubDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        # Make pandas operations return SubDataFrame instead of DataFrame.
        return SubDataFrame


sdf = SubDataFrame({'a': [1, 2], 'b': [3, 4]})

# Fails with KeyError.
sdf.to_hdf('test.h5', 'test')

# Also fails with KeyError.
with pd.HDFStore('test.h5') as store:
    store.put('test', sdf)

Problem description

to_hdf() and HDFStore.put() fail for DataFrame subclasses.

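This happens because of an inconsistency in _create_storer in pandas/io/pytables.py. On entry, the value is checked with isinstance(), which a DataFrame subclass passes. Later, at line 1578 (or thereabouts), _TYPE_MAP is accessed by type(value), and the subclass is not a key in that map, so the lookup raises KeyError.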
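
The mismatch can be reproduced without pandas or PyTables. The sketch below is not the pandas source: `Frame`, `SubFrame`, `TYPE_MAP`, `create_storer_kind`, and `outcome` are hypothetical stand-ins (for DataFrame, the subclass, _TYPE_MAP, and _create_storer respectively). It only illustrates why an isinstance() guard followed by a lookup keyed on type() lets a subclass through and then fails.

```python
# Hypothetical stand-ins; this is not the pandas implementation.
class Frame:
    """Plays the role of pd.DataFrame."""


class SubFrame(Frame):
    """Plays the role of a DataFrame subclass."""


# Lookup table keyed by the exact class, like _TYPE_MAP in the description above.
TYPE_MAP = {Frame: "frame"}


def create_storer_kind(value):
    # Entry check: accepts Frame and every subclass of Frame.
    if value is not None and not isinstance(value, Frame):
        raise TypeError("value must be None or Frame")
    # Lookup: keyed by the exact class, so a subclass is missing from the map.
    return TYPE_MAP[type(value)]


def outcome(value):
    """Return the storer kind, or the name of the exception that was raised."""
    try:
        return create_storer_kind(value)
    except (KeyError, TypeError) as exc:
        return type(exc).__name__


print(outcome(Frame()))     # frame
print(outcome(SubFrame()))  # KeyError
print(outcome(42))          # TypeError
```

The subclass instance passes the guard meant to produce a clear TypeError, then hits the dictionary lookup, which is the confusing KeyError reported here.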

Expected Output

The check upon entry of _create_storer should at least be consistent with the way _TYPE_MAP is accessed. So if the choice is not to support writing DataFrame subclasses to HDF, instead of a KeyError, a TypeError("value must be None, Series, or DataFrame") should be raised.

But ideally, storing subclasses of DataFrame to HDF should be supported.
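
One way such support could look, as a rough sketch only: resolve the storage kind with isinstance() as well, so that the lookup agrees with the entry check. `Series`, `Frame`, and `SubFrame` are hypothetical stand-ins so the snippet runs without pandas, and `KINDS` and `resolve_kind` are invented names, not pandas APIs. This is not necessarily how the eventual fix was implemented.

```python
# Hypothetical stand-ins for the pandas classes; not the pandas implementation.
class Series:
    pass


class Frame:
    pass


class SubFrame(Frame):
    pass


# Ordered (base class, storage kind) pairs, checked with isinstance().
KINDS = ((Series, "series"), (Frame, "frame"))


def resolve_kind(value):
    """Return the storage kind for value, accepting subclasses of the bases."""
    for base, kind in KINDS:
        if isinstance(value, base):
            return kind
    raise TypeError("value must be Series or Frame")


print(resolve_kind(Frame()))     # frame
print(resolve_kind(SubFrame()))  # frame
```

This only covers the write-side lookup. Which type is returned when the data is read back is a separate design question that the sketch does not address.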

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.111-1.el7.centos.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3.post20200325
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.10.2
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fastparquet : 0.3.2
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : 0.4.0
scipy : 1.2.1
sqlalchemy : 1.3.11
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.46.0

sytham added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels on Apr 23, 2020
Contributor

jreback commented Apr 24, 2020

This is possible, but it would need a community PR to support.

jreback added the Enhancement and IO HDF5 (read_hdf, HDFStore) labels and removed the Bug and Needs Triage labels on Apr 24, 2020
jreback added this to the Contributions Welcome milestone on Apr 24, 2020
Author

sytham commented Apr 24, 2020

Well, I'd be happy to change line 1528 of _create_storer from
if value is not None and not isinstance(value, (Series, DataFrame)):
to
if value is not None and type(value) not in (Series, DataFrame):
to correctly reject subclasses of Series or DataFrame, which are currently not supported. This is why I initially labeled it a bug, so maybe this ticket could be split in two: one for this correction and one for implementing the actual support?
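
The effect of the proposed one-line change can be sketched in isolation. `Series`, `Frame`, and `SubFrame` are hypothetical stand-ins for the pandas classes, and `check_value` is an invented wrapper around the two conditions quoted above, with a `strict` flag that exists only to compare them side by side.

```python
# Hypothetical stand-ins for the pandas classes; not the pandas implementation.
class Series:
    pass


class Frame:
    pass


class SubFrame(Frame):
    pass


def check_value(value, strict):
    """Apply the current (strict=False) or proposed (strict=True) entry check."""
    if strict:
        # Proposed: exact type match, so subclasses are rejected up front.
        rejected = value is not None and type(value) not in (Series, Frame)
    else:
        # Current: isinstance(), so subclasses pass the entry check.
        rejected = value is not None and not isinstance(value, (Series, Frame))
    if rejected:
        raise TypeError("value must be None, Series, or DataFrame")
    return True


print(check_value(SubFrame(), strict=False))  # True
try:
    check_value(SubFrame(), strict=True)
except TypeError as exc:
    print(exc)  # value must be None, Series, or DataFrame
```

With the exact-type check, a subclass fails early with the documented TypeError instead of reaching the later lookup.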


yangyxt commented Aug 1, 2020

I ran into a similar error when simply appending DataFrame chunks to an HDF5 file.
I have confirmed that the dtypes of the first chunk and of the following chunks are exactly the same for all columns. BTW, I have 130 columns, and all of them are converted to object dtype before writing to the HDF5 file.
Still, I got this error message:

Traceback (most recent call last):
  File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 3881, in create_axes
    b, b_items = by_items.pop(items)
KeyError: ('1000g2015aug_all', 'AAChange.refGene', 'Alt', 'Alt.1', 'CADD_phred', 'CADD_raw', 'CADD_raw_rankscore', 'CLNALLELEID', 'CLNDISDB', 'CLNDN', 'CLNREVSTAT', 'CLNSIG', 'Chr', 'DANN_rankscore', 'DANN_score', 'Eigen-PC-raw', 'Eigen-r

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/paedwy/disk1/yangyxt/ngs_scripts/Reformatting_hg19_multianno_txt.py", line 243, in reformat
    if len(new_chunk) > 0: new_chunk.to_hdf(tmp_hdf5_path, key='table', format='table', append=True)
  File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 2530, in to_hdf
    pytables.to_hdf(path_or_buf, key, self, **kwargs)
  File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 278, in to_hdf
    f(store)
  File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 269, in <lambda>
    f = lambda store: store.append(key, value, **kwargs)
  File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1059, in append
    self._write_to_group(key, value, append=append, dropna=dropna, **kwargs)
  File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 1525, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4615, in write
    return super().write(obj=obj, data_columns=data_columns, **kwargs)
  File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 4194, in write
    axes=axes, obj=obj, validate=append, min_itemsize=min_itemsize, **kwargs
  File "/home/yangyxt/anaconda3/lib/python3.6/site-packages/pandas/io/pytables.py", line 3888, in create_axes
    items=(",".join(pprint_thing(item) for item in items))
ValueError: cannot match existing table structure for [1000g2015aug_all,AAChange.refGene,Alt,Alt.1,CADD_phred,CADD_raw,CADD_raw_rankscore,CLNALLELEID,CLNDISDB,CLNDN,CLNREVSTAT,CLNSIG,Chr,DANN_rankscore,DANN_score,Eigen-PC-raw,Eigen-raw,

I didn't find a solution in the existing issues; this one is the most relevant. Please help debug it. Thanks!

Author

sytham commented Aug 2, 2020

That doesn't look like it's the same issue. This thread is specifically about storing subclasses of DataFrame to HDF.

jreback modified the milestones: Contributions Welcome, 1.2 on Dec 4, 2020
jreback modified the milestones: Contributions Welcome, 1.3 on Dec 19, 2020
3 participants