HDFStore: Fix empty result of keys() method on non-pandas hdf5 file #32401

Closed
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -340,6 +340,7 @@ I/O
- Bug in :class:`HDFStore` that caused it to set to ``int64`` the dtype of a ``datetime64`` column when reading a DataFrame in Python 3 from fixed format written in Python 2 (:issue:`31750`)
- Bug in :meth:`read_excel` where a UTF-8 string with a high surrogate would cause a segmentation violation (:issue:`23809`)
- Bug in :meth:`read_csv` was causing a file descriptor leak on an empty file (:issue:`31488`)
- :meth:`HDFStore.keys` now lists pandas-native tables first, and falls back to the native HDF5 table names when there are none (:issue:`29916`)


Plotting
9 changes: 8 additions & 1 deletion pandas/io/pytables.py
@@ -583,13 +583,20 @@ def __exit__(self, exc_type, exc_value, traceback):
def keys(self) -> List[str]:
"""
Return a list of keys corresponding to objects stored in HDFStore.
If the store contains pandas native tables, it will return their names.
Contributor:
This is not a good API.

Either:
a) always return all tables (API breaking), or
b) add a keyword include='pandas'|'native'|'all'; the default would be 'pandas'.
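Option (b)'s dispatch could be sketched as a standalone function. This is only an illustration of the proposal, not the released API; the function name, the `include` keyword, and the example path names are all hypothetical:

```python
def keys(pandas_paths, native_paths, include="pandas"):
    # Hypothetical dispatch for the proposed `include` keyword.
    if include == "pandas":
        # Backwards-compatible default: only pandas-native objects.
        return list(pandas_paths)
    if include == "native":
        # Only native HDF5 Table nodes.
        return list(native_paths)
    if include == "all":
        # Everything, pandas-native objects first.
        return list(pandas_paths) + list(native_paths)
    raise ValueError(
        f"include must be 'pandas', 'native' or 'all', got {include!r}"
    )

# A store holding one pandas DataFrame plus one raw PyTables table:
print(keys(["/df"], ["/group/table1"], include="all"))  # ['/df', '/group/table1']
```

Under this scheme the default keeps the current (pandas-only) behaviour, so no existing caller breaks, while native-table discovery becomes opt-in.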


Note I am not the PR author but reported the issue.

It really depends on what you want to achieve. IMHO, better compatibility would be achieved with option (a), which basically reverts #21543 for keys() while leaving groups() untouched. I don't know if that breaks anything else, and it might hurt performance in some cases, although the original bug #21372 complained about the performance of groups() (it links to #17593 about slow keys()). There is also a walk() method which is very similar to groups().

Option (b) would break backwards compatibility, with respect to pandas < 1.0.x, for all (future) projects that use that keyword.

Contributor Author:

The latest version is trying to keep some backwards compatibility with pandas 0.23 (plus or minus a minor revision). Apparently there are some tools that depend on the original behavior (hence the issue which this PR is trying to fix). So either we keep breaking backward compatibility or we add a keyword.
I think this is a matter of taste/policy.

Contributor Author:

@jreback I think this is a trade-off between having backwards-compatible behavior with v0.23.x vs. a clean API.
I see one vote from @st-bender for being backwards compatible, and one vote from @jreback for the clean API.
Not sure whose vote has more weight (:smile: but I can make a guess); I have no strong feelings either way.


Thanks @roberthdevries for pushing this forward, and thanks @jreback for the feedback.
As I see it, there is no fully compatible way to change that behaviour back to <=0.23.x, and apparently I am the only one who noticed, when dask.dataframe.read_hdf() suddenly failed to read some files which worked before (dask/dask#5934).

It seems that dask.dataframe is currently the only serious "user" of that interface, and I filed dask/dask#5934 to decide whether it is better handled within dask instead of pandas. The change there would be similarly simple and would probably be more backwards compatible with older pandas versions.

Otherwise, the list of names of native HDF5 Table objects is returned.

Returns
-------
list
List of absolute path-names (i.e. with a leading '/').
"""
-        return [n._v_pathname for n in self.groups()]
+        objects = [n._v_pathname for n in self.groups()]
+        if objects:
+            return objects
+
+        assert self._handle is not None  # mypy
+        return [n._v_pathname for n in self._handle.walk_nodes("/", classname="Table")]

def __iter__(self):
return iter(self.keys())
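Stripped of the PyTables plumbing, the patched method's fall-back logic reduces to: prefer the pandas-native path names, and only when there are none report the native Table nodes. A minimal sketch of that dispatch (the function name and list contents are illustrative, not part of the pandas API):

```python
from typing import List

def keys_with_fallback(pandas_paths: List[str], native_paths: List[str]) -> List[str]:
    # Mirrors the patched HDFStore.keys(): pandas-native groups take precedence;
    # native HDF5 Tables are reported only when no pandas objects exist at all.
    return pandas_paths if pandas_paths else native_paths

# A store written by pandas: native tables stay hidden.
print(keys_with_fallback(["/df1", "/df2"], ["/group/table1"]))  # ['/df1', '/df2']
# A plain PyTables file: fall back to the native table names.
print(keys_with_fallback([], ["/group/table1"]))  # ['/group/table1']
```

Note the all-or-nothing shape of the fallback: a store mixing pandas and non-pandas tables still reports only the pandas ones, which is the behaviour the reviewers debate above.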
23 changes: 23 additions & 0 deletions pandas/tests/io/pytables/test_store.py
@@ -296,6 +296,29 @@ def test_keys(self, setup_path):
         assert set(store.keys()) == expected
         assert set(store) == expected

+    def test_non_pandas_keys(self, setup_path):
+        # GH 29916
+        class Table1(tables.IsDescription):
+            value1 = tables.Float32Col()
+
+        class Table2(tables.IsDescription):
+            value2 = tables.Float32Col()
+
+        class Table3(tables.IsDescription):
+            value3 = tables.Float32Col()
+
+        with ensure_clean_path(setup_path) as path:
+            with tables.open_file(path, mode="w") as h5file:
+                group = h5file.create_group("/", "group")
+                h5file.create_table(group, "table1", Table1, "Table 1")
+                h5file.create_table(group, "table2", Table2, "Table 2")
+                h5file.create_table(group, "table3", Table3, "Table 3")
+            with HDFStore(path) as store:
+                assert len(store.keys()) == 3
+                expected = {"/group/table1", "/group/table2", "/group/table3"}
+                assert set(store.keys()) == expected
+                assert set(store) == expected

def test_keys_ignore_hdf_softlink(self, setup_path):

# GH 20523