HDFStore: Fix empty result of keys() method on non-pandas hdf5 file #32401

@roberthdevries (Contributor) commented Mar 2, 2020

First, try to get pandas-style tables; if there are none, return the list of non-pandas (HDF5-native) tables in the file (if any).
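
For illustration, here is a minimal sketch of that fallback idea (it is not the actual diff in this PR; it assumes PyTables is installed and relies on the 'pandas_type' attribute that pandas writes on its groups to tell the two kinds of nodes apart):

```python
from typing import List

import tables  # PyTables, the library HDFStore builds on


def list_keys(path: str) -> List[str]:
    """Sketch only: prefer pandas-style groups; if there are none,
    fall back to the native HDF5 tables in the file."""
    with tables.open_file(path, mode="r") as h5:
        # groups written by pandas carry a 'pandas_type' attribute
        pandas_keys = [
            g._v_pathname
            for g in h5.walk_groups()
            if getattr(g._v_attrs, "pandas_type", None) is not None
        ]
        if pandas_keys:
            return pandas_keys
        # no pandas-written groups: list native PyTables Table nodes instead
        return [n._v_pathname for n in h5.walk_nodes("/", classname="Table")]
```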

@roberthdevries force-pushed the fix-empty-hdfstore-keys branch from 3c7feb3 to 22f8d9e on March 3, 2020 21:03
@@ -590,16 +590,35 @@ def __enter__(self):
def __exit__(self, exc_type, exc_value, traceback):
self.close()

-    def keys(self) -> List[str]:
+    def keys(self, kind: Optional[str] = "pandas") -> List[str]:

Hi,
Thanks very much for putting it together!
Why not make "tables" the default and worry about the performance later? (Some "tables"/"table" mixup in the docs btw.)
dask, for example, would have to use the keyword to work properly, which would make it incompatible with any pandas < 1.0.x that doesn't have this keyword.
Given that, I am not so sure anymore that it should be solved here and in this way; it is an API change that will most certainly break "userspace", but I'd leave that to the maintainers.

@roberthdevries (Contributor, Author)

Making "tables" the default would probably break the API, but I can do some experiments by changing it and running the unit tests to see what breaks 😄

@roberthdevries (Contributor, Author) commented Mar 4, 2020

So that did not work.
Another suggestion that was previously mentioned is to fall back to the "tables" option if the "pandas" option returns no results. This assumes that we either use an HDF5 file that was produced by pandas and only has pandas style tables, or that we read in an HDF5 file that was produced by another application and has HDF5 style tables.
I think that in this way we stay backwards compatible and we do not have the performance issue with the pandas style tables.
I have tried this and it does not break existing unit tests in the pytables test set.

Hi,
Again, thanks for taking it on and coming up with a solution.
Looks good to me, so we'll see.
Cheers.

@jamesbooker commented Jun 5, 2020

I've just come across this as it's stopping me from working with HDF5 files not produced by pandas.

My issue: how will this cope with files that were produced in some upstream system conforming to the HDF5 specification, and then had data added to them with pandas (so you have a mix)?

Surely the real fix is for pandas (or, more specifically, HDFStore) to conform to the HDF5 specification (so files work with h5py).

Contributor

pandas has a specific format of meta-data on top of HDF5. This is not likely to change.
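
For illustration, this is the kind of metadata being referred to; a short sketch (the file name is an assumption, and the exact attribute values depend on pandas internals):

```python
import pandas as pd

# Write a tiny frame, then inspect the attributes pandas adds to the group.
pd.DataFrame({"a": [1, 2, 3]}).to_hdf("pandas_example.h5", key="df")

with pd.HDFStore("pandas_example.h5", mode="r") as store:
    group = store.get_node("df")           # the underlying PyTables group
    print(group._v_attrs.pandas_type)      # e.g. 'frame'
    print(group._v_attrs.pandas_version)   # format version written by pandas
```

A group created by h5py or plain PyTables carries none of these attributes, which is why HDFStore does not report it as a pandas object.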

assert len(store.keys(kind="tables")) == 3
expected = {"/group/table1", "/group/table2", "/group/table3"}
assert set(store.keys(kind="tables")) == expected
assert set(store) == set()

Shouldn't that be

assert set(store.keys()) == set()

?

@roberthdevries (Contributor, Author)

This form is equivalent, as the store implements the __iter__ method, which just returns an iterator over keys().
I will change it to be explicit.
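
A simplified sketch of the equivalence being described (illustration only, not the actual pandas source):

```python
# Why set(store) == set(store.keys()): iteration just delegates to keys().
class StoreSketch:
    def __init__(self, keys):
        self._keys = list(keys)

    def keys(self):
        return list(self._keys)

    def __iter__(self):
        # HDFStore.__iter__ behaves like this: iterate over keys()
        return iter(self.keys())


store = StoreSketch(["/group/table1", "/group/table2"])
assert set(store) == set(store.keys())
```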

Thanks for the explanation, that makes sense indeed.

@roberthdevries force-pushed the fix-empty-hdfstore-keys branch from 839db77 to 7adcafb on March 4, 2020 13:17
@@ -593,13 +593,20 @@ def __exit__(self, exc_type, exc_value, traceback):
def keys(self) -> List[str]:
"""
Return a list of keys corresponding to objects stored in HDFStore.
If the store contains pandas native tables, it will return their names.

@jreback (Contributor)

This is not a good API.

Either:
a) always return all tables (API-breaking), or
b) add a keyword include='pandas'|'native'|'all'; the default would be 'pandas'.
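
For illustration, option (b) might look like this in use (hypothetical: the keyword name and values are taken from the suggestion above and were not part of the pandas API at the time of this PR; "data.h5" is an assumed example file):

```python
import pandas as pd

store = pd.HDFStore("data.h5", mode="r")
store.keys()                   # default include='pandas': pandas objects only
store.keys(include="native")   # native (PyTables) tables only
store.keys(include="all")      # both kinds
store.close()
```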

Note I am not the PR author but reported the issue.

It really depends on what you want to achieve. IMHO, better compatibility would be achieved with option (a), which basically reverts #21543 but only for keys(), leaving groups() untouched. I don't know whether that breaks anything else, but it might hurt performance in some cases, although the original bug #21372 complained about the performance of groups() (it links to #17593 about slow keys()). There is also a walk() method which is very similar to groups().

(b) would break backwards compatibility with pandas < 1.0.x for all (future) projects that use that keyword.

@roberthdevries (Contributor, Author)

The latest version is trying to keep some backwards compatibility with pandas 0.23 (plus or minus a minor revision). Apparently there are some tools that depend on the original behavior (hence the issue which this PR is trying to fix). So either we keep breaking backward compatibility or we add a keyword.
I think this is a matter of taste/policy.

@roberthdevries (Contributor, Author)

@jreback I think this is a trade-off between having backwards-compatible behavior with v0.23.x vs. a clean API.
I see one vote from @st-bender for being backwards compatible, and I see one vote from @jreback for the clean API.
Not sure whose vote has more weight (😄 but I can make a guess), but I have no strong feelings either way.

Thanks @roberthdevries for pushing this forward, and @jreback, thanks for the feedback.
As I see it there is no fully compatible way to change that behaviour back to <=0.23.x, and apparently I am the only one who noticed it, when dask.dataframe.read_hdf() suddenly failed to read some files that worked before (dask/dask#5934).

It seems that dask.dataframe is currently the only serious "user" of that interface, and I filed dask/dask#5934 to decide whether it is better handled within dask instead of pandas. The change there would be similarly simple and would probably be more backwards compatible with older pandas versions.

@jreback added the IO HDF5 (read_hdf, HDFStore) label on Mar 4, 2020
@roberthdevries force-pushed the fix-empty-hdfstore-keys branch 2 times, most recently from 91eb111 to 9fd51cb on March 14, 2020 13:23
@jreback (Contributor) commented Mar 14, 2020

is this better handled upstream in dask? I am ok with a keyword to .keys but cannot change the behavior like this.

An additional kind parameter has been added that defaults to pandas' original behavior, but with the 'tables' value gives you the list of non-pandas tables in the file.
- improve type annotation of the HDFStore.keys method
- minor improvement in ValueError string
- minor improvement in doc-string
- first try new behavior
- if no result return native HDF5 Tables
@roberthdevries force-pushed the fix-empty-hdfstore-keys branch from 9fd51cb to a903c6e on March 15, 2020 07:25
@roberthdevries (Contributor, Author)

I have created a new PR #32723 which has the extra 'include' keyword.
I haven't added the 'all' value, as mixing pandas tables and native tables in one file is an anti-pattern (@jreback's words), which should not be encouraged.

@st-bender

> is this better handled upstream in dask?

I'll try to find out; there was not a lot of response so far, so I'll ping and ask.

> I am ok with a keyword to .keys but cannot change the behavior like this.

I understand, and I am with you wrt not changing the API behaviour.

@jreback (Contributor) commented Jun 9, 2020

I am not sure it is worthwhile to change anything here. The fact that HDFStore doesn't recognize non-pandas tables is a non-goal here.

@jamesbooker commented Jun 10, 2020

> I am not sure it is worthwhile to change anything here. The fact that HDFStore doesn't recognize non-pandas tables is a non-goal here.

This is a disappointing attitude, because this used to work. It's not an enhancement, but a regression introduced by an 'optimisation'.

Groups and keys aren't a feature of pandas, but a feature of the HDF5 standard. If pandas is going to claim support for the HDF5 file format, then it should support the whole format. This isn't even restricted to the dataframe/tables level. You're basically stating the equivalent of 'supporting files stored in a directory not created by pandas is a non-goal'.

At the dataframe level, I don't understand why you'd not want the ability to load data from a non-pandas table and create a dataframe from it - pandas isn't the only library out there for working with HDF5, and it's safe to assume not all users will be working with data that came out of pandas in the first place (got my hand up here).

I do, however, understand why you'd not want to be converting back from a dataframe to a 'non-pandas table' - being able to read data from a non-pandas HDF5 file but only write pandas tables makes some kind of sense.

If pandas is insistent on following this line of thought, then maybe you could at least update the documentation to declare that HDF5 files not written with pandas won't work with pandas (or declare a derivative file format, called PHDF5 or something, and very specifically define what's supported and what's not).

Just one person's opinion.

EDIT: I've just noticed that the docs for read_hdf do say 'retrieve a pandas object stored in a file', so my comment about the docs above is only half-relevant.

@st-bender

> I am not sure it is worthwhile to change anything here. The fact that HDFStore doesn't recognize non-pandas tables is a non-goal here.

> This is a disappointing attitude, because this used to work. It's not an enhancement, but a regression introduced by an 'optimisation'.

> Groups and keys aren't a feature of pandas, but a feature of the HDF5 standard. If pandas is going to claim support for the HDF5 file format, then it should support the whole format. This isn't even restricted to the dataframe/tables level. You're basically stating the equivalent of 'supporting files stored in a directory not created by pandas is a non-goal'.

> At the dataframe level, I don't understand why you'd not want the ability to load data from a non-pandas table and create a dataframe from it - pandas isn't the only library out there for working with HDF5, and it's safe to assume not all users will be working with data that came out of pandas in the first place (got my hand up here).

Sorry to jump in here, but importing non-pandas files should work just fine and was not the issue here. It is just not possible to list all the keys/groups with standard pandas anymore, which led to downstream problems, most notably with dask. You can have a look at the PR here or at the linked dask issue on how to list all of them with pytables.

If you have trouble importing hdf5 files in general, then that may be a separate issue.

> [...]

> If pandas is insistent on following this line of thought, then maybe you could at least update the documentation to declare that HDF5 files not written with pandas won't work with pandas (or declare a derivative file format, called PHDF5 or something, and very specifically define what's supported and what's not).

Yeah, I cannot understand that either, but it is the developers' decision to cripple hdf5 support to pandas-specific files.

> Just one person's opinion.

> EDIT: I've just noticed that the docs for read_hdf do say 'retrieve a pandas object stored in a file', so my comment about the docs above is only half-relevant.

That would be a really bad decision indeed.

@jamesbooker

> Sorry to jump in here,

No need to apologise. It is a discussion, after all.

> but importing non-pandas files should work just fine and was not the issue here. It is just not possible to list all the keys/groups with standard pandas anymore, which led to downstream problems, most notably with dask. You can have a look at the PR here or at the linked dask issue on how to list all of them with pytables.

The problem isn't just with dask (I'm not using dask). I am trying to import data from non-pandas HDF5 files, to be put into a dataframe, which would then be exported to pandas HDF5 files. My issue was with not being able to list the keys/groups - I have no control of the format of the data coming in, and I need to 'walk' the groups to figure out what's in the file (the file format has a versioning system in it, so I need to walk the groups to figure out how many there are, and where to pull my specific dataset from) - so I guess you could say I have the same issue as dask. They've implemented a workaround which doesn't use the public API of pandas (by drilling down directly into pytables).

> If you have trouble importing hdf5 files in general, then that may be a separate issue.

No, importing the data from a pandas file works fine; my issue is with groups/keys also.

> [...]

> If pandas is insistent on following this line of thought, then maybe you could at least update the documentation to declare that HDF5 files not written with pandas won't work with pandas (or declare a derivative file format, called PHDF5 or something, and very specifically define what's supported and what's not).

> Yeah, I cannot understand that either, but it is the developers' decision to cripple hdf5 support to pandas-specific files.

The biggest issue I have is the 'we are not adding new methods' statement. I understand pandas wants a clean API, and certainly doesn't want to roll back a change made so long ago. What I can't fathom is why pandas couldn't have a separate raw_keys() and raw_groups() method in addition to the keys() and groups() methods. This wouldn't break the existing API, and would allow us to still walk these groups while still keeping in place this 'optimisation'.

> Just one person's opinion.
> EDIT: I've just noticed that the docs for read_hdf do say 'retrieve a pandas object stored in a file', so my comment about the docs above is only half-relevant.

> That would be a really bad decision indeed.

What would be a bad decision? I don't understand this last statement.

Thanks for your insight.

@st-bender

> Sorry to jump in here,

> No need to apologise. It is a discussion, after all.

I believe that the discussion should take place at the issue, not here.

> but importing non-pandas files should work just fine and was not the issue here. It is just not possible to list all the keys/groups with standard pandas anymore, which led to downstream problems, most notably with dask. You can have a look at the PR here or at the linked dask issue on how to list all of them with pytables.

> The problem isn't just with dask (I'm not using dask). I am trying to import data from non-pandas HDF5 files, to be put into a dataframe, which would then be exported to pandas HDF5 files. My issue was with not being able to list the keys/groups - I have no control of the format of the data coming in, and I need to 'walk' the groups to figure out what's in the file (the file format has a versioning system in it, so I need to walk the groups to figure out how many there are, and where to pull my specific dataset from) - so I guess you could say I have the same issue as dask. They've implemented a workaround which doesn't use the public API of pandas (by drilling down directly into pytables).

Probably then dask may actually be the right tool; it allows wildcards in the key argument. Otherwise you could probably use plain pytables in the first round to get the keys, and then pandas to read them. I don't think that the issue will be fixed in pandas.
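
A sketch of that two-step workaround (the file name and the restriction to Table nodes are assumptions for illustration):

```python
import pandas as pd
import tables

path = "upstream.h5"  # an HDF5 file not written by pandas

# First pass: list the table paths with plain PyTables.
with tables.open_file(path, mode="r") as h5:
    keys = [node._v_pathname for node in h5.walk_nodes("/", classname="Table")]

# Second pass: read the ones you need with pandas, using explicit keys.
frames = {key: pd.read_hdf(path, key=key) for key in keys}
```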

> [...]

> If pandas is insistent on following this line of thought, then maybe you could at least update the documentation to declare that HDF5 files not written with pandas won't work with pandas (or declare a derivative file format, called PHDF5 or something, and very specifically define what's supported and what's not).

> Yeah, I cannot understand that either, but it is the developers' decision to cripple hdf5 support to pandas-specific files.

> The biggest issue I have is the 'we are not adding new methods' statement. I understand pandas wants a clean API, and certainly doesn't want to roll back a change made so long ago. What I can't fathom is why pandas couldn't have a separate raw_keys() and raw_groups() method in addition to the keys() and groups() methods. This wouldn't break the existing API, and would allow us to still walk these groups while still keeping in place this 'optimisation'.

I understand the intention not to change the API. HDF5 is apparently a very minor part, such that nobody really bothers. The best would have been to revert the change and admit that it was screwed up. Would have saved a lot of developer resources.

> Just one person's opinion.
> EDIT: I've just noticed that the docs for read_hdf do say 'retrieve a pandas object stored in a file', so my comment about the docs above is only half-relevant.

> That would be a really bad decision indeed.

> What would be a bad decision? I don't understand this last statement.

Sorry, my comment was about read_hdf only supporting pandas objects, which would be a really bad choice imo.

I think it is best to continue at the issue, and not fill this pull request further.

@jreback (Contributor) commented Jun 14, 2020

OK, I think I commented elsewhere that supporting .keys(nodes='pandas|all') would be OK (and if .groups needs this keyword, then OK too).

@jreback (Contributor) commented Jun 14, 2020

Prefer the solution in #32723.

@jreback closed this on Jun 14, 2020
Labels: API Design, IO HDF5 (read_hdf, HDFStore)

Successfully merging this pull request may close these issues.

HDF5: empty groups and keys