HDFStore: Fix empty result of keys() method on non-pandas hdf5 file #32401

@roberthdevries (Contributor) commented Mar 2, 2020

First, try to get pandas-style tables; if there are none, return the list of non-pandas (HDF5-native) tables in the file (if any).
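
For illustration, here is a minimal sketch of that fallback idea (it is not the actual diff in this PR; it assumes PyTables is installed and relies on the 'pandas_type' attribute that pandas writes on its groups to tell the two kinds of nodes apart):

```python
from typing import List

import tables  # PyTables, the library HDFStore builds on


def list_keys(path: str) -> List[str]:
    """Sketch only: prefer pandas-style groups; if there are none,
    fall back to the native HDF5 tables in the file."""
    with tables.open_file(path, mode="r") as h5:
        # groups written by pandas carry a 'pandas_type' attribute
        pandas_keys = [
            g._v_pathname
            for g in h5.walk_groups()
            if getattr(g._v_attrs, "pandas_type", None) is not None
        ]
        if pandas_keys:
            return pandas_keys
        # no pandas-written groups: list native PyTables Table nodes instead
        return [n._v_pathname for n in h5.walk_nodes("/", classname="Table")]
```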

@roberthdevries force-pushed the fix-empty-hdfstore-keys branch from 3c7feb3 to 22f8d9e on March 3, 2020 21:03
@@ -590,16 +590,35 @@ def __enter__(self):
def __exit__(self, exc_type, exc_value, traceback):
self.close()

-    def keys(self) -> List[str]:
+    def keys(self, kind: Optional[str] = "pandas") -> List[str]:

Hi,
Thanks very much for putting it together!
Why not make "tables" the default and worry about the performance later? (Some "tables"/"table" mixup in the docs btw.)
dask, for example, would have to use the keyword to work properly, which would make it incompatible with any pandas < 1.0.x that doesn't have this keyword.
Given that, I am not so sure anymore that it should be solved here and in this way; it is an API change that will most certainly break "userspace", but I'd leave that to the maintainers.

@roberthdevries (Contributor, Author)

Making "tables" the default would probably break the API, but I can do some experiments by changing it and running the unit tests to see what breaks 😄

@roberthdevries (Contributor, Author) commented Mar 4, 2020

So that did not work.
Another suggestion that was previously mentioned is to fall back to the "tables" option if the "pandas" option returns no results. This assumes that we either use an HDF5 file that was produced by pandas and only has pandas style tables, or that we read in an HDF5 file that was produced by another application and has HDF5 style tables.
I think that in this way we stay backwards compatible and we do not have the performance issue with the pandas style tables.
I have tried this and it does not break existing unit tests in the pytables test set.

Hi,
Again, thanks for taking it on and coming up with a solution.
Looks good to me, so we'll see.
Cheers.

@jamesbooker commented Jun 5, 2020

I've just come across this as it's stopping me from working with HDF5 files not produced by pandas.

My issue: how will this cope with files that were produced in some upstream system conforming to the HDF5 specification, and then had data added to them with pandas (so you have a mix)?

Surely the real fix is for pandas (or, more specifically, HDFStore) to conform to the HDF5 specification (so files work with h5py).

Contributor

pandas has a specific format of meta-data on top of HDF5. This is not likely to change.
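
For illustration, this is the kind of metadata being referred to; a short sketch (the file name is an assumption, and the exact attribute values depend on pandas internals):

```python
import pandas as pd

# Write a tiny frame, then inspect the attributes pandas adds to the group.
pd.DataFrame({"a": [1, 2, 3]}).to_hdf("pandas_example.h5", key="df")

with pd.HDFStore("pandas_example.h5", mode="r") as store:
    group = store.get_node("df")           # the underlying PyTables group
    print(group._v_attrs.pandas_type)      # e.g. 'frame'
    print(group._v_attrs.pandas_version)   # format version written by pandas
```

A group created by h5py or plain PyTables carries none of these attributes, which is why HDFStore does not report it as a pandas object.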

assert len(store.keys(kind="tables")) == 3
expected = {"/group/table1", "/group/table2", "/group/table3"}
assert set(store.keys(kind="tables")) == expected
assert set(store) == set()

Shouldn't that be

assert set(store.keys()) == set()

?

@roberthdevries (Contributor, Author)

This form is equivalent, as the store implements the __iter__ method, which just returns an iterator over keys().
I will change it to be explicit.
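
A simplified sketch of the equivalence being described (illustration only, not the actual pandas source):

```python
# Why set(store) == set(store.keys()): iteration just delegates to keys().
class StoreSketch:
    def __init__(self, keys):
        self._keys = list(keys)

    def keys(self):
        return list(self._keys)

    def __iter__(self):
        # HDFStore.__iter__ behaves like this: iterate over keys()
        return iter(self.keys())


store = StoreSketch(["/group/table1", "/group/table2"])
assert set(store) == set(store.keys())
```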

Thanks for the explanation, that makes sense indeed.

@roberthdevries force-pushed the fix-empty-hdfstore-keys branch from 839db77 to 7adcafb on March 4, 2020 13:17
@@ -593,13 +593,20 @@ def __exit__(self, exc_type, exc_value, traceback):
def keys(self) -> List[str]:
"""
Return a list of keys corresponding to objects stored in HDFStore.
If the store contains pandas native tables, it will return their names.

@jreback (Contributor)

This is not a good API.

Either:
a) always return all tables (API-breaking), or
b) add a keyword include='pandas'|'native'|'all'; the default would be 'pandas'.
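
For illustration, option (b) might look like this in use (hypothetical: the keyword name and values are taken from the suggestion above and were not part of the pandas API at the time of this PR; "data.h5" is an assumed example file):

```python
import pandas as pd

store = pd.HDFStore("data.h5", mode="r")
store.keys()                   # default include='pandas': pandas objects only
store.keys(include="native")   # native (PyTables) tables only
store.keys(include="all")      # both kinds
store.close()
```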

Note I am not the PR author but reported the issue.

It really depends on what you want to achieve. IMHO, better compatibility would be achieved with option (a), which basically reverts #21543 but only for keys(), leaving groups() untouched. I don't know whether that breaks anything else, but it might hurt performance in some cases, although the original bug #21372 complained about the performance of groups() (it links to #17593 about slow keys()). There is also a walk() method which is very similar to groups().

(b) would break backwards compatibility with pandas < 1.0.x for all (future) projects that use that keyword.

@roberthdevries (Contributor, Author)

The latest version is trying to keep some backwards compatibility with pandas 0.23 (plus or minus a minor revision). Apparently there are some tools that depend on the original behavior (hence the issue which this PR is trying to fix). So either we keep breaking backward compatibility or we add a keyword.
I think this is a matter of taste/policy.

@roberthdevries (Contributor, Author)

@jreback I think this is a trade-off between having backwards-compatible behavior with v0.23.x vs. a clean API.
I see one vote from @st-bender for being backwards compatible, and I see one vote from @jreback for the clean API.
Not sure whose vote has more weight (😄 but I can make a guess), but I have no strong feelings either way.

Thanks @roberthdevries for pushing this forward, and @jreback, thanks for the feedback.
As I see it there is no fully compatible way to change that behaviour back to <=0.23.x, and apparently I am the only one who noticed it, when dask.dataframe.read_hdf() suddenly failed to read some files that worked before (dask/dask#5934).

It seems that dask.dataframe is currently the only serious "user" of that interface, and I filed dask/dask#5934 to decide whether it is better handled within dask instead of pandas. The change there would be similarly simple and would probably be more backwards compatible with older pandas versions.

@jreback added the IO HDF5 (read_hdf, HDFStore) label on Mar 4, 2020
@roberthdevries force-pushed the fix-empty-hdfstore-keys branch 2 times, most recently from 91eb111 to 9fd51cb on March 14, 2020 13:23
@jreback (Contributor) commented Mar 14, 2020

is this better handled upstream in dask? I am ok with a keyword to .keys but cannot change the behavior like this.

An additional kind parameter has been added that defaults to pandas' original behavior, but with the 'tables' value gives you the list of non-pandas tables in the file.
- improve type annotation of the HDFStore.keys method
- minor improvement in ValueError string
- minor improvement in doc-string
- first try new behavior
- if no result return native HDF5 Tables
@roberthdevries force-pushed the fix-empty-hdfstore-keys branch from 9fd51cb to a903c6e on March 15, 2020 07:25
@roberthdevries (Contributor, Author)

I have created a new PR #32723 which has the extra 'include' keyword.
I haven't added the 'all' value, as mixing pandas tables and native tables in one file is an anti-pattern (@jreback's words), which should not be encouraged.

@st-bender

> is this better handled upstream in dask?

I'll try to find out; there was not a lot of response so far, so I'll ping and ask.

> I am ok with a keyword to .keys but cannot change the behavior like this.

I understand, and I am with you wrt not changing the API behaviour.

@jreback (Contributor) commented Jun 9, 2020

I am not sure it is worthwhile to change anything here. The fact that HDFStore doesn't recognize non-pandas tables is a non-goal here.

@jamesbooker commented Jun 10, 2020

> I am not sure it is worthwhile to change anything here. The fact that HDFStore doesn't recognize non-pandas tables is a non-goal here.

This is a disappointing attitude, because this used to work. It's not an enhancement, but a regression introduced by an 'optimisation'.

Groups and keys aren't a feature of pandas, but a feature of the HDF5 standard. If pandas is going to claim support for the HDF5 file format, then it should support the whole format. This isn't even restricted to the dataframe/tables level. You're basically stating the equivalent of 'supporting files stored in a directory not created by pandas is a non-goal'.

At the dataframe level, I don't understand why you'd not want the ability to load data from a non-pandas table and create a dataframe from it - pandas isn't the only library out there for working with HDF5, and it's safe to assume not all users will be working with data that came out of pandas in the first place (got my hand up here).

I do, however, understand why you'd not want to be converting back from a dataframe to a 'non-pandas table' - being able to read data from a non-pandas HDF5 file but only write pandas tables makes some kind of sense.

If pandas is insistent on following this line of thought, then maybe you could at least update the documentation to declare that HDF5 files not written with pandas won't work with pandas (or declare a derivative file format, called PHDF5 or something, and very specifically define what's supported and what's not).

Just one person's opinion.

EDIT: I've just noticed that the docs for read_hdf do say 'retrieve a pandas object stored in a file', so my comment about the docs above is only half-relevant.

@st-bender

> I am not sure it is worthwhile to change anything here. The fact that HDFStore doesn't recognize non-pandas tables is a non-goal here.

> This is a disappointing attitude, because this used to work. It's not an enhancement, but a regression introduced by an 'optimisation'.

> Groups and keys aren't a feature of pandas, but a feature of the HDF5 standard. If pandas is going to claim support for the HDF5 file format, then it should support the whole format. This isn't even restricted to the dataframe/tables level. You're basically stating the equivalent of 'supporting files stored in a directory not created by pandas is a non-goal'.

> At the dataframe level, I don't understand why you'd not want the ability to load data from a non-pandas table and create a dataframe from it - pandas isn't the only library out there for working with HDF5, and it's safe to assume not all users will be working with data that came out of pandas in the first place (got my hand up here).

Sorry to jump in here, but importing non-pandas files should work just fine and was not the issue here. It is just not possible to list all the keys/groups with standard pandas anymore, which led to downstream problems, most notably with dask. You can have a look at the PR here or at the linked dask issue on how to list all of them with pytables.

If you have trouble importing hdf5 files in general, then that may be a separate issue.

> [...]

> If pandas is insistent on following this line of thought, then maybe you could at least update the documentation to declare that HDF5 files not written with pandas won't work with pandas (or declare a derivative file format, called PHDF5 or something, and very specifically define what's supported and what's not).

Yeah, I cannot understand that either, but it is the developers' decision to cripple hdf5 support to pandas-specific files.

> Just one person's opinion.

> EDIT: I've just noticed that the docs for read_hdf do say 'retrieve a pandas object stored in a file', so my comment about the docs above is only half-relevant.

That would be a really bad decision indeed.

@jamesbooker

> Sorry to jump in here,

No need to apologise. It is a discussion, after all.

> but importing non-pandas files should work just fine and was not the issue here. It is just not possible to list all the keys/groups with standard pandas anymore, which led to downstream problems, most notably with dask. You can have a look at the PR here or at the linked dask issue on how to list all of them with pytables.

The problem isn't just with dask (I'm not using dask). I am trying to import data from non-pandas HDF5 files, to be put into a dataframe, which would then be exported to pandas HDF5 files. My issue was with not being able to list the keys/groups - I have no control of the format of the data coming in, and I need to 'walk' the groups to figure out what's in the file (the file format has a versioning system in it, so I need to walk the groups to figure out how many there are, and where to pull my specific dataset from) - so I guess you could say I have the same issue as dask. They've implemented a workaround which doesn't use the public API of pandas (by drilling down directly into pytables).

> If you have trouble importing hdf5 files in general, then that may be a separate issue.

No, importing the data from a pandas file works fine; my issue is with groups/keys also.

> [...]

> If pandas is insistent on following this line of thought, then maybe you could at least update the documentation to declare that HDF5 files not written with pandas won't work with pandas (or declare a derivative file format, called PHDF5 or something, and very specifically define what's supported and what's not).

> Yeah, I cannot understand that either, but it is the developers' decision to cripple hdf5 support to pandas-specific files.

The biggest issue I have is the 'we are not adding new methods' statement. I understand pandas wants a clean API, and certainly doesn't want to roll back a change made so long ago. What I can't fathom is why pandas couldn't have a separate raw_keys() and raw_groups() method in addition to the keys() and groups() methods. This wouldn't break the existing API, and would allow us to still walk these groups while still keeping in place this 'optimisation'.

> Just one person's opinion.
> EDIT: I've just noticed that the docs for read_hdf do say 'retrieve a pandas object stored in a file', so my comment about the docs above is only half-relevant.

> That would be a really bad decision indeed.

What would be a bad decision? I don't understand this last statement.

Thanks for your insight.

@st-bender

> Sorry to jump in here,

> No need to apologise. It is a discussion, after all.

I believe that the discussion should take place at the issue, not here.

> but importing non-pandas files should work just fine and was not the issue here. It is just not possible to list all the keys/groups with standard pandas anymore, which led to downstream problems, most notably with dask. You can have a look at the PR here or at the linked dask issue on how to list all of them with pytables.

> The problem isn't just with dask (I'm not using dask). I am trying to import data from non-pandas HDF5 files, to be put into a dataframe, which would then be exported to pandas HDF5 files. My issue was with not being able to list the keys/groups - I have no control of the format of the data coming in, and I need to 'walk' the groups to figure out what's in the file (the file format has a versioning system in it, so I need to walk the groups to figure out how many there are, and where to pull my specific dataset from) - so I guess you could say I have the same issue as dask. They've implemented a workaround which doesn't use the public API of pandas (by drilling down directly into pytables).

Probably then dask may actually be the right tool; it allows wildcards in the key argument. Otherwise you could probably use plain pytables in the first round to get the keys, and then pandas to read them. I don't think that the issue will be fixed in pandas.
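
A sketch of that two-step workaround (the file name and the restriction to Table nodes are assumptions for illustration):

```python
import pandas as pd
import tables

path = "upstream.h5"  # an HDF5 file not written by pandas

# First pass: list the table paths with plain PyTables.
with tables.open_file(path, mode="r") as h5:
    keys = [node._v_pathname for node in h5.walk_nodes("/", classname="Table")]

# Second pass: read the ones you need with pandas, using explicit keys.
frames = {key: pd.read_hdf(path, key=key) for key in keys}
```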

> [...]

> If pandas is insistent on following this line of thought, then maybe you could at least update the documentation to declare that HDF5 files not written with pandas won't work with pandas (or declare a derivative file format, called PHDF5 or something, and very specifically define what's supported and what's not).

> Yeah, I cannot understand that either, but it is the developers' decision to cripple hdf5 support to pandas-specific files.

> The biggest issue I have is the 'we are not adding new methods' statement. I understand pandas wants a clean API, and certainly doesn't want to roll back a change made so long ago. What I can't fathom is why pandas couldn't have a separate raw_keys() and raw_groups() method in addition to the keys() and groups() methods. This wouldn't break the existing API, and would allow us to still walk these groups while still keeping in place this 'optimisation'.

I understand the intention not to change the API. HDF5 is apparently a very minor part, such that nobody really bothers. The best would have been to revert the change and admit that it was screwed up. Would have saved a lot of developer resources.

> Just one person's opinion.
> EDIT: I've just noticed that the docs for read_hdf do say 'retrieve a pandas object stored in a file', so my comment about the docs above is only half-relevant.

> That would be a really bad decision indeed.

> What would be a bad decision? I don't understand this last statement.

Sorry, my comment was about read_hdf only supporting pandas objects, which would be a really bad choice imo.

I think it is best to continue at the issue, and not fill this pull request further.

@jreback (Contributor) commented Jun 14, 2020

OK, I think I commented elsewhere that supporting .keys(nodes='pandas|all') would be OK (and if .groups needs this keyword, then OK too).

@jreback (Contributor) commented Jun 14, 2020

Prefer the solution in #32723.

@jreback closed this on Jun 14, 2020
Labels: API Design, IO HDF5 (read_hdf, HDFStore)

Successfully merging this pull request may close these issues.

HDF5: empty groups and keys