Automatic detection of HDF5 dataset identifier fails when data contains categoricals #13231
Comments
Cc @laufere.
yeah this detection needs to be a bit smarter to consider the uniques of the top-level groups (rather than just multiple keys). should be a straightforward fix.
pull-requests welcome!
@jreback While looking at the code, it seems that in such a case the list returned by store.keys() is empty, which causes it to produce an error.
Running the above code yields the output shown above. Now if we compare this with an HDF5 file which actually has two datasets, we get something like this:
As we can see, if there are actually two datasets, instead of meta (Group) we get /data2 (Group), where data2 is the key provided when writing to the file.
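The distinction above can be captured with a simple path check (a sketch only; the helper name is made up, and the paths are the ones described in this thread): a nested path like /data/meta/values_block_1/meta still belongs to the single top-level group data, while /data2 is a genuinely separate top-level group.

```python
def top_level_groups(paths):
    """Return the sorted unique top-level group names from HDF5 path strings."""
    return sorted({p.strip("/").split("/")[0] for p in paths})

# One dataframe with categoricals: the metadata is nested under /data
print(top_level_groups(["/data", "/data/meta/values_block_1/meta"]))  # ['data']

# Two actual datasets: two distinct top-level groups
print(top_level_groups(["/data", "/data2"]))  # ['data', 'data2']
```

A detection based on counting these top-level names, rather than counting all keys, would treat the first case as a single dataset.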
@jreback One more thing comes to mind along the lines of what you suggested.
Now we can use these keys to get the unique values.
We can see that the uniques produced by giving /data/meta/values_block_1/meta as a key are a subset of those produced by /data. But if we go down this road we will also have to consider the key name when making a decision, because there might be two dataframes in the HDF5 file where the uniques of one are a subset of the other's. Am I missing something here?
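The worry above can be shown without HDF5 at all (a sketch with made-up data): nothing distinguishes "metadata whose uniques are a subset" from a second, unrelated dataframe whose uniques merely happen to be a subset.

```python
import pandas as pd

df1 = pd.DataFrame({"c": ["a", "b", "c", "a"]})
df2 = pd.DataFrame({"c": ["a", "b"]})  # a real dataset, not metadata of df1

# The subset test is satisfied even though df2 is not metadata:
is_subset = set(df2["c"].unique()) <= set(df1["c"].unique())
print(is_subset)  # True
```

So a subset-of-uniques check alone cannot tell the two cases apart, which is why the thread moves toward looking at the group structure instead.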
@chrish42 you can just iterate over the groups with tables. look at how
@jreback did you want to refer to me or to chrish42 only? |
oh sorry meant that as a general comment |
@jreback Using the approach you suggested, I can get the key names and then use them to get the individual tables. But, as I asked in an earlier comment: even if we get the unique values from both tables, how can we be certain that one of them holds meta information just because its unique values are a subset of the other's?
@jreback any comments? Can you guide me in the right direction?
@pfrcks just look at the top-level groups. |
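One way to look only at the top-level groups (a sketch, assuming PyTables is installed; the filename is made up, and reaching through the private _handle attribute is illustrative, not an official API):

```python
import os
import tempfile
import pandas as pd

# Hypothetical file holding a single dataframe with a categorical column
path = os.path.join(tempfile.mkdtemp(), "one_frame.h5")
pd.DataFrame({"x": pd.Categorical(["a", "b"])}).to_hdf(path, key="data", format="table")

with pd.HDFStore(path, mode="r") as store:
    # _v_children on the PyTables root node maps top-level group names to
    # nodes; the categorical metadata lives *inside* /data, so only one
    # name appears here even though the file contains several groups.
    top_level = list(store._handle.root._v_children)

print(top_level)  # ['data']
```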
@jreback I'm working on this (alone so far) during the PyCon sprints. So far, I've set up a development environment and added a test that fails. I have a couple of questions. First, should the metadata (like the categories, etc.) be hidden from the user by HDFStore or not? (i.e. the keys(), groups(), etc. methods don't show the metadata table.) And second, is there a way to know, from the attributes or otherwise, that a table is a metadata table? What would be the best way to do this? I see that the HDFStore.groups() method already does a bunch of filtering out. Not sure what the best way to do this for categorical metadata is...
we don't currently hide the metadata from the main display. It's prob ok to hide it (though do that after). when you are iterating over groups, you can tell if there is meta data by seeing if you are on a
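A crude version of that check (a sketch of the idea only, not pandas' actual implementation; the helper name is invented): treat a group as metadata if any component of its path is named meta, which is where pandas nests the categorical metadata in the examples above.

```python
def looks_like_metadata(path):
    """Heuristic sketch: categorical metadata is nested under 'meta' subgroups."""
    return "meta" in path.strip("/").split("/")

print(looks_like_metadata("/data/meta/values_block_1/meta"))  # True
print(looks_like_metadata("/data"))                           # False
print(looks_like_metadata("/data2"))                          # False
```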
Let me know what you think of that pull request. Should I open a separate bug to hide the metadata? Also, while reading the tests for pytables IO, I noticed an (old?)
We use HDF5 to store our pandas dataframes on disk. We only store one dataframe per HDF5 file, so the feature of pandas.read_hdf() that allows omitting the key when an HDF5 file contains a single pandas object is very nice for our workflow.
However, said feature doesn't work when the dataframe saved contains one or more categorical columns:
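A minimal reproduction along these lines (a sketch; the temporary filename is made up, and PyTables must be installed). On affected pandas versions, calling pd.read_hdf(path) without a key raises even though the file holds a single dataframe, so the example below reads back with an explicit key, which always works:

```python
import os
import tempfile
import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "one_frame.h5")
df = pd.DataFrame({"x": pd.Categorical(["a", "b", "a"])})
df.to_hdf(path, key="data", format="table")

# On affected versions, pd.read_hdf(path) with no key raises here, because
# the categorical metadata groups confuse the single-dataset detection.
# Passing the key explicitly sidesteps the detection entirely:
result = pd.read_hdf(path, key="data")
print(result["x"].dtype)  # category
```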
It looks like this is because pandas.read_hdf() doesn't ignore the metadata used to store the categorical codes:
It'd be nice if this feature worked even when some of the columns are categorical. It should be possible to ignore the metadata that pandas creates when checking whether there is only one dataset stored, no?