
DOC: HDF5 External Compatibility section examples don't work #35419


Closed
ivirshup opened this issue Jul 27, 2020 · 1 comment · Fixed by #49088
Labels
Docs, IO HDF5 (read_hdf, HDFStore)

Comments

@ivirshup
Contributor

Location of the documentation

https://pandas.pydata.org/docs/user_guide/io.html#external-compatibility.

Documentation problem

This section probably has at least one typo and, more generally, doesn't seem to document current behaviour.

I'll quickly run through the example here, but with a bit of cleaning so we don't have to run the entire page.

import pandas as pd
import numpy as np

df_for_r = pd.DataFrame({"first": np.random.rand(100),
                         "second": np.random.rand(100),
                         "class": np.random.randint(0, 2, (100, ))},
                        index=range(100))


store_export = pd.HDFStore('export.h5')

# In the documentation, this is written with 'data_columns=df_dc.columns', which I'm assuming is a mistake
store_export.append('df_for_r', df_for_r, data_columns=df_for_r.columns)

store_export

We can take a look at what's in this file:

store_export.close()
!h5ls -r export.h5
Output
/                        Group
/df_for_r                Group
/df_for_r/_i_table       Group
/df_for_r/_i_table/class Group
/df_for_r/_i_table/class/abounds Dataset {0/Inf}
/df_for_r/_i_table/class/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/class/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/class/indicesLR Dataset {131072}
/df_for_r/_i_table/class/mbounds Dataset {0/Inf}
/df_for_r/_i_table/class/mranges Dataset {0/Inf}
/df_for_r/_i_table/class/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/class/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/class/sortedLR Dataset {131201}
/df_for_r/_i_table/class/zbounds Dataset {0/Inf}
/df_for_r/_i_table/first Group
/df_for_r/_i_table/first/abounds Dataset {0/Inf}
/df_for_r/_i_table/first/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/first/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/first/indicesLR Dataset {131072}
/df_for_r/_i_table/first/mbounds Dataset {0/Inf}
/df_for_r/_i_table/first/mranges Dataset {0/Inf}
/df_for_r/_i_table/first/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/first/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/first/sortedLR Dataset {131201}
/df_for_r/_i_table/first/zbounds Dataset {0/Inf}
/df_for_r/_i_table/index Group
/df_for_r/_i_table/index/abounds Dataset {0/Inf}
/df_for_r/_i_table/index/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/index/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/index/indicesLR Dataset {131072}
/df_for_r/_i_table/index/mbounds Dataset {0/Inf}
/df_for_r/_i_table/index/mranges Dataset {0/Inf}
/df_for_r/_i_table/index/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/index/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/index/sortedLR Dataset {131201}
/df_for_r/_i_table/index/zbounds Dataset {0/Inf}
/df_for_r/_i_table/second Group
/df_for_r/_i_table/second/abounds Dataset {0/Inf}
/df_for_r/_i_table/second/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/second/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/second/indicesLR Dataset {131072}
/df_for_r/_i_table/second/mbounds Dataset {0/Inf}
/df_for_r/_i_table/second/mranges Dataset {0/Inf}
/df_for_r/_i_table/second/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/second/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/second/sortedLR Dataset {131201}
/df_for_r/_i_table/second/zbounds Dataset {0/Inf}
/df_for_r/table          Dataset {200/Inf}

Next, the documentation gives an R function for reading in this data. Just comparing that function with the written file shows a mismatch:

library(rhdf5)

loadhdf5data <- function(h5File) {

listing <- h5ls(h5File)
# Find all data nodes, values are stored in *_values and corresponding column
# titles in *_items
data_nodes <- grep("_values", listing$name)
name_nodes <- grep("_items", listing$name)
data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
columns = list()
for (idx in seq(data_paths)) {
  # NOTE: matrices returned by h5read have to be transposed to obtain
  # required Fortran order!
  data <- data.frame(t(h5read(h5File, data_paths[idx])))
  names <- t(h5read(h5File, name_paths[idx]))
  entry <- data.frame(data)
  colnames(entry) <- names
  columns <- append(columns, entry)
}

data <- data.frame(columns)

return(data)
}

For example, there are no entries in export.h5 which have _values or _items in the names.
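To make the mismatch concrete, here is a minimal Python sketch of the same name matching that the R function does with grep on listing$name. The table-format paths are copied (abridged) from the h5ls output above. The fixed-format paths are an assumption about what a typical "fixed" format file contains, not output taken from this issue, and find_nodes is a hypothetical helper written only for this illustration.

```python
import re

# Node paths copied (abridged) from the `h5ls -r export.h5` listing above.
table_format_nodes = [
    "/df_for_r",
    "/df_for_r/_i_table",
    "/df_for_r/_i_table/class",
    "/df_for_r/_i_table/first",
    "/df_for_r/_i_table/index",
    "/df_for_r/_i_table/second",
    "/df_for_r/table",
]

# ASSUMPTION: illustrative node paths for a "fixed" format file.
# These are not taken from the issue; the real block layout may differ.
fixed_format_nodes = [
    "/df_for_r",
    "/df_for_r/axis0",
    "/df_for_r/axis1",
    "/df_for_r/block0_items",
    "/df_for_r/block0_values",
    "/df_for_r/block1_items",
    "/df_for_r/block1_values",
]


def find_nodes(paths, pattern):
    """Return paths whose last component matches `pattern`.

    Hypothetical helper mirroring grep(pattern, listing$name) in the R
    function, which matches on the node name rather than the full path.
    """
    return [p for p in paths if re.search(pattern, p.rsplit("/", 1)[-1])]


for label, nodes in [("table", table_format_nodes), ("fixed", fixed_format_nodes)]:
    print(label, "values:", find_nodes(nodes, "_values"))
    print(label, "items: ", find_nodes(nodes, "_items"))
```

With the table-format listing both searches come back empty, so the loop in the R function never runs and there are no columns to assemble.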

If we actually call this function, we get an empty dataframe back:

> loadhdf5data("export.h5")
data frame with 0 columns and 0 rows

This function does seem to work if the file is written using the "fixed" format:

df_for_r.to_hdf("export2.h5", key="df_for_r", format="fixed")  
> loadhdf5data("export2.h5")
          first      second class
1   0.675013759 0.787289926     0
2   0.936797348 0.349671699     1
3   0.951930811 0.275965069     0
4   0.203085530 0.380154180     0
5   0.627195223 0.462702969     1
6   0.129148756 0.385663581     1
...

This is somewhat at odds with the prose for this section, which reads:

HDFStore writes table format objects in specific formats suitable for producing loss-less round trips to pandas objects. For external compatibility, HDFStore can read native PyTables format tables.

It is possible to write an HDFStore object that can easily be imported into R using the rhdf5 library (Package website). Create a table format store like this:

Suggested fix for documentation

This should probably specify that the "table" format doesn't work with this R function. In addition, since external compatibility relies on the user writing code to read this format, maybe a specification for the format should be documented here?

@ivirshup added the Docs and Needs Triage labels Jul 27, 2020
@jbrockmendel added the IO HDF5 label Sep 2, 2020
@TomAugspurger
Contributor

I'd recommend removing that section and pointing people to projects like

  • parquet for tabular data
  • h5py if you're using simple types and need to use HDF5

@TomAugspurger removed the Needs Triage label Sep 4, 2020