DOC: HDF5 External Compatibility section examples don't work #35419

ivirshup · 2020-07-27T05:39:21Z

Location of the documentation

https://pandas.pydata.org/docs/user_guide/io.html#external-compatibility.

Documentation problem

This section probably has at least one typo and, more generally, doesn't seem to document current behaviour.

I'll quickly run through the example here, but with a bit of cleaning so we don't have to run the entire page.

import pandas as pd
import numpy as np

df_for_r = pd.DataFrame({"first": np.random.rand(100),
                         "second": np.random.rand(100),
                         "class": np.random.randint(0, 2, (100, ))},
                        index=range(100))


store_export = pd.HDFStore('export.h5')

# In the documentation, this is written with 'data_columns=df_dc.columns', which I'm assuming is a mistake
store_export.append('df_for_r', df_for_r, data_columns=df_for_r.columns)

store_export

We can take a look at what's in this file:

store_export.close()
!h5ls -r export.h5

Output

/                        Group
/df_for_r                Group
/df_for_r/_i_table       Group
/df_for_r/_i_table/class Group
/df_for_r/_i_table/class/abounds Dataset {0/Inf}
/df_for_r/_i_table/class/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/class/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/class/indicesLR Dataset {131072}
/df_for_r/_i_table/class/mbounds Dataset {0/Inf}
/df_for_r/_i_table/class/mranges Dataset {0/Inf}
/df_for_r/_i_table/class/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/class/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/class/sortedLR Dataset {131201}
/df_for_r/_i_table/class/zbounds Dataset {0/Inf}
/df_for_r/_i_table/first Group
/df_for_r/_i_table/first/abounds Dataset {0/Inf}
/df_for_r/_i_table/first/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/first/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/first/indicesLR Dataset {131072}
/df_for_r/_i_table/first/mbounds Dataset {0/Inf}
/df_for_r/_i_table/first/mranges Dataset {0/Inf}
/df_for_r/_i_table/first/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/first/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/first/sortedLR Dataset {131201}
/df_for_r/_i_table/first/zbounds Dataset {0/Inf}
/df_for_r/_i_table/index Group
/df_for_r/_i_table/index/abounds Dataset {0/Inf}
/df_for_r/_i_table/index/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/index/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/index/indicesLR Dataset {131072}
/df_for_r/_i_table/index/mbounds Dataset {0/Inf}
/df_for_r/_i_table/index/mranges Dataset {0/Inf}
/df_for_r/_i_table/index/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/index/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/index/sortedLR Dataset {131201}
/df_for_r/_i_table/index/zbounds Dataset {0/Inf}
/df_for_r/_i_table/second Group
/df_for_r/_i_table/second/abounds Dataset {0/Inf}
/df_for_r/_i_table/second/bounds Dataset {0/Inf, 127}
/df_for_r/_i_table/second/indices Dataset {0/Inf, 131072}
/df_for_r/_i_table/second/indicesLR Dataset {131072}
/df_for_r/_i_table/second/mbounds Dataset {0/Inf}
/df_for_r/_i_table/second/mranges Dataset {0/Inf}
/df_for_r/_i_table/second/ranges Dataset {0/Inf, 2}
/df_for_r/_i_table/second/sorted Dataset {0/Inf, 131072}
/df_for_r/_i_table/second/sortedLR Dataset {131201}
/df_for_r/_i_table/second/zbounds Dataset {0/Inf}
/df_for_r/table          Dataset {200/Inf}

Next, the documentation gives an R function for reading in this data. Comparing that function with the file written above already shows a mismatch:

library(rhdf5)

loadhdf5data <- function(h5File) {

listing <- h5ls(h5File)
# Find all data nodes, values are stored in *_values and corresponding column
# titles in *_items
data_nodes <- grep("_values", listing$name)
name_nodes <- grep("_items", listing$name)
data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
columns = list()
for (idx in seq(data_paths)) {
  # NOTE: matrices returned by h5read have to be transposed to obtain
  # required Fortran order!
  data <- data.frame(t(h5read(h5File, data_paths[idx])))
  names <- t(h5read(h5File, name_paths[idx]))
  entry <- data.frame(data)
  colnames(entry) <- names
  columns <- append(columns, entry)
}

data <- data.frame(columns)

return(data)
}

For example, no entries in export.h5 have _values or _items in their names.
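The name matching that the R function relies on can be sketched in Python to make the mismatch concrete. This is only an illustration: `find_block_nodes` is a hypothetical helper that mimics the two `grep` calls, and `table_paths` is a hand-copied subset of the `h5ls` listing above, not something read from the file.

```python
import re


def find_block_nodes(paths):
    """Mimic the R helper's grep calls: pick nodes whose name contains
    "_values" (data) or "_items" (column titles)."""
    data_nodes = [p for p in paths if re.search(r"_values", p.rsplit("/", 1)[-1])]
    name_nodes = [p for p in paths if re.search(r"_items", p.rsplit("/", 1)[-1])]
    return data_nodes, name_nodes


# A few of the paths from the h5ls listing of export.h5 (table format).
table_paths = [
    "/df_for_r",
    "/df_for_r/_i_table",
    "/df_for_r/_i_table/class",
    "/df_for_r/_i_table/class/abounds",
    "/df_for_r/_i_table/class/sorted",
    "/df_for_r/table",
]

data_nodes, name_nodes = find_block_nodes(table_paths)
print(data_nodes, name_nodes)  # prints: [] []
```

With nothing matched, the loop in `loadhdf5data` never runs, which is consistent with the empty data frame shown next.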

If we actually call this function, we get an empty dataframe back:

> loadhdf5data("export.h5")
data frame with 0 columns and 0 rows

This function does seem to work if the file is written using the "fixed" format:

df_for_r.to_hdf("export2.h5", key="df_for_r", format="fixed")

> loadhdf5data("export2.h5")
          first      second class
1   0.675013759 0.787289926     0
2   0.936797348 0.349671699     1
3   0.951930811 0.275965069     0
4   0.203085530 0.380154180     0
5   0.627195223 0.462702969     1
6   0.129148756 0.385663581     1
...
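One way to check why the fixed-format file works is to compare the node names that pandas writes for each format. The following is a rough sketch, assuming PyTables (`tables`) is installed alongside pandas; `node_names` and the temporary file names are hypothetical and not part of the documentation example. It only lists visible nodes, so PyTables index groups such as `_i_table` may not appear.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import tables  # PyTables; pandas needs it for HDF5 I/O


def node_names(path):
    """Return the names of all visible nodes in an HDF5 file, excluding the root."""
    with tables.open_file(path, mode="r") as h5:
        return sorted(
            node._v_name for node in h5.walk_nodes("/") if node._v_pathname != "/"
        )


df = pd.DataFrame({
    "first": np.random.rand(100),
    "second": np.random.rand(100),
    "class": np.random.randint(0, 2, 100),
})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names, written to a throwaway directory.
    table_path = os.path.join(tmp, "table_format.h5")
    fixed_path = os.path.join(tmp, "fixed_format.h5")
    df.to_hdf(table_path, key="df_for_r", format="table", data_columns=True)
    df.to_hdf(fixed_path, key="df_for_r", format="fixed")
    table_names = node_names(table_path)
    fixed_names = node_names(fixed_path)

print("table:", table_names)
print("fixed:", fixed_names)
```

If the fixed-format listing contains names ending in `_values` and `_items` while the table-format listing does not, that would explain why the R function only finds data in the fixed-format file.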

This is somewhat contrary to the prose for this section, which reads:

HDFStore writes table format objects in specific formats suitable for producing loss-less round trips to pandas objects. For external compatibility, HDFStore can read native PyTables format tables.

It is possible to write an HDFStore object that can easily be imported into R using the rhdf5 library (Package website). Create a table format store like this:
