Commit dab318f

Joschka zur Jacobsmühlen authored and jreback committed
DOC: example of pandas to R transfer of DataFrame using HDF5 file
1 parent 482643a commit dab318f

3 files changed: +79 -6 lines changed

doc/source/comparison_with_r.rst

+4
@@ -27,6 +27,10 @@ libraries, we care about the following things:
 This page is also here to offer a bit of a translation guide for users of these
 R packages.

+For transfer of ``DataFrame`` objects from ``pandas`` to R, one option is to
+use HDF5 files; see :ref:`io.external_compatibility` for an
+example.
+
 Base R
 ------

doc/source/io.rst

+72 -5
@@ -3271,20 +3271,31 @@ You could inadvertently turn an actual ``nan`` value into a missing value.
    store.append('dfss2', dfss, nan_rep='_nan_')
    store.select('dfss2')

+.. _io.external_compatibility:
+
 External Compatibility
 ~~~~~~~~~~~~~~~~~~~~~~

-``HDFStore`` write ``table`` format objects in specific formats suitable for
+``HDFStore`` writes ``table`` format objects in specific formats suitable for
 producing loss-less round trips to pandas objects. For external
 compatibility, ``HDFStore`` can read native ``PyTables`` format
 tables. It is possible to write an ``HDFStore`` object that can easily
-be imported into ``R`` using the ``rhdf5`` library. Create a table
-format store like this:
+be imported into ``R`` using the
+``rhdf5`` library (`Package website`_). Create a table format store like this:
+
+.. _package website: http://www.bioconductor.org/packages/release/bioc/html/rhdf5.html

 .. ipython:: python

+   np.random.seed(1)
+   df_for_r = pd.DataFrame({"first": np.random.rand(100),
+                            "second": np.random.rand(100),
+                            "class": np.random.randint(0, 2, (100,))},
+                           index=range(100))
+   df_for_r.head()
+
    store_export = HDFStore('export.h5')
-   store_export.append('df_dc', df_dc, data_columns=df_dc.columns)
+   store_export.append('df_for_r', df_for_r, data_columns=df_dc.columns)
    store_export

 .. ipython:: python
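
The export step in the hunk above can also be tried outside the documentation build. The following is a minimal sketch, not part of this commit. It assumes pandas with PyTables installed. The output path, the column name ``group`` and the choice to pass the frame's own columns as ``data_columns`` are illustrative assumptions rather than what the committed example does.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical output path, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "transfer_sketch.h5")

# Small frame in the spirit of the documentation example; the column
# names are assumptions made for this sketch.
rng = np.random.RandomState(1)
df_for_r = pd.DataFrame({
    "first": rng.rand(100),
    "second": rng.rand(100),
    "group": rng.randint(0, 2, 100),
})

# The context manager closes the file even if append() raises.
with pd.HDFStore(path, mode="w") as store:
    # Assumption: every column of this frame is stored as a data column.
    store.append("df_for_r", df_for_r, data_columns=list(df_for_r.columns))

# Read the table back to check the round trip within pandas.
with pd.HDFStore(path, mode="r") as store:
    round_trip = store.select("df_for_r")

print(round_trip.head())
```

Using ``with`` avoids the explicit ``close()`` call that the documentation example issues in a later block.
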
@@ -3293,7 +3304,63 @@ format store like this:
    store_export.close()
    import os
    os.remove('export.h5')
-
+
+In R this file can be read into a ``data.frame`` object using the ``rhdf5``
+library. The following example function reads the corresponding column names
+and data values from the file and assembles them into a ``data.frame``:
+
+.. code-block:: R
+
+   # Load values and column names for all datasets from corresponding nodes and
+   # insert them into one data.frame object.
+
+   library(rhdf5)
+
+   loadhdf5data <- function(h5File) {
+
+     listing <- h5ls(h5File)
+     # Find all data nodes, values are stored in *_values and corresponding column
+     # titles in *_items
+     data_nodes <- grep("_values", listing$name)
+     name_nodes <- grep("_items", listing$name)
+     data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
+     name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
+     columns = list()
+     for (idx in seq(data_paths)) {
+       # NOTE: matrices returned by h5read have to be transposed to obtain
+       # required Fortran order!
+       data <- data.frame(t(h5read(h5File, data_paths[idx])))
+       names <- t(h5read(h5File, name_paths[idx]))
+       entry <- data.frame(data)
+       colnames(entry) <- names
+       columns <- append(columns, entry)
+     }
+
+     data <- data.frame(columns)
+
+     return(data)
+   }
+
+Now you can import the ``DataFrame`` into R:
+
+.. code-block:: R
+
+   > data = loadhdf5data("transfer.hdf5")
+   > head(data)
+            first    second class
+   1 0.4170220047 0.3266449     0
+   2 0.7203244934 0.5270581     0
+   3 0.0001143748 0.8859421     1
+   4 0.3023325726 0.3572698     1
+   5 0.1467558908 0.9085352     1
+   6 0.0923385948 0.6233601     1
+
+.. note::
+   The R function lists the entire HDF5 file's contents and assembles the
+   ``data.frame`` object from all matching nodes, so use this only as a
+   starting point if you have stored multiple ``DataFrame`` objects to a
+   single HDF5 file.
+
 Backwards Compatibility
 ~~~~~~~~~~~~~~~~~~~~~~~
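
The caveat in the note about multiple stored ``DataFrame`` objects can be illustrated without R. The sketch below is a hypothetical Python mirror of the matching step in ``loadhdf5data``, not code from this commit. It takes a made-up listing of (group, name) pairs, standing in for the output of ``h5ls``, and selects every node whose name contains ``_values`` or ``_items``. The group and node names are assumptions chosen for illustration, not the layout of any particular file.

```python
# Hypothetical (group, name) listing standing in for the output of h5ls().
# Two stored frames are assumed here, under /df_a and /df_b.
listing = [
    ("/", "df_a"),
    ("/df_a", "axis0"),
    ("/df_a", "block0_items"),
    ("/df_a", "block0_values"),
    ("/", "df_b"),
    ("/df_b", "axis0"),
    ("/df_b", "block0_items"),
    ("/df_b", "block0_values"),
]


def matching_paths(listing, marker):
    """Return group/name paths for every node whose name contains marker."""
    return [f"{group}/{name}" for group, name in listing if marker in name]


data_paths = matching_paths(listing, "_values")
name_paths = matching_paths(listing, "_items")

# Nodes from both frames are returned; nothing restricts the match to one key.
for data_path, name_path in zip(data_paths, name_paths):
    print(data_path, "<-", name_path)
```

Filtering the listing by group before matching, for example keeping only the rows whose group equals the key that was written, is one way to restrict the result to a single frame.
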

doc/source/whatsnew/v0.16.0.txt

+3 -1
@@ -198,7 +198,9 @@ Other enhancements
 - Added ``days_in_month`` (compatibility alias ``daysinmonth``) property to ``Timestamp``, ``DatetimeIndex``, ``Period``, ``PeriodIndex``, and ``Series.dt`` (:issue:`9572`)
 - Added ``decimal`` option in ``to_csv`` to provide formatting for non-'.' decimal separators (:issue:`781`)
 - Added ``normalize`` option for ``Timestamp`` to normalize to midnight (:issue:`8794`)
-
+- Added example for ``DataFrame`` import to R using HDF5 file and ``rhdf5``
+  library. See the :ref:`documentation <io.external_compatibility>` for more
+  (:issue:`9636`).

 .. _whatsnew_0160.api:
