Commit d7d868f

DOC: io.rst fixups

1 parent dab318f commit d7d868f

1 file changed: doc/source/io.rst (+99 -98 lines)
@@ -3279,88 +3279,89 @@ External Compatibility
``HDFStore`` writes ``table`` format objects in specific formats suitable for
producing loss-less round trips to pandas objects. For external
compatibility, ``HDFStore`` can read native ``PyTables`` format
tables.

It is possible to write an ``HDFStore`` object that can easily be imported into ``R`` using the
``rhdf5`` library (`Package website`_). Create a table format store like this:

.. _package website: http://www.bioconductor.org/packages/release/bioc/html/rhdf5.html

.. ipython:: python

   np.random.seed(1)
   df_for_r = pd.DataFrame({"first": np.random.rand(100),
                            "second": np.random.rand(100),
                            "class": np.random.randint(0, 2, (100,))},
                           index=range(100))
   df_for_r.head()

   store_export = HDFStore('export.h5')
   store_export.append('df_for_r', df_for_r, data_columns=df_for_r.columns)
   store_export

.. ipython:: python
   :suppress:

   store_export.close()
   import os
   os.remove('export.h5')

In R this file can be read into a ``data.frame`` object using the ``rhdf5``
library. The following example function reads the corresponding column names
and data values from the file and assembles them into a ``data.frame``:

.. code-block:: R

   # Load values and column names for all datasets from corresponding nodes and
   # insert them into one data.frame object.

   library(rhdf5)

   loadhdf5data <- function(h5File) {

       listing <- h5ls(h5File)
       # Find all data nodes, values are stored in *_values and corresponding
       # column titles in *_items
       data_nodes <- grep("_values", listing$name)
       name_nodes <- grep("_items", listing$name)
       data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
       name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
       columns = list()
       for (idx in seq(data_paths)) {
           # NOTE: matrices returned by h5read have to be transposed to obtain
           # the required Fortran order!
           data <- data.frame(t(h5read(h5File, data_paths[idx])))
           names <- t(h5read(h5File, name_paths[idx]))
           entry <- data.frame(data)
           colnames(entry) <- names
           columns <- append(columns, entry)
       }

       data <- data.frame(columns)

       return(data)
   }

Now you can import the ``DataFrame`` into R:
3346-
.. code-block:: R
3347-
3348-
> data = loadhdf5data("transfer.hdf5")
3349-
> head(data)
3350-
first second class
3351-
1 0.4170220047 0.3266449 0
3352-
2 0.7203244934 0.5270581 0
3353-
3 0.0001143748 0.8859421 1
3354-
4 0.3023325726 0.3572698 1
3355-
5 0.1467558908 0.9085352 1
3356-
6 0.0923385948 0.6233601 1
3357-
3347+
.. code-block:: R
3348+
3349+
> data = loadhdf5data("transfer.hdf5")
3350+
> head(data)
3351+
first second class
3352+
1 0.4170220047 0.3266449 0
3353+
2 0.7203244934 0.5270581 0
3354+
3 0.0001143748 0.8859421 1
3355+
4 0.3023325726 0.3572698 1
3356+
5 0.1467558908 0.9085352 1
3357+
6 0.0923385948 0.6233601 1
3358+

.. note::
   The R function lists the entire HDF5 file's contents and assembles the
   ``data.frame`` object from all matching nodes, so use this only as a
   starting point if you have stored multiple ``DataFrame`` objects to a
   single HDF5 file.
Backwards Compatibility
~~~~~~~~~~~~~~~~~~~~~~~

@@ -3374,53 +3375,53 @@ method ``copy`` to take advantage of the updates. The group attribute
number of options, please see the docstring.

.. ipython:: python
   :suppress:

   import os
   legacy_file_path = os.path.abspath('source/_static/legacy_0.10.h5')

.. ipython:: python

   # a legacy store
   legacy_store = HDFStore(legacy_file_path, 'r')
   legacy_store

   # copy (and return the new handle)
   new_store = legacy_store.copy('store_new.h5')
   new_store
   new_store.close()

.. ipython:: python
   :suppress:

   legacy_store.close()
   import os
   os.remove('store_new.h5')
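
The example above calls ``copy`` with its defaults. As a hedged sketch of the
options the docstring mentions (parameter names such as ``keys`` and
``propindexes`` are assumptions drawn from the ``copy`` docstring; verify them
against your version of pandas, and note this assumes ``legacy_store`` is
still open):

.. code-block:: python

   # copy only selected keys into the new file, recreating the
   # data-column indexes on the copied tables
   new_store = legacy_store.copy('store_new.h5', keys=['/df'],
                                 propindexes=True)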

Performance
~~~~~~~~~~~

- The ``tables`` format comes with a writing performance penalty as compared
  to ``fixed`` stores. The benefit is the ability to append/delete and
  query (potentially very large amounts of data). Query times can
  be quite fast, especially on an indexed axis.
- You can pass ``chunksize=<int>`` to ``append``, specifying the
  write chunksize (default is 50000). This will significantly lower
  your memory usage on writing. (See the sketch after this list.)
- You can pass ``expectedrows=<int>`` to the first ``append``,
  to set the TOTAL number of rows that ``PyTables`` will
  expect. This will optimize read/write performance.
- Duplicate rows can be written to tables, but are filtered out in
  selection (with the last items being selected; thus a table is
  unique on major, minor pairs).
- A ``PerformanceWarning`` will be raised if you are attempting to
  store types that will be pickled by PyTables (rather than stored as
  endemic types). See
  `Here <http://stackoverflow.com/questions/14355151/how-to-make-pandas-hdfstore-put-operation-faster/14370190#14370190>`__
  for more information and some solutions.
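
As a minimal sketch of the two ``append`` options above (``df`` and the row
counts here are hypothetical; ``store`` is assumed to be an open
``HDFStore``):

.. code-block:: python

   # write in chunks of 100,000 rows to bound peak memory usage,
   # and tell PyTables the expected final row count up front
   store.append('df', df, chunksize=100000, expectedrows=1000000)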

Experimental
~~~~~~~~~~~~
