@@ -3279,88 +3279,89 @@ External Compatibility
``HDFStore`` writes ``table`` format objects in specific formats suitable for
producing loss-less round trips to pandas objects. For external
compatibility, ``HDFStore`` can read native ``PyTables`` format
- tables. It is possible to write an ``HDFStore`` object that can easily
- be imported into ``R`` using the
+ tables.
+
+ It is possible to write an ``HDFStore`` object that can easily be imported into ``R`` using the
``rhdf5`` library (`Package website`_). Create a table format store like this:

.. _package website: http://www.bioconductor.org/packages/release/bioc/html/rhdf5.html

- .. ipython:: python
+ .. ipython:: python
+
+    np.random.seed(1)
+    df_for_r = pd.DataFrame({"first": np.random.rand(100),
+                             "second": np.random.rand(100),
+                             "class": np.random.randint(0, 2, (100,))},
+                            index=range(100))
+    df_for_r.head()

-    np.random.seed(1)
-    df_for_r = pd.DataFrame({"first": np.random.rand(100),
-                             "second": np.random.rand(100),
-                             "class": np.random.randint(0, 2, (100,))},
-                            index=range(100))
-    df_for_r.head()
+    store_export = HDFStore('export.h5')
+    store_export.append('df_for_r', df_for_r, data_columns=df_for_r.columns)
+    store_export

-    store_export = HDFStore('export.h5')
-    store_export.append('df_for_r', df_for_r, data_columns=df_dc.columns)
-    store_export
+ .. ipython:: python
+    :suppress:

- .. ipython:: python
-    :suppress:
+    store_export.close()
+    import os
+    os.remove('export.h5')

-    store_export.close()
-    import os
-    os.remove('export.h5')
-

In R this file can be read into a ``data.frame`` object using the ``rhdf5``
library. The following example function reads the corresponding column names
and data values from the HDF5 nodes and assembles them into a ``data.frame``:

- .. code-block:: R
-
-    # Load values and column names for all datasets from corresponding nodes and
-    # insert them into one data.frame object.
-
-    library(rhdf5)
-
-    loadhdf5data <- function(h5File) {
-
-        listing <- h5ls(h5File)
-        # Find all data nodes, values are stored in *_values and corresponding column
-        # titles in *_items
-        data_nodes <- grep("_values", listing$name)
-        name_nodes <- grep("_items", listing$name)
-        data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
-        name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
-        columns = list()
-        for (idx in seq(data_paths)) {
-            # NOTE: matrices returned by h5read have to be transposed to to obtain
-            # required Fortran order!
-            data <- data.frame(t(h5read(h5File, data_paths[idx])))
-            names <- t(h5read(h5File, name_paths[idx]))
-            entry <- data.frame(data)
-            colnames(entry) <- names
-            columns <- append(columns, entry)
-        }
-
-        data <- data.frame(columns)
-
-        return(data)
-    }
+ .. code-block:: R
+
+    # Load values and column names for all datasets from corresponding nodes and
+    # insert them into one data.frame object.
+
+    library(rhdf5)
+
+    loadhdf5data <- function(h5File) {
+
+        listing <- h5ls(h5File)
+        # Find all data nodes; values are stored in *_values and corresponding column
+        # titles in *_items
+        data_nodes <- grep("_values", listing$name)
+        name_nodes <- grep("_items", listing$name)
+        data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
+        name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
+        columns = list()
+        for (idx in seq(data_paths)) {
+            # NOTE: matrices returned by h5read have to be transposed to obtain
+            # required Fortran order!
+            data <- data.frame(t(h5read(h5File, data_paths[idx])))
+            names <- t(h5read(h5File, name_paths[idx]))
+            entry <- data.frame(data)
+            colnames(entry) <- names
+            columns <- append(columns, entry)
+        }
+
+        data <- data.frame(columns)
+
+        return(data)
+    }

Now you can import the ``DataFrame`` into R:

- .. code-block:: R
-
-    > data = loadhdf5data("transfer.hdf5")
-    > head(data)
-             first    second class
-    1 0.4170220047 0.3266449     0
-    2 0.7203244934 0.5270581     0
-    3 0.0001143748 0.8859421     1
-    4 0.3023325726 0.3572698     1
-    5 0.1467558908 0.9085352     1
-    6 0.0923385948 0.6233601     1
-
+ .. code-block:: R
+
+    > data = loadhdf5data("export.h5")
+    > head(data)
+             first    second class
+    1 0.4170220047 0.3266449     0
+    2 0.7203244934 0.5270581     0
+    3 0.0001143748 0.8859421     1
+    4 0.3023325726 0.3572698     1
+    5 0.1467558908 0.9085352     1
+    6 0.0923385948 0.6233601     1
+

.. note::

   The R function lists the entire HDF5 file's contents and assembles the
   ``data.frame`` object from all matching nodes, so use this only as a
   starting point if you have stored multiple ``DataFrame`` objects to a
   single HDF5 file.
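
For instance, if two ``DataFrame`` objects are written to the same file, the
loader above will pick up the nodes of both and splice their columns into one
``data.frame``. The sketch below illustrates this situation; the file and key
names (``multi.h5``, ``df_a``, ``df_b``) are hypothetical:

.. code-block:: python

   import numpy as np
   import pandas as pd

   df_a = pd.DataFrame({"x": np.random.rand(5)})
   df_b = pd.DataFrame({"y": np.random.rand(5)})

   store = pd.HDFStore('multi.h5')
   store.append('df_a', df_a)  # written under the /df_a group
   store.append('df_b', df_b)  # written under the /df_b group
   store.close()
   # In R, loadhdf5data("multi.h5") would now match nodes from BOTH groups
   # and concatenate their columns into a single data.frame.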
-
+

Backwards Compatibility
~~~~~~~~~~~~~~~~~~~~~~~

@@ -3374,53 +3375,53 @@ method ``copy`` to take advantage of the updates. The group attribute
number of options, please see the docstring.

- .. ipython:: python
-    :suppress:
+ .. ipython:: python
+    :suppress:

-    import os
-    legacy_file_path = os.path.abspath('source/_static/legacy_0.10.h5')
+    import os
+    legacy_file_path = os.path.abspath('source/_static/legacy_0.10.h5')

- .. ipython:: python
+ .. ipython:: python

-    # a legacy store
-    legacy_store = HDFStore(legacy_file_path, 'r')
-    legacy_store
+    # a legacy store
+    legacy_store = HDFStore(legacy_file_path, 'r')
+    legacy_store

-    # copy (and return the new handle)
-    new_store = legacy_store.copy('store_new.h5')
-    new_store
-    new_store.close()
+    # copy (and return the new handle)
+    new_store = legacy_store.copy('store_new.h5')
+    new_store
+    new_store.close()
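
The ``copy`` method accepts a number of options (see its docstring). As a
hedged sketch, assuming its ``keys`` and ``propindexes`` parameters, you
could copy only selected keys and skip propagating data-column indexes:

.. code-block:: python

   # A sketch, not executed as part of the docs build; '/df' is a
   # hypothetical key name inside the legacy store.
   new_store = legacy_store.copy('store_new.h5',
                                 keys=['/df'],       # copy only these keys
                                 propindexes=False)  # do not recreate indexes
   new_store.close()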

- .. ipython:: python
-    :suppress:
+ .. ipython:: python
+    :suppress:

-    legacy_store.close()
-    import os
-    os.remove('store_new.h5')
+    legacy_store.close()
+    import os
+    os.remove('store_new.h5')

Performance
~~~~~~~~~~~

- - ``Tables`` come with a writing performance penalty as compared to
-   regular stores. The benefit is the ability to append/delete and
-   query (potentially very large amounts of data). Write times are
-   generally longer as compared with regular stores. Query times can
-   be quite fast, especially on an indexed axis.
- - You can pass ``chunksize=<int>`` to ``append``, specifying the
-   write chunksize (default is 50000). This will significantly lower
-   your memory usage on writing.
- - You can pass ``expectedrows=<int>`` to the first ``append``,
-   to set the TOTAL number of expected rows that ``PyTables`` will
-   expected. This will optimize read/write performance.
- - Duplicate rows can be written to tables, but are filtered out in
-   selection (with the last items being selected; thus a table is
-   unique on major, minor pairs)
- - A ``PerformanceWarning`` will be raised if you are attempting to
-   store types that will be pickled by PyTables (rather than stored as
-   endemic types). See
-   `Here <http://stackoverflow.com/questions/14355151/how-to-make-pandas-hdfstore-put-operation-faster/14370190#14370190>`__
-   for more information and some solutions.
+ - The ``tables`` format comes with a writing performance penalty as compared to
+   ``fixed`` stores. The benefit is the ability to append/delete and
+   query (potentially very large amounts of data). Write times are
+   generally longer as compared with ``fixed`` stores. Query times can
+   be quite fast, especially on an indexed axis.
+ - You can pass ``chunksize=<int>`` to ``append``, specifying the
+   write chunksize (default is 50000). This will significantly lower
+   your memory usage on writing (see the sketch after this list).
+ - You can pass ``expectedrows=<int>`` to the first ``append``
+   to set the TOTAL number of rows that ``PyTables`` will
+   expect. This will optimize read/write performance.
+ - Duplicate rows can be written to tables, but are filtered out in
+   selection (with the last items being selected; thus a table is
+   unique on major, minor pairs).
+ - A ``PerformanceWarning`` will be raised if you are attempting to
+   store types that will be pickled by PyTables (rather than stored as
+   endemic types). See
+   `here <http://stackoverflow.com/questions/14355151/how-to-make-pandas-hdfstore-put-operation-faster/14370190#14370190>`__
+   for more information and some solutions.
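
As a hedged sketch of the two write-tuning options above (``chunksize`` and
``expectedrows``), assuming a large ``DataFrame`` and the placeholder file
name ``big.h5``:

.. code-block:: python

   import numpy as np
   import pandas as pd

   df = pd.DataFrame(np.random.randn(1000000, 3), columns=['a', 'b', 'c'])

   store = pd.HDFStore('big.h5')
   # Write in 100,000-row chunks to bound memory usage, and declare the
   # total expected row count up front so PyTables can size the table.
   store.append('df', df, chunksize=100000, expectedrows=1000000)
   store.close()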

Experimental
~~~~~~~~~~~~