doc/source/io.rst
Deletion of the object specified by the key

.. ipython:: python

   del store['wp']

   store
Closing a Store
~~~~~~~~~~~~~~~

.. ipython:: python

   # closing a store
   store.close()

   # working with, and automatically closing, the store via the context manager
   with get_store('store.h5') as store:
       store.keys()

.. ipython:: python
   :suppress:

   store.open()
Storing Mixed Types in a Table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Storing mixed-dtype data is supported. Strings are stored as fixed-width using the maximum size of the appended column. Subsequent appends will truncate strings at this length.

Passing ``min_itemsize = { `values` : size }`` as a parameter to append will set a larger minimum for the string columns. Storing ``floats, strings, ints, bools, datetime64`` is currently supported. For string columns, passing ``nan_rep = 'my_nan_rep'`` to append will change the default nan representation on disk (which converts to/from ``np.nan``); this defaults to ``nan``.
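For example (an illustrative sketch; the ``df_mixed`` frame below is a stand-in for the original example frame):

.. ipython:: python

   df_mixed = DataFrame({'A': randn(8), 'B': randn(8),
                         'string': 'string', 'int': 1, 'bool': True,
                         'datetime64': Timestamp('20010102')},
                        index=range(8))

   # store with a larger minimum size for the string columns
   store.append('df_mixed', df_mixed, min_itemsize={'values': 50})
   df_mixed1 = store.select('df_mixed')
   df_mixed1.get_dtype_counts()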
.. ipython:: python

   # we have provided a minimum string column size
   store.root.df_mixed.table
It is ok to store ``np.nan`` in a ``float`` or ``string`` column. Make sure to do a ``convert_objects()`` on the frame before storing a ``np.nan`` in a ``datetime64`` column. Storing ``np.nan`` in an ``int`` or ``bool`` column will currently raise an ``Exception``, as these columns will have been converted to ``object`` type.
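For example (a minimal sketch; ``df_dt`` is a hypothetical frame):

.. ipython:: python

   df_dt = DataFrame({'A': randn(5), 'datetime64': Timestamp('20010102')},
                     index=range(5))

   # assigning np.nan upcasts the datetime64 column to object
   df_dt.ix[2, 'datetime64'] = np.nan

   # coerce it back to datetime64 before storing
   df_dt = df_dt.convert_objects()
   store.append('df_dt', df_dt)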
Storing Multi-Index DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Storing multi-index DataFrames as tables is very similar to storing/selecting from homogeneous index DataFrames.

.. ipython:: python

   index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
                              ['one', 'two', 'three']],
                      labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
                              [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
                      names=['foo', 'bar'])
   df_mi = DataFrame(randn(10, 3), index=index, columns=['A', 'B', 'C'])

   store.append('df_mi', df_mi)
   store.select('df_mi')
The ``columns`` keyword can be supplied to ``select`` to filter the list of returned columns; this is equivalent to passing a ``Term('columns', list_of_columns_to_filter)``.

.. ipython:: python

   store.select('df', columns=['A', 'B'])
``start`` and ``stop`` parameters can be specified to limit the total search space. These are in terms of the total number of rows in a table.

.. ipython:: python

   # this is effectively what the storage of a Panel looks like
   wp.to_frame()

   # limiting the search
   store.select('wp', [Term('major_axis>20000102'),
                       Term('minor_axis', '=', ['A', 'B'])],
                start=0, stop=10)
You can create/modify an index for a table with ``create_table_index`` after data is already in the table (after an ``append/put`` operation). Creating a table index is **highly** encouraged. This will speed your queries a great deal when you use a ``select`` with the indexed dimension as the ``where``. **Indexes are automagically created (starting 0.10.1)** on the indexables and any data columns you specify. This behavior can be turned off by passing ``index=False`` to ``append``.

.. ipython:: python

   # we have automagically already created an index (in the first section)
   i = store.root.df.table.cols.index.index
   i.optlevel, i.kind
.. ipython:: python

   # change an index by passing new parameters
   store.create_table_index('df', optlevel=9, kind='full')
   i = store.root.df.table.cols.index.index
   i.optlevel, i.kind
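To turn off index creation for a particular ``append`` (a minimal sketch; ``df_no_index`` is a hypothetical key):

.. code-block:: python

   # no indexes will be created on the indexables or data columns
   store.append('df_no_index', df, index=False)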
Query via Data Columns
~~~~~~~~~~~~~~~~~~~~~~

You can designate (and index) certain columns on which you want to be able to perform queries (other than the `indexable` columns, which you can always query). For instance, say you want to perform this common operation, on-disk, and return just the frame that matches this query.
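A sketch of such a frame and its storage with ``data_columns`` (``df_dc`` is illustrative, reusing the ``df`` from earlier examples, which is assumed to have float columns ``B`` and ``C``):

.. ipython:: python

   df_dc = df.copy()
   df_dc['string'] = 'foo'
   df_dc.ix[4:6, 'string'] = np.nan
   df_dc.ix[7:9, 'string'] = 'bar'
   df_dc['string2'] = 'cool'
   df_dc

   # on-disk operations
   store.append('df_dc', df_dc, data_columns=['B', 'C', 'string', 'string2'])
   store.select('df_dc', [Term('B>0')])

   # getting creative
   store.select('df_dc', ['B > 0', 'C > 0', 'string == foo'])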
.. ipython:: python

   # we have automagically created this index, and the B/C/string/string2
   # columns are stored separately as ``PyTables`` columns
   store.root.df_dc.table
There is some performance degradation by making lots of columns into `data columns`, so it is up to the user to designate these. In addition, you cannot change data columns (nor indexables) after the first append/put operation. (Of course you can simply read in the data and create a new table!)
Advanced Queries
~~~~~~~~~~~~~~~~

**Unique**

To retrieve the *unique* values of an indexable or data column, use the method ``unique``. This will, for example, enable you to get the index very quickly. Note that ``nan`` values are excluded from the result set.

.. ipython:: python

   store.unique('df_dc', 'index')
   store.unique('df_dc', 'string')
**Replicating or**

``not`` and ``or`` conditions are unsupported at this time; however, ``or`` operations are easy to replicate by repeatedly applying the criteria to the table, and then using ``concat`` on the results.

.. ipython:: python

   # two sets of criteria (illustrative)
   crit1 = [Term('B>0'), Term('C>0'), Term('string=foo')]
   crit2 = [Term('B<0'), Term('C>0'), Term('string=foo')]

   concat([store.select('df_dc', c) for c in [crit1, crit2]])
**Table Object**

If you want to inspect the table object, retrieve it via ``get_table``. You could use this programmatically to, say, get the number of rows in the table.

.. ipython:: python

   store.get_table('df_dc').nrows
Multiple Table Queries
~~~~~~~~~~~~~~~~~~~~~~

New in 0.10.1 are the methods ``append_to_multiple`` and ``select_as_multiple``, which can perform appending/selecting from multiple tables at once. The idea is to have one table (call it the selector table) on which you index most/all of the columns, and perform your queries. The other table(s) are data tables indexed the same as the selector table. You can then perform a very fast query on the selector table, yet get lots of data back. This method works similarly to having a very wide table, but is more efficient in terms of queries.

Note, **THE USER IS RESPONSIBLE FOR SYNCHRONIZING THE TABLES**. This means: append to the tables in the same order. ``append_to_multiple`` splits a single object into multiple tables, given a specification (as a dictionary). This dictionary is a mapping of the table names to the 'columns' you want included in that table. Pass a `None` for a single table (optional) to let it have the remaining columns. The argument ``selector`` defines which table is the selector table.
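For example (an illustrative sketch; ``df_mt`` and the table names below are hypothetical):

.. ipython:: python

   df_mt = DataFrame(randn(8, 6), index=date_range('1/1/2000', periods=8),
                     columns=['A', 'B', 'C', 'D', 'E', 'F'])
   df_mt['foo'] = 'bar'

   # split the frame across two tables: 'df1_mt' holds A/B, 'df2_mt' gets
   # the remaining columns; 'df1_mt' is the selector
   store.append_to_multiple({'df1_mt': ['A', 'B'], 'df2_mt': None},
                            df_mt, selector='df1_mt')

   # query the selector table, returning the concatenated columns
   store.select_as_multiple(['df1_mt', 'df2_mt'], where=['A>0', 'B>0'],
                            selector='df1_mt')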
Delete from a Table
~~~~~~~~~~~~~~~~~~~

You can delete from a table selectively by specifying a ``where``. In deleting rows, it is important to understand that ``PyTables`` deletes rows by erasing the rows, then **moving** the following data. Thus deleting can potentially be a very expensive operation, depending on the orientation of your data. This is especially true in higher-dimensional objects (``Panel`` and ``Panel4D``). To get optimal deletion speed, it pays to have the dimension you are deleting be the first of the ``indexables``.

It should be clear that a delete operation on the ``major_axis`` will be fairly quick, as one chunk is removed, then the following data moved.

.. ipython:: python

   store.remove('wp', 'major_axis>20000102')
   store.select('wp')
Please note that HDF5 **DOES NOT RECLAIM SPACE** in the h5 files automatically. Thus, repeatedly deleting (or removing nodes) and adding again **WILL TEND TO INCREASE THE FILE SIZE**. To *clean* the file, use ``ptrepack`` (see below).
Compression
~~~~~~~~~~~

``PyTables`` allows the stored data to be compressed. This applies to all kinds of stores, not just tables.

- Pass ``complevel=int`` for a compression level (1-9, with 0 being no compression, and the default)
- Pass ``complib=lib`` where lib is any of ``zlib, bzip2, lzo, blosc`` for whichever compression library you prefer.

``HDFStore`` will use the file-based compression scheme if no overriding ``complib`` or ``complevel`` options are provided. ``blosc`` offers very fast compression, and is my most used. Note that ``lzo`` and ``bzip2`` may not be installed (by Python) by default.
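For example (a minimal sketch; the file name is hypothetical):

.. code-block:: python

   # enable on-disk compression for everything written to this store
   store_compressed = HDFStore('store_compressed.h5',
                               complevel=9, complib='blosc')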
``PyTables`` offers better write performance when tables are compressed after writing, as opposed to turning on compression at the very beginning. You can use the supplied ``PyTables`` utility ``ptrepack``. In addition, ``ptrepack`` can change compression levels after the fact.

Furthermore, ``ptrepack in.h5 out.h5`` will *repack* the file to allow you to reuse previously deleted space (alternatively, one can simply remove the file and write again).
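For example (the file names are placeholders; ``--complevel`` and ``--complib`` are standard ``ptrepack`` options):

.. code-block:: console

   # recompress, and repack previously deleted space, into a new file
   ptrepack --complevel=9 --complib=blosc in.h5 out.h5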
Notes & Caveats
~~~~~~~~~~~~~~~
Performance
~~~~~~~~~~~

- ``Tables`` come with a writing performance penalty as compared to regular stores. The benefit is the ability to append/delete and query (potentially very large amounts of data). Write times are generally longer as compared with regular stores. Query times can be quite fast, especially on an indexed axis.
- You can pass ``chunksize=an integer`` to ``append`` to change the writing chunksize (default is 50000). This will significantly lower your memory usage on writing (see the sketch after this list).
- You can pass ``expectedrows=an integer`` to the first ``append`` to set the TOTAL number of rows that ``PyTables`` will expect. This will optimize read/write performance.
- Duplicate rows can be written to tables, but are filtered out in selection (with the last items being selected; thus a table is unique on major, minor pairs)
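A sketch combining the two options above (``df_big`` and the sizes are illustrative):

.. code-block:: python

   # write in 10,000-row chunks, pre-sizing the table for one million rows
   store.append('df_big', df, chunksize=10000, expectedrows=1000000)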