Commit 48434d2

Merge PR #2561
2 parents 2b216af + 098cbfb commit 48434d2

File tree

11 files changed (+1689 -367 lines)


RELEASE.rst

+37
@@ -22,6 +22,43 @@ Where to get it

* Binary installers on PyPI: http://pypi.python.org/pypi/pandas
* Documentation: http://pandas.pydata.org

pandas 0.10.1
=============

**Release date:** 2013-??-??

**New features**

**Improvements to existing features**

- ``HDFStore``

  - enables storing of multi-index DataFrames (closes GH1277_)
  - supports data column indexing and selection, via the ``data_columns`` keyword to ``append``
  - supports write chunking to reduce the memory footprint, via the ``chunksize`` keyword to ``append``
  - supports automagic indexing via the ``index`` keyword to ``append``
  - supports the ``expectedrows`` keyword in ``append`` to inform ``PyTables`` of the expected table size
  - supports ``start`` and ``stop`` keywords in ``select`` to limit the row selection space
  - added the ``get_store`` context manager, imported automatically with pandas
  - added column filtering via the ``columns`` keyword in ``select``
  - added the methods ``append_to_multiple``/``select_as_multiple``/``select_as_coordinates`` to perform appends/selections across multiple tables
  - added support for ``datetime64`` in columns
  - added the method ``unique`` to select the unique values of an indexable or data column

**Bug fixes**

- ``HDFStore``

  - correctly handle ``nan`` elements in string columns; serialize via the ``nan_rep`` keyword to ``append``
  - raise correctly on non-implemented column types (unicode/date)
  - correctly handle ``Term``-passed types (e.g. ``index<1000``, when the index is ``Int64``) (closes GH512_)

**API Changes**

- ``HDFStore``

  - removed the ``compression`` keyword from ``put`` (replaced by the ``complib`` keyword, to be consistent across the library)

.. _GH512: https://github.com/pydata/pandas/issues/512
.. _GH1277: https://github.com/pydata/pandas/issues/1277
pandas 0.10.0
=============

doc/source/io.rst

+175 -14
@@ -1030,6 +1030,17 @@ Deletion of the object specified by the key

   del store['wp']
   store

Closing a Store
~~~~~~~~~~~~~~~

.. ipython:: python

   # closing a store
   store.close()

   # working with, and automatically closing, the store with the context manager
   with get_store('store.h5') as store:
       store.keys()

.. ipython:: python
   :suppress:
@@ -1095,14 +1106,19 @@ Storing Mixed Types in a Table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Storing mixed-dtype data is supported. Strings are stored as fixed-width, using the maximum size of the appended column; subsequent appends will truncate strings at this length.
Passing ``min_itemsize = { 'values' : size }`` as a parameter to ``append`` will set a larger minimum for the string columns. Storing ``floats, strings, ints, bools, datetime64`` is currently supported. For string columns, passing ``nan_rep = 'my_nan_rep'`` to ``append`` will change the default nan representation on disk (which converts to/from ``np.nan``); this defaults to ``nan``.

.. ipython:: python

   df_mixed = df.copy()
   df_mixed['string'] = 'string'
   df_mixed['int'] = 1
   df_mixed['bool'] = True
   df_mixed['datetime64'] = Timestamp('20010102')

   # make sure that we have datetime64[ns] types
   df_mixed = df_mixed.convert_objects()
   df_mixed.ix[3:5, ['A','B','string','datetime64']] = np.nan

   store.append('df_mixed', df_mixed, min_itemsize = { 'values' : 50 })
   df_mixed1 = store.select('df_mixed')
@@ -1112,10 +1128,33 @@ Passing ``min_itemsize = { `values` : size }`` as a parameter to append will set

   # we have provided a minimum string column size
   store.root.df_mixed.table

It is ok to store ``np.nan`` in a ``float`` or ``string`` column. Make sure to do a ``convert_objects()`` on the frame before storing ``np.nan`` in a ``datetime64`` column. Storing ``np.nan`` in an ``int`` or ``bool`` column will currently throw an ``Exception``, as these columns will have been converted to ``object`` type.

Storing Multi-Index DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Storing multi-index DataFrames as tables is very similar to storing/selecting from homogeneous index DataFrames.

.. ipython:: python

   index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
                              ['one', 'two', 'three']],
                      labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
                              [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
                      names=['foo', 'bar'])
   df_mi = DataFrame(np.random.randn(10, 3), index=index,
                     columns=['A', 'B', 'C'])
   df_mi

   store.append('df_mi', df_mi)
   store.select('df_mi')

   # the levels are automatically included as data columns
   store.select('df_mi', Term('foo=bar'))
Querying a Table
~~~~~~~~~~~~~~~~

``select`` and ``delete`` operations have an optional criterion that can be specified to select/delete only a subset of the data. This allows one to have a very large on-disk table and retrieve only a portion of the data.

@@ -1128,7 +1167,7 @@ Valid terms can be created from ``dict, list, tuple, or string``. Objects can be

- ``dict(field = 'index', op = '>', value = '20121114')``
- ``('index', '>', '20121114')``
- ``'index > 20121114'``
- ``('index', '>', datetime(2012,11,14))``
- ``('index', ['20121114','20121115'])``
- ``('major_axis', '=', Timestamp('2012/11/14'))``
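As a minimal sketch, the term forms above are interchangeable ways of spelling the same condition; the ``Term``/``store`` usage appears only in comments, since it assumes an open ``HDFStore``:

```python
from datetime import datetime

# equivalent ways of expressing the condition "index > 2012-11-14"
as_dict = dict(field='index', op='>', value='20121114')
as_tuple = ('index', '>', '20121114')
as_string = 'index > 20121114'
as_dt = ('index', '>', datetime(2012, 11, 14))

# any of these could be wrapped in Term(...) and passed as a where clause, e.g.
#   store.select('wp', [Term(*as_tuple)])
for form in (as_dict, as_tuple, as_string, as_dt):
    print(form)
```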
@@ -1143,14 +1182,30 @@ Queries are built up using a list of ``Terms`` (currently only **anding** of ter
   store
   store.select('wp', [ Term('major_axis>20000102'), Term('minor_axis', '=', ['A','B']) ])

The ``columns`` keyword can be supplied to ``select`` to filter the list of returned columns; this is equivalent to passing a ``Term('columns', list_of_columns_to_filter)``.

.. ipython:: python

   store.select('df', columns = ['A','B'])

``start`` and ``stop`` parameters can be specified to limit the total search space. These are in terms of the total number of rows in a table.

.. ipython:: python

   # this is effectively what the storage of a Panel looks like
   wp.to_frame()

   # limiting the search
   store.select('wp', [ Term('major_axis>20000102'), Term('minor_axis', '=', ['A','B']) ], start=0, stop=10)
Indexing
~~~~~~~~

You can create/modify an index for a table with ``create_table_index`` after data is already in the table (after an ``append``/``put`` operation). Creating a table index is **highly** encouraged. This will speed your queries a great deal when you use a ``select`` with the indexed dimension as the ``where``. **Indexes are automagically created (starting with 0.10.1)** on the indexables and any data columns you specify. This behavior can be turned off by passing ``index=False`` to ``append``.

.. ipython:: python

   # we have automagically already created an index (in the first section)
   i = store.root.df.table.cols.index.index
   i.optlevel, i.kind
@@ -1160,6 +1215,90 @@ You can create an index for a table with ``create_table_index`` after data is al
   i.optlevel, i.kind
Query via Data Columns
~~~~~~~~~~~~~~~~~~~~~~

You can designate (and index) certain columns that you want to be able to perform queries on (other than the `indexable` columns, which you can always query). For instance, say you want to perform this common operation, on-disk, and return just the frame that matches this query.

.. ipython:: python

   df_dc = df.copy()
   df_dc['string'] = 'foo'
   df_dc.ix[4:6, 'string'] = np.nan
   df_dc.ix[7:9, 'string'] = 'bar'
   df_dc['string2'] = 'cool'
   df_dc

   # on-disk operations
   store.append('df_dc', df_dc, data_columns = ['B','C','string','string2'])
   store.select('df_dc', [ Term('B>0') ])

   # getting creative
   store.select('df_dc', [ 'B > 0', 'C > 0', 'string == foo' ])

   # this is the in-memory version of this type of selection
   df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == 'foo')]

   # we have automagically created this index, and the B/C/string/string2 columns are stored separately as ``PyTables`` columns
   store.root.df_dc.table

There is some performance degradation by making lots of columns into `data columns`, so it is up to the user to designate these. In addition, you cannot change data columns (nor indexables) after the first append/put operation (of course you can simply read in the data and create a new table!).
Advanced Queries
~~~~~~~~~~~~~~~~

**Unique**

To retrieve the *unique* values of an indexable or data column, use the method ``unique``. This will, for example, enable you to get the index very quickly. Note that ``nan`` values are excluded from the result set.

.. ipython:: python

   store.unique('df_dc', 'index')
   store.unique('df_dc', 'string')

**Replicating or**

``not`` and ``or`` conditions are unsupported at this time; however, ``or`` operations are easy to replicate, by repeatedly applying the criteria to the table, and then ``concat``-ing the results.

.. ipython:: python

   crit1 = [ Term('B>0'), Term('C>0'), Term('string=foo') ]
   crit2 = [ Term('B<0'), Term('C>0'), Term('string=foo') ]

   concat([ store.select('df_dc', c) for c in [ crit1, crit2 ] ])

**Table Object**

If you want to inspect the table object, retrieve it via ``get_table``. You could use this programmatically to, say, get the number of rows in the table.

.. ipython:: python

   store.get_table('df_dc').nrows
Multiple Table Queries
~~~~~~~~~~~~~~~~~~~~~~

New in 0.10.1 are the methods ``append_to_multiple`` and ``select_as_multiple``, which can perform appending/selecting from multiple tables at once. The idea is to have one table (call it the selector table) in which you index most/all of the columns, and perform your queries. The other table(s) are data tables, indexed the same as the selector table. You can then perform a very fast query on the selector table, yet get lots of data back. This method works similarly to having a very wide table, but is more efficient in terms of queries.

Note, **THE USER IS RESPONSIBLE FOR SYNCHRONIZING THE TABLES**. This means: append to the tables in the same order. ``append_to_multiple`` splits a single object into multiple tables, given a specification (as a dictionary). This dictionary is a mapping of the table names to the 'columns' you want included in that table. Pass ``None`` for a single table (optional) to let it have the remaining columns. The argument ``selector`` defines which table is the selector table.

.. ipython:: python

   df_mt = DataFrame(randn(8, 6), index=date_range('1/1/2000', periods=8),
                     columns=['A', 'B', 'C', 'D', 'E', 'F'])
   df_mt['foo'] = 'bar'

   # you can also create the tables individually
   store.append_to_multiple({ 'df1_mt' : ['A','B'], 'df2_mt' : None }, df_mt, selector = 'df1_mt')
   store

   # individual tables were created
   store.select('df1_mt')
   store.select('df2_mt')

   # as a multiple
   store.select_as_multiple(['df1_mt','df2_mt'], where = [ 'A>0','B>0' ], selector = 'df1_mt')
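The splitting that the dictionary specification describes can be sketched in plain Python; ``split_columns`` is a hypothetical helper for illustration, not part of the pandas API:

```python
def split_columns(spec, all_columns):
    """Assign columns to tables per a {table: columns-or-None} spec;
    the single None entry receives all remaining columns."""
    claimed = [c for cols in spec.values() if cols is not None for c in cols]
    remainder = [c for c in all_columns if c not in claimed]
    return {table: (cols if cols is not None else remainder)
            for table, cols in spec.items()}

result = split_columns({'df1_mt': ['A', 'B'], 'df2_mt': None},
                       ['A', 'B', 'C', 'D', 'E', 'F', 'foo'])
# df1_mt gets A and B; df2_mt gets the remaining columns
```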
Delete from a Table
~~~~~~~~~~~~~~~~~~~

You can delete from a table selectively by specifying a ``where``. In deleting rows, it is important to understand that ``PyTables`` deletes rows by erasing the rows, then **moving** the following data. Thus deleting can potentially be a very expensive operation, depending on the orientation of your data. This is especially true in higher dimensional objects (``Panel`` and ``Panel4D``). To get optimal deletion speed, it pays to have the dimension you are deleting be the first of the ``indexables``.

@@ -1184,6 +1323,33 @@ It should be clear that a delete operation on the ``major_axis`` will be fairly

   store.remove('wp', 'major_axis>20000102')
   store.select('wp')

Please note that HDF5 **DOES NOT RECLAIM SPACE** in the h5 files automatically. Thus, repeatedly deleting (or removing nodes) and adding again **WILL TEND TO INCREASE THE FILE SIZE**. To *clean* the file, use ``ptrepack`` (see below).
Compression
~~~~~~~~~~~

``PyTables`` allows the stored data to be compressed. This applies to all kinds of stores, not just tables.

- Pass ``complevel=int`` for a compression level (1-9, with 0 being no compression, and the default)
- Pass ``complib=lib`` where lib is any of ``zlib, bzip2, lzo, blosc`` for whichever compression library you prefer.

``HDFStore`` will use the file-based compression scheme if no overriding ``complib`` or ``complevel`` options are provided. ``blosc`` offers very fast compression, and is my most used. Note that ``lzo`` and ``bzip2`` may not be installed (by Python) by default.

Compression for all objects within the file:

- ``store_compressed = HDFStore('store_compressed.h5', complevel=9, complib='blosc')``

Or on-the-fly compression (this only applies to tables). You can turn off file compression for a specific table by passing ``complevel=0``:

- ``store.append('df', df, complib='zlib', complevel=5)``

**ptrepack**

``PyTables`` offers better write performance when tables are compressed after writing, as opposed to turning on compression at the very beginning. You can use the supplied ``PyTables`` utility ``ptrepack`` for this. In addition, ``ptrepack`` can change compression levels after the fact.

- ``ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5``

Furthermore, ``ptrepack in.h5 out.h5`` will *repack* the file to allow you to reuse previously deleted space (alternatively, one can simply remove the file and write again).
Notes & Caveats
~~~~~~~~~~~~~~~

@@ -1216,14 +1382,9 @@ Performance

- ``Tables`` come with a writing performance penalty as compared to regular stores. The benefit is the ability to append/delete and query (potentially very large amounts of data). Write times are generally longer as compared with regular stores. Query times can be quite fast, especially on an indexed axis.
- You can pass ``chunksize=an integer`` to ``append`` to change the writing chunksize (default is 50000). This will significantly lower your memory usage on writing.
- You can pass ``expectedrows=an integer`` to the first ``append`` to set the TOTAL number of rows that ``PyTables`` will expect. This will optimize read/write performance.
- Duplicate rows can be written to tables, but are filtered out in selection (with the last items being selected; thus a table is unique on major, minor pairs)
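The effect of ``chunksize`` can be sketched in plain Python; ``write_rows`` below is a hypothetical stand-in for the underlying ``PyTables`` write, not part of the pandas API:

```python
def append_in_chunks(rows, write_rows, chunksize=50000):
    """Write rows in fixed-size chunks to bound peak memory use,
    mirroring what the chunksize keyword to append does."""
    for start in range(0, len(rows), chunksize):
        write_rows(rows[start:start + chunksize])

chunks = []
append_in_chunks(list(range(10)), chunks.append, chunksize=4)
# three chunks: sizes 4, 4, and 2
```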
Experimental
~~~~~~~~~~~~
