ENH: added support for data column queries #2561

Closed · wants to merge 35 commits

Commits (35)
9b0aac0
ENH/BUG/DOC: added support for data column queries (can construct sea…
jreback Dec 18, 2012
9408d59
removed conf.py paths
jreback Dec 19, 2012
5c7e849
BUG: support multiple data columns that are in the same block (e.g. t…
jreback Dec 19, 2012
c749c18
ENH: correctly interpret data column dtypes and raise NotImplementedE…
jreback Dec 19, 2012
2927768
ENH: automagically created indicies (controlled by kw index=True/Fals…
jreback Dec 19, 2012
97bdb5c
DOC: minor doc updates and use cases
jreback Dec 19, 2012
af43f71
ENH/DOC: updated docs for compression
jreback Dec 19, 2012
ce6a7a9
ENH: export of get_store context manager in __init__ for pandas
jreback Dec 20, 2012
0180e79
DOC: doc updates for multi-index & start/stop
jreback Dec 20, 2012
c3e580e
DOC: added whatsnew 0.10.1
jreback Dec 20, 2012
3d75a3e
DOC: minor RELEAST.rst addition
jreback Dec 20, 2012
88a06e2
DOC: docstring updates
jreback Dec 20, 2012
91526a3
DOC: RELEASE notes updates
jreback Dec 20, 2012
2570a3b
DOC: io.rst example for multi-index frame was propgating, making next…
jreback Dec 20, 2012
a780c4c
BUG: reworked versioning to only act on specific version
jreback Dec 21, 2012
73d7554
BUG: more robust to whitespace in Terms
jreback Dec 21, 2012
dcbc020
BUG: make Term more robust to whitespace and syntax
jreback Dec 21, 2012
81aaa7c
BUG: versioning issue bug!
jreback Dec 21, 2012
04a1aa9
ENH: added column filtering via keyword 'columns' passed to select
jreback Dec 22, 2012
1c32ebf
ENH: allow multiple table selection. retrieve multiple tables based o…
jreback Dec 22, 2012
c314534
BUG: renamed method select_multiple -> select_as_multiple
jreback Dec 23, 2012
cbbae3d
ENH: added append_to_multiple, to support multiple table creation
jreback Dec 23, 2012
228df0b
removed paths from conf.py
jreback Dec 23, 2012
aafe311
DOC: minor doc updates/typos
jreback Dec 23, 2012
47b0ad4
DOC: minor doc updates 2
jreback Dec 23, 2012
3cdc0cd
BUG: added datetime64 support in columns
jreback Dec 23, 2012
6c2dd27
BUG: updated tests for datetim64 detection in columns
jreback Dec 23, 2012
a130c62
removed paths from conf.py
jreback Dec 23, 2012
1a3301c
BUG/TST: min_itemsize not working on data_columns, added more tests
jreback Dec 26, 2012
2e3a3c6
BUG: performance issue with reconsituting string arrays
jreback Dec 26, 2012
a602839
ENH: allow index=list of columns or True/False/None to guide index cr…
jreback Dec 26, 2012
6bac894
BUG: minor change in way expectedrows works (better defaults)
jreback Dec 26, 2012
e078ead
ENH: added unique method to store, for selectin unique values in an i…
jreback Dec 27, 2012
6c58bf7
CLN: removed keywork 'compression' from put (replaced by complib), to…
jreback Dec 27, 2012
17b6c0d
BUG: updated with smaller legacy_0.10.h5 file
jreback Dec 28, 2012
RELEASE.rst (37 additions, 0 deletions)

@@ -22,6 +22,43 @@ Where to get it
* Binary installers on PyPI: http://pypi.python.org/pypi/pandas
* Documentation: http://pandas.pydata.org

pandas 0.10.1
=============

**Release date:** 2013-??-??

**New features**

**Improvements to existing features**

- ``HDFStore``
- enables storing of multi-index dataframes (closes GH1277_)
- support data column indexing and selection, via ``data_columns`` keyword in append
- support write chunking to reduce memory footprint, via ``chunksize`` keyword to append
- support automagic indexing via ``index`` keyword to append
- support ``expectedrows`` keyword in append to inform ``PyTables`` about the expected table size
- support ``start`` and ``stop`` keywords in select to limit the row selection space
- added ``get_store`` context manager (now exported at the top level of pandas)
- added column filtering via ``columns`` keyword in select
- added methods ``append_to_multiple``/``select_as_multiple``/``select_as_coordinates`` to do multiple-table append/selection
- added support for datetime64 in columns
- added method ``unique`` to select the unique values in an indexable or data column

**Bug fixes**

- ``HDFStore``
- correctly handle ``nan`` elements in string columns; serialize via the ``nan_rep`` keyword to append
- raise correctly on non-implemented column types (unicode/date)
- correctly handle types passed in a ``Term`` (e.g. ``index<1000``, when index is ``Int64``) (closes GH512_)

**API Changes**

- ``HDFStore``
- removed keyword ``compression`` from ``put`` (replaced by keyword ``complib`` to be consistent across library)

.. _GH512: https://github.com/pydata/pandas/issues/512
.. _GH1277: https://github.com/pydata/pandas/issues/1277

pandas 0.10.0
=============

doc/source/io.rst (175 additions, 14 deletions)

@@ -1030,6 +1030,17 @@ Deletion of the object specified by the key
del store['wp']

store

Closing a Store
~~~~~~~~~~~~~~~

.. ipython:: python


# closing a store
store.close()

# working with, and automatically closing, the store via the context manager
with get_store('store.h5') as store:
    store.keys()

.. ipython:: python
:suppress:
@@ -1095,14 +1106,19 @@ Storing Mixed Types in a Table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Storing mixed-dtype data is supported. Strings are stored as fixed-width using the maximum size of the appended column. Subsequent appends will truncate strings at this length.
Passing ``min_itemsize = { `values` : size }`` as a parameter to append will set a larger minimum for the string columns. Storing ``floats, strings, ints, bools, datetime64`` is currently supported. For string columns, passing ``nan_rep = 'my_nan_rep'`` to append will change the default nan representation on disk (which converts to/from ``np.nan``); this defaults to ``nan``.

.. ipython:: python

df_mixed = df.copy()
df_mixed['string'] = 'string'
df_mixed['int'] = 1
df_mixed['bool'] = True
df_mixed['datetime64'] = Timestamp('20010102')

# make sure that we have datetime64[ns] types
df_mixed = df_mixed.convert_objects()
df_mixed.ix[3:5,['A','B','string','datetime64']] = np.nan

store.append('df_mixed', df_mixed, min_itemsize = { 'values' : 50 })
df_mixed1 = store.select('df_mixed')
@@ -1112,10 +1128,33 @@ Passing ``min_itemsize = { `values` : size }`` as a parameter to append will set
# we have provided a minimum string column size
store.root.df_mixed.table

It is ok to store ``np.nan`` in a ``float`` or ``string`` column. Make sure to do a ``convert_objects()`` on the frame before storing a ``np.nan`` in a datetime64 column. Storing an ``int`` or ``bool`` column containing ``np.nan`` will currently throw an ``Exception``, as such columns will have been converted to ``object`` type.
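
A brief standalone sketch of this caveat (the exact exception type raised by ``HDFStore`` is not asserted here, only that the append fails):

.. code-block:: python

    import numpy as np
    from pandas import Series, DataFrame, HDFStore

    # a bool column containing np.nan is coerced to object dtype,
    # which the table format cannot serialize
    df_bad = DataFrame({'bool': Series([True, np.nan, False])})
    df_bad.dtypes                      # note: 'bool' column is object

    store = HDFStore('caveat.h5')
    try:
        store.append('df_bad', df_bad)
    except Exception as e:
        print('object-typed bool column raised: %s' % e)
    store.close()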

Storing Multi-Index DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Storing multi-index dataframes as tables is very similar to storing/selecting from homogeneous index DataFrames.

.. ipython:: python

index = MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
['one', 'two', 'three']],
labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
[0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
names=['foo', 'bar'])
df_mi = DataFrame(np.random.randn(10, 3), index=index,
columns=['A', 'B', 'C'])
df_mi

store.append('df_mi',df_mi)
store.select('df_mi')

# the levels are automatically included as data columns
store.select('df_mi', Term('foo=bar'))


Querying a Table
~~~~~~~~~~~~~~~~

``select`` and ``delete`` operations have an optional criterion that can be specified to select/delete only
a subset of the data. This allows one to have a very large on-disk table and retrieve only a portion of the data.

@@ -1128,7 +1167,7 @@ Valid terms can be created from ``dict, list, tuple, or string``. Objects can be

- ``dict(field = 'index', op = '>', value = '20121114')``
- ``('index', '>', '20121114')``
- ``'index > 20121114'``
- ``('index', '>', datetime(2012,11,14))``
- ``('index', ['20121114','20121115'])``
- ``('major_axis', '=', Timestamp('2012/11/14'))``
@@ -1143,14 +1182,30 @@ Queries are built up using a list of ``Terms`` (currently only **anding** of ter
store
store.select('wp',[ Term('major_axis>20000102'), Term('minor_axis', '=', ['A','B']) ])
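
The term forms listed above are interchangeable; a minimal sketch, reusing the ``wp`` table from the example above (the equivalence of the tuple and string forms is taken from the list in this section):

.. code-block:: python

    # tuple and string forms of the same criterion
    q_tuple = Term('major_axis', '>', '20000102')
    q_string = Term('major_axis > 20000102')

    # either form can be passed in the list of terms to select
    store.select('wp', [q_tuple, Term('minor_axis', '=', ['A', 'B'])])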

The ``columns`` keyword can be supplied to ``select`` to filter the returned columns; this is equivalent to passing ``Term('columns', list_of_columns_to_filter)``.

.. ipython:: python

store.select('df', columns = ['A','B'])

The ``start`` and ``stop`` parameters can be specified to limit the total search space. These are in terms of the total number of rows in a table.

.. ipython:: python

# this is effectively what the storage of a Panel looks like
wp.to_frame()

# limiting the search
store.select('wp',[ Term('major_axis>20000102'), Term('minor_axis', '=', ['A','B']) ], start=0, stop=10)


Indexing
~~~~~~~~
You can create/modify an index for a table with ``create_table_index`` after data is already in the table (after an ``append/put`` operation). Creating a table index is **highly** encouraged. This will speed your queries a great deal when you use a ``select`` with the indexed dimension as the ``where``. **Indexes are automagically created (starting in 0.10.1)** on the indexables and any data columns you specify. This behavior can be turned off by passing ``index=False`` to ``append``.

.. ipython:: python

# we have automagically already created an index (in the first section)
i = store.root.df.table.cols.index.index
i.optlevel, i.kind

@@ -1160,6 +1215,90 @@ You can create an index for a table with ``create_table_index`` after data is al
i.optlevel, i.kind
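
If automatic index creation was turned off, the index can still be built explicitly afterwards; a sketch, reusing the ``df`` frame from earlier and a hypothetical key ``df_no_index``:

.. code-block:: python

    # append without building an index, then index explicitly
    store.append('df_no_index', df, index=False)
    store.create_table_index('df_no_index')

    i = store.root.df_no_index.table.cols.index.index
    i.optlevel, i.kind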


Query via Data Columns
~~~~~~~~~~~~~~~~~~~~~~
You can designate (and index) certain columns that you want to be able to perform queries on (other than the `indexable` columns, which you can always query). For instance, say you want to perform this common operation, on-disk, and return just the frame that matches this query.

.. ipython:: python

df_dc = df.copy()
df_dc['string'] = 'foo'
df_dc.ix[4:6,'string'] = np.nan
df_dc.ix[7:9,'string'] = 'bar'
df_dc['string2'] = 'cool'
df_dc

# on-disk operations
store.append('df_dc', df_dc, data_columns = ['B','C','string','string2'])
store.select('df_dc',[ Term('B>0') ])

# getting creative
store.select('df_dc',[ 'B > 0', 'C > 0', 'string == foo' ])

# this is the in-memory version of this type of selection
df_dc[(df_dc.B > 0) & (df_dc.C > 0) & (df_dc.string == 'foo')]

# we have automagically created this index; note that the B/C/string/string2 columns are stored separately as ``PyTables`` columns
store.root.df_dc.table

There is some performance degradation by making lots of columns into `data columns`, so it is up to the user to designate these. In addition, you cannot change data columns (nor indexables) after the first append/put operation (of course you can simply read in the data and create a new table, as sketched below!)
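
A minimal sketch of that workaround (the new ``data_columns`` list here is illustrative only):

.. code-block:: python

    # data columns cannot be changed after the first append; instead,
    # read the data back, remove the table, and append with a new spec
    df_tmp = store.select('df_dc')
    store.remove('df_dc')
    store.append('df_dc', df_tmp, data_columns=['B', 'string'])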

Advanced Queries
~~~~~~~~~~~~~~~~

**Unique**

To retrieve the *unique* values of an indexable or data column, use the method ``unique``. This will, for example, enable you to get the index very quickly. Note that ``nan`` values are excluded from the result set.

.. ipython:: python

store.unique('df_dc','index')
store.unique('df_dc','string')

**Replicating or**

``not`` and ``or`` conditions are unsupported at this time; however, ``or`` operations are easy to replicate, by repeatedly applying the criteria to the table and then ``concat``-ing the results.

.. ipython:: python

crit1 = [ Term('B>0'), Term('C>0'), Term('string=foo') ]
crit2 = [ Term('B<0'), Term('C>0'), Term('string=foo') ]

concat([ store.select('df_dc',c) for c in [ crit1, crit2 ] ])

**Table Object**

If you want to inspect the table object, retrieve it via ``get_table``. You could use this programmatically to, say, get the number of rows in the table.

.. ipython:: python

store.get_table('df_dc').nrows

Multiple Table Queries
~~~~~~~~~~~~~~~~~~~~~~

New in 0.10.1 are the methods ``append_to_multiple`` and ``select_as_multiple``, which can perform appending/selecting from multiple tables at once. The idea is to have one table (call it the selector table) in which you index most/all of the columns and perform your queries. The other table(s) are data tables, indexed the same as the selector table. You can then perform a very fast query on the selector table, yet get lots of data back. This method works similarly to having a very wide table, but is more efficient in terms of queries.

Note, **THE USER IS RESPONSIBLE FOR SYNCHRONIZING THE TABLES**. This means: append to the tables in the same order. ``append_to_multiple`` splits a single object into multiple tables, given a specification (as a dictionary). This dictionary is a mapping of the table names to the 'columns' you want included in that table. Pass ``None`` for a single (optional) table to let it have the remaining columns. The argument ``selector`` defines which table is the selector table.

.. ipython:: python

df_mt = DataFrame(randn(8, 6), index=date_range('1/1/2000', periods=8),
columns=['A', 'B', 'C', 'D', 'E', 'F'])
df_mt['foo'] = 'bar'

# you can also create the tables individually
store.append_to_multiple({ 'df1_mt' : ['A','B'], 'df2_mt' : None }, df_mt, selector = 'df1_mt')
store

# individual tables were created
store.select('df1_mt')
store.select('df2_mt')

# as a multiple
store.select_as_multiple(['df1_mt','df2_mt'], where = [ 'A>0','B>0' ], selector = 'df1_mt')


Delete from a Table
~~~~~~~~~~~~~~~~~~~
You can delete from a table selectively by specifying a ``where``. In deleting rows, it is important to understand that ``PyTables`` deletes rows by erasing the rows, then **moving** the following data. Thus deleting can potentially be a very expensive operation depending on the orientation of your data. This is especially true in higher dimensional objects (``Panel`` and ``Panel4D``). To get optimal deletion speed, it pays to have the dimension you are deleting be the first of the ``indexables``.
@@ -1184,6 +1323,33 @@ It should be clear that a delete operation on the ``major_axis`` will be fairly
store.remove('wp', 'major_axis>20000102' )
store.select('wp')

Please note that HDF5 **DOES NOT RECLAIM SPACE** in the h5 files automatically. Thus, repeatedly deleting (or removing nodes) and adding again **WILL TEND TO INCREASE THE FILE SIZE**. To *clean* the file, use ``ptrepack`` (see below).

Compression
~~~~~~~~~~~
``PyTables`` allows the stored data to be compressed. This applies to all kinds of stores, not just tables.

- Pass ``complevel=int`` for a compression level (1-9; 0 disables compression and is the default)
- Pass ``complib=lib`` where lib is any of ``zlib, bzip2, lzo, blosc`` for whichever compression library you prefer.

``HDFStore`` will use the file-based compression scheme if no overriding ``complib`` or ``complevel`` options are provided. ``blosc`` offers very fast compression, and is the one I use most. Note that ``lzo`` and ``bzip2`` may not be installed (by Python) by default.

Compression for all objects within the file

- ``store_compressed = HDFStore('store_compressed.h5', complevel=9, complib='blosc')``

Or on-the-fly compression (this only applies to tables). You can turn off file compression for a specific table by passing ``complevel=0``

- ``store.append('df', df, complib='zlib', complevel=5)``

**ptrepack**

``PyTables`` offers better write performance when tables are compressed after they are written, as opposed to turning on compression at the very beginning. You can use the supplied ``PyTables`` utility ``ptrepack`` for this. In addition, ``ptrepack`` can change compression levels after the fact.

- ``ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc in.h5 out.h5``

Furthermore, ``ptrepack in.h5 out.h5`` will *repack* the file to allow you to reuse previously deleted space (alternatively, one can simply remove the file and write again).

Notes & Caveats
~~~~~~~~~~~~~~~

@@ -1216,14 +1382,9 @@ Performance

- ``Tables`` come with a writing performance penalty as compared to regular stores. The benefit is the ability to append/delete and query (potentially very large amounts of data).
Write times are generally longer as compared with regular stores. Query times can be quite fast, especially on an indexed axis.
- You can pass ``chunksize=an integer`` to ``append``, to change the writing chunksize (default is 50000). This will significantly lower your memory usage on writing.
- You can pass ``expectedrows=an integer`` to the first ``append``, to set the TOTAL number of rows that ``PyTables`` will expect. This will optimize read/write performance (see the sketch after this list).
- Duplicate rows can be written to tables, but are filtered out in selection (with the last items being selected; thus a table is unique on major, minor pairs)
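
A sketch of the two write-tuning keywords used together (``df_big`` and the sizes are illustrative only):

.. code-block:: python

    # write in 100,000-row chunks, declaring ~1,000,000 total rows up front
    store.append('df_big', df_big, chunksize=100000, expectedrows=1000000)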

Experimental
~~~~~~~~~~~~