Pytables selection enhancement & docs update for HDF5 tables #2264

jreback · 2012-11-15T20:16:53Z

added `__str__` (to do `__repr__`)
row removal in tables is much faster if rows are consecutive
added Term class, refactored Selection (this is backwards compatible)
Term is a concise way of specifying conditions for queries, e.g.
```
Term(dict(field = 'index', op = '>', value = '20121114'))
Term('index', '20121114')
Term('index', '>', '20121114')
Term('index', ['20121114','20121114'])
Term('index', datetime(2012,11,14))
Term('index>20121114')
```
updated tests for same
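
To make the accepted call forms concrete, here is a minimal sketch of how the variants above could be normalized into one `(field, op, value)` triple. `SimpleTerm` is a hypothetical stand-in written for illustration, not the `Term` class from this PR, and the default operator `=` for the two-argument form is an assumption.

```python
import re
from datetime import datetime

# Hypothetical illustration only; this is NOT the pandas Term implementation.
# Matches a single expression string such as 'index>20121114'.
_EXPR = re.compile(r"^\s*(\w+)\s*(<=|>=|!=|<|>|=)\s*(.+?)\s*$")


class SimpleTerm:
    """Normalize several call forms into a (field, op, value) triple."""

    def __init__(self, field, op=None, value=None):
        if isinstance(field, dict):
            # Form: SimpleTerm(dict(field=..., op=..., value=...))
            spec = field
            field, op, value = spec["field"], spec.get("op"), spec.get("value")
        elif op is None and value is None:
            # Form: SimpleTerm('index>20121114'), one expression string
            match = _EXPR.match(field)
            if match is None:
                raise ValueError(f"cannot parse term: {field!r}")
            field, op, value = match.groups()
        elif value is None:
            # Form: SimpleTerm('index', value), second argument is the value
            op, value = None, op
        self.field = field
        self.op = op or "="  # assumption: a missing operator means equality
        self.value = value

    def as_tuple(self):
        return (self.field, self.op, self.value)


# The call forms from the description, routed through the sketch.
terms = [
    SimpleTerm(dict(field="index", op=">", value="20121114")),
    SimpleTerm("index", "20121114"),
    SimpleTerm("index", ">", "20121114"),
    SimpleTerm("index", datetime(2012, 11, 14)),
    SimpleTerm("index>20121114"),
]
```

The expression-string form is parsed with a regular expression that lists the two-character operators first, so `>=` is not read as `>` followed by `=`.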

this should close GH #1996 (PyTables enhancements for selection)
added docs for HDF5 table in io.html
append on a table that didn't exist was failing (because the index_kind attribute, which may not exist, was tested first); fixed & added test
added create_table_index method to create indices on tables (which, btw, now works quite well, as Int64 indices are used instead of the Time64Col, which has a bug); includes a check on the pytables version requirement

this should close GH #698 (Add option to create indexes in HDFStore if user is using PyTables Pro / PyTables 2.3+)
added min_itemsize as a parameter to append; allows bigger default indexer columns upon table creation (even if you don't append something that big now but might later; this avoids the truncation issue)
incorporated 0.9.1 whatsnew docs for where & mask into Indexing Section of main docs
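
On the point above that row removal is much faster when rows are consecutive: one way to exploit this is to collapse the selected row numbers into contiguous runs, so each run can be removed with a single range-based call instead of one call per row. The helper below is a hypothetical sketch of that grouping step, not the code in this PR.

```python
def consecutive_runs(rows):
    """Collapse row numbers into half-open (start, stop) runs.

    Hypothetical helper for illustration; duplicates are ignored and
    the input does not need to be sorted.
    """
    runs = []
    for row in sorted(set(rows)):
        if runs and row == runs[-1][1]:
            # Row directly follows the current run: extend it by one.
            runs[-1] = (runs[-1][0], row + 1)
        else:
            # Gap found: start a new run covering just this row.
            runs.append((row, row + 1))
    return runs


# Six selected rows collapse into three ranges.
runs = consecutive_runs([3, 4, 5, 9, 10, 20])
```

Each `(start, stop)` pair could then be passed to a range-based removal such as PyTables' `Table.remove_rows(start, stop)`. Processing the runs from last to first keeps the earlier row numbers valid while rows are being deleted.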

1. added __str__ (to do __repr__)
2. row removal in tables is much faster if rows are consecutive
3. added Term class, refactored Selection (this is backwards compatible)
Term is a concise way of specifying conditions for queries, e.g.

    Term(dict(field = 'index', op = '>', value = '20121114'))
    Term('index', '20121114')
    Term('index', '>', '20121114')
    Term('index', ['20121114','20121114'])
    Term('index', datetime(2012,11,14))
    Term('index>20121114')

updated tests for same
this should close GH pandas-dev#1996


jreback · 2012-11-15T22:01:34Z

@Thisch thanks, that is fixed. Let me know if anything else is unclear. This functionality has mostly been in pandas for a while, but undocumented, so you can try it out.

…of index columns minimum size
changed pytables version test for indexing around a bit
added Col class to manage the column conversions
added alias to the Term class; you can specify the nominal indexers (e.g. index in DataFrame, major_axis/minor_axis or alias in Panel)
updated docs for pytables to reflect these changes
updated docs for indexing to incorporate whatsnew 0.9.1 for where and mask

jreback · 2012-11-17T03:11:11Z

@wesm I also have some preliminary work on speeding up table writes (with panels). Writing currently takes about 9.5s for 1M rows (e.g. a 6 x 1000 x 1000 panel); I made the code a lot simpler and used cython, which gets it down to about 6s. About half of the overhead is from pytables actually writing the data, the rest from creating a list of tuples (which pytables then turns into a recarray). I will probably be able to PR this next week.
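
For readers unfamiliar with the "list of tuples" step mentioned above, here is a rough pure-Python sketch of what flattening a panel into table rows could look like. The function name, the nested-list input, and the row layout (major label, minor label, then one value per item) are assumptions for illustration; the actual pandas code path is different and, as noted, is being moved to cython.

```python
def panel_to_rows(items, major_axis, minor_axis, values):
    """Flatten 3-D data into one tuple per (major, minor) pair.

    Hypothetical sketch: values[i][j][k] holds the value for
    items[i], major_axis[j], minor_axis[k].
    """
    n_items = len(items)
    rows = []
    for j, major in enumerate(major_axis):
        for k, minor in enumerate(minor_axis):
            # One row carries both index labels plus a value per item.
            row_values = tuple(values[i][j][k] for i in range(n_items))
            rows.append((major, minor) + row_values)
    return rows


# A tiny 2 x 2 x 2 example yields 4 rows of 2 labels + 2 values each.
rows = panel_to_rows(
    items=["a", "b"],
    major_axis=[0, 1],
    minor_axis=["x", "y"],
    values=[[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
)
```

A 6 x 1000 x 1000 panel produces 1M such tuples under this layout, which gives a sense of why building the list is a visible share of the write time.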

wesm · 2012-11-17T03:16:15Z

I think you can write whole blocks of data instead of going row by row and go much faster? @John-Colvin has worked on this I think

jreback · 2012-11-17T03:28:54Z

I had some discussions with John (waiting for some sample timings)

but I think there are really 2 cases here

if you want a searchable table you could write it in one shot (say blocks of columns). Selecting is a bit tricky, as you have to compute the value indices (e.g. select from major and minor, then compute the offsets for the values). It could be done, but this becomes a fixed table: you cannot easily append (and keep the ability to search).
the current approach, while slower in writing, allows searching and appending
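
The offset computation described above can be sketched as follows, assuming the values were written as one flat block in row-major order (all minor entries for the first major label, then the next, and so on). Both the layout and the function name are assumptions made for illustration.

```python
def value_offsets(major_axis, minor_axis, major_sel, minor_sel):
    """Return flat offsets into a block-written values array.

    Hypothetical sketch assuming row-major layout:
    offset = major_position * len(minor_axis) + minor_position.
    """
    major_pos = {label: i for i, label in enumerate(major_axis)}
    minor_pos = {label: i for i, label in enumerate(minor_axis)}
    n_minor = len(minor_axis)
    return [
        major_pos[major] * n_minor + minor_pos[minor]
        for major in major_sel
        for minor in minor_sel
    ]


# Select minor 'B' for the last two major labels of a 3 x 2 layout.
offsets = value_offsets(["d1", "d2", "d3"], ["A", "B"], ["d2", "d3"], ["B"])
```

Under this layout, adding a new minor label changes `len(minor_axis)` and therefore shifts every stored offset, which illustrates why such a table is effectively fixed once written.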

so I suppose some use cases might prefer having faster writing and still have a searching ability

I prefer to write my tables in small batches, but need to preserve searching (and reading is quite fast anyhow)

I suppose if there is enough interest we could support both approaches


jreback · 2012-11-24T18:09:04Z

closing this - going to put in a new PR soon that is a bit cleaner

jreback added 3 commits November 15, 2012 13:57

update the HDF5 documentation to support table operations in io.html

0fcae82

a store would fail if appending but the a put had not been done befor…

72f557e

…e (see test_append) this the result of incompatibility testing on the index_kind

jreback mentioned this pull request Nov 15, 2012

pytables selection enhancements (to close GH #1966) #2261

Closed

jreback added 2 commits November 15, 2012 16:45

added create_table_index to index tables

a1956cb

think about doing this automagically for tables

updated io.rst with some typos and docs for indices

f619462

jreback closed this Nov 24, 2012
