
Pytables selection enhancement & docs update for HDF5 tables #2264


Closed
wants to merge 6 commits into from


jreback
Contributor

@jreback jreback commented Nov 15, 2012

  1. added __str__ (to do __repr__)

  2. row removal in tables is much faster if rows are consecutive

  3. added Term class, refactored Selection (this is backwards compatible)
    Term is a concise way of specifying conditions for queries, e.g.

    Term(dict(field = 'index', op = '>', value = '20121114'))
    Term('index', '20121114')
    Term('index', '>', '20121114')
    Term('index', ['20121114','20121114'])
    Term('index', datetime(2012,11,14))
    Term('index>20121114')
    

    updated tests for same

    this should close GH PyTables enhancements for selection #1996

  4. added docs for HDF5 table in io.html

  5. append on a table that didn't exist was failing (because of testing of the index_kind attribute first - which may not exist)
    fixed & added test

  6. added create_table_index method to create indices on tables (which, btw, now works quite well, as Int64 indices are used as opposed to the Time64Col, which has a bug); includes a check on the pytables version requirement

    this should close GH Add option to create indexes in HDFStore if user is using PyTables Pro / PyTables 2.3+ #698

  7. added min_itemsize as a parameter to append; allows bigger default indexer columns upon table creation (even if you don't append something that big now but might later; this avoids the truncation issue)

  8. incorporated 0.9.1 whatsnew docs for where & mask into Indexing Section of main docs
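
The Term forms listed in item 3 all reduce to a (field, op, value) triple. The following is a minimal standalone sketch of that normalization, not the code in this PR: `parse_term` and `ParsedTerm` are hypothetical names, and defaulting the operator to `=` when only a field and a value are given is an assumption.

```python
import re
from collections import namedtuple

# Hypothetical container for a normalized query condition.
ParsedTerm = namedtuple("ParsedTerm", ["field", "op", "value"])

# Matches compact expressions such as 'index>20121114'.
_EXPR = re.compile(r"^\s*(\w+)\s*(<=|>=|!=|=|<|>)\s*(.+?)\s*$")


def parse_term(*args):
    """Normalize the call styles shown above into (field, op, value).

    Assumption: a missing operator means equality.
    """
    if len(args) == 1 and isinstance(args[0], dict):
        spec = args[0]
        return ParsedTerm(spec["field"], spec.get("op", "="), spec["value"])
    if len(args) == 1 and isinstance(args[0], str):
        match = _EXPR.match(args[0])
        if match is None:
            raise ValueError("cannot parse term: %r" % (args[0],))
        return ParsedTerm(*match.groups())
    if len(args) == 2:
        field, value = args
        return ParsedTerm(field, "=", value)
    if len(args) == 3:
        return ParsedTerm(*args)
    raise ValueError("unsupported term specification: %r" % (args,))
```

With this sketch, `parse_term('index', '>', '20121114')` and `parse_term('index>20121114')` produce the same triple, while a list or datetime value is passed through unchanged for the query layer to interpret.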

…e (see test_append)

this is the result of incompatibility testing on the index_kind
@jreback
Contributor Author

jreback commented Nov 15, 2012

@Thisch thanks, that is fixed. Let me know if anything else is unclear. This functionality has mostly been in pandas for a while, but undocumented, so you can try it out.

…of index columns minimum size

changed pytables version test for indexing around a bit
added Col class to manage the column conversions
added alias to the Term class; you can specify the nominal indexers (e.g. index in DataFrame, major_axis/minor_axis or alias in Panel)
updated docs for pytables to reflect these changes
updated docs for indexing to incorporate whatsnew 0.9.1 for where and mask
@jreback
Contributor Author

jreback commented Nov 17, 2012

@wesm I also have some preliminary work on speeding up table writes (with panels). Writing currently takes about 9.5s for 1M rows (e.g. a 6 x 1000 x 1000 panel); I made the code a lot simpler and used cython, which brings it down to about 6s. About 1/2 of the overhead is from pytables actually writing the data, the rest from creating a list of tuples (which is then turned into a recarray by pytables). I will probably be able to PR this next week.
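
To illustrate the row-building step mentioned above, here is a rough pure-Python sketch of how a panel could be flattened into one tuple per (major, minor) pair. It is not the code from this PR: `panel_to_rows` is a hypothetical helper and the layout `values[item][major][minor]` is an assumption.

```python
from itertools import product


def panel_to_rows(items, major_axis, minor_axis, values):
    """Flatten a 3-D panel into a list of row tuples.

    Assumes values[k][i][j] holds the value for item k at
    major position i and minor position j. Each row is
    (major_label, minor_label, value_for_item_0, value_for_item_1, ...).
    """
    rows = []
    for (i, major), (j, minor) in product(enumerate(major_axis),
                                          enumerate(minor_axis)):
        item_values = tuple(values[k][i][j] for k in range(len(items)))
        rows.append((major, minor) + item_values)
    return rows
```

Under this layout a 6 x 1000 x 1000 panel yields 1000 * 1000 = 1M rows, each carrying 6 values, which is the list of tuples that is then handed over for writing.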

@wesm
Member

wesm commented Nov 17, 2012

I think you can write whole blocks of data instead of going row by row and go much faster? @John-Colvin has worked on this I think

@jreback
Contributor Author

jreback commented Nov 17, 2012

I had some discussions with John (waiting for some sample timings)

but I think there are really 2 cases here

  1. if you want a searchable table you could write it in one shot (say blocks of columns). Selecting is a bit tricky, as you have to compute the value indices (e.g. select from major and minor, then compute the offsets for the values). It could be done, but this becomes a fixed table: you cannot easily append (and keep the ability to search).

  2. the current approach, while slower in writing, allows searching and appending

so I suppose some use cases might prefer having faster writing and still have a searching ability

I prefer to write my tables in small batches, but need to preserve searching (and reading is quite fast anyhow)

I suppose if there is enough interest we could support both approaches
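
The offset computation in case 1 could look roughly like the sketch below. It assumes the values are stored in row-major order with the major axis outermost; `row_offset` and `select_offsets` are hypothetical helpers, not part of pandas or PyTables.

```python
def row_offset(major_pos, minor_pos, n_minor):
    """Flat position of (major_pos, minor_pos) in a row-major block."""
    return major_pos * n_minor + minor_pos


def select_offsets(major_axis, minor_axis, major_sel, minor_sel):
    """Translate selected axis labels into flat offsets into the values."""
    major_pos = {label: i for i, label in enumerate(major_axis)}
    minor_pos = {label: j for j, label in enumerate(minor_axis)}
    n_minor = len(minor_axis)
    return [row_offset(major_pos[m], minor_pos[n], n_minor)
            for m in major_sel
            for n in minor_sel]
```

The offsets are only valid while the axes stay fixed, which is why appending to such a table would invalidate them.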


@jreback
Contributor Author

jreback commented Nov 24, 2012

closing this - going to put in a new PR soon that is a bit cleaner

@jreback jreback closed this Nov 24, 2012