ENH: added support for data column queries #2561

jreback · 2012-12-18T23:06:59Z

can construct searches on the actual columns of the data, by passing keyword data_columns to append
e.g. store.select('df_dc',[ Term('B>0'), Term('string=foo') ])
(where B and string are columns in the frame)
added nan_rep for supporting string columns with NaNs in them
correctly determine dtypes of data columns that we cannot deal with (date, unicode), raise NotImplementedError
added support for datetime64 in columns
added method unique for fast retrieval of indexables or data columns
added keyword index=True to automagically create indices on all indexables and data columns!
performance enhancements on string columns
added chunksize parameter to append to allow write chunking, significantly lower memory usage on writes
added expectedrows parameter to append to allow specification of the TOTAL expected rows in a table (to optimize performance)
added start/stop parameters to select to allow limiting of the selection space
added multi-index dataframe support (closes GH #1277, "Support for hierarchical indexes in HDFStore with table=True")
added append_to_multiple, select_as_multiple and select_as_coordinates methods to support multiple-table creation & selection
removed compression kw from put, replaced with complib (to make more consistent across library)
Term parsing a little more robust
more tests & docs for data columns
added whatsnew 0.10.1 docs
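
To illustrate the "Term parsing a little more robust" item, here is a minimal sketch of whitespace-tolerant parsing for expressions such as 'B>0' or ' string = foo '. It is not the pandas Term implementation: the names ParsedTerm and parse_term, the regular expression, and the set of supported operators are all assumptions made for this example.

```python
import re
from typing import NamedTuple


class ParsedTerm(NamedTuple):
    """Hypothetical container for a parsed query term (not pandas' Term)."""
    field: str
    op: str
    value: str


# Two-character operators come first so that '>=' is not read as '>'.
# The operator set here is an assumption chosen for illustration.
_TERM_RE = re.compile(r"^\s*(\w+)\s*(<=|>=|!=|=|<|>)\s*(.+?)\s*$")


def parse_term(expr: str) -> ParsedTerm:
    """Split 'field op value' into its parts, ignoring surrounding whitespace."""
    match = _TERM_RE.match(expr)
    if match is None:
        raise ValueError(f"cannot parse term: {expr!r}")
    return ParsedTerm(*match.groups())


if __name__ == "__main__":
    print(parse_term("B>0"))
    print(parse_term(" string = foo "))
```

Under this sketch, 'B>0' and ' B > 0 ' parse to the same (field, op, value) triple, which is the kind of tolerance the whitespace-related commits in this PR describe.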

jreback · 2012-12-20T15:15:02Z

@wesm i think this is ready to merge...also...if at all possible, once merged, can put up a dev build for 2.7 amd-64 on the site for testing....

jreback · 2012-12-20T15:38:48Z

@wesm I believe these changes will allow you to close: GH #512, GH #1277
GH #698 and GH #2397 can also be closed (as of 0.10.0)

jreback · 2012-12-23T16:58:41Z

@wesm done adding things - ready to merge when u r

wesm · 2012-12-28T13:59:01Z

Hey jeff, one problem here is that the legacy test file is about 10 megs-- I don't want to bloat the size of the git repo or source archive if at all possible. fixing it may not be so bad using the interactive rebase approach described here:

http://stackoverflow.com/questions/2100907/how-to-purge-a-huge-file-from-commits-history-in-git

if you don't have time i could take a crack at it. will need a smaller test h5 file, though, i guess

jreback · 2012-12-28T14:14:15Z

I can make a smaller one, np

will repost in a few

jreback · 2012-12-28T14:32:08Z

posted a revised file much smaller now
thxs

jreback · 2012-12-28T14:58:33Z

ok I did the interactive rebase
then added the new file
had to force update so hopefully merge works ok

wesm · 2012-12-28T15:07:48Z

Oops. I had already done the rebase. All is good and merged now, closing the PR

jreback mentioned this pull request Dec 20, 2012

ENH: PyTables Enhancements for future #2391

Closed

jreback added 23 commits December 28, 2012 09:43

ENH/BUG/DOC: added support for data column queries (can construct sea…

9b0aac0

…rches on the actual columns of the data) added nan_rep for supporting string columns with nan's in them performance enhancements on string columns more tests & docs for data columns

removed conf.py paths

9408d59

BUG: support multiple data columns that are in the same block (e.g. t…

5c7e849

…he same type) e.g. self.store.select('df', [ Term('string', '=', 'foo'), Term('string2=foo'), Term('A>0'), Term('B<0') ])

ENH: correctly interpret data column dtypes and raise NotImplementedE…

c749c18

…rror (in cases of unicode/datetime64/date)

ENH: automagically created indicies (controlled by kw index=True/Fals…

2927768

…e passed to append/put)

DOC: minor doc updates and use cases

97bdb5c

ENH/DOC: updated docs for compression

af43f71

added parameter chunksize to append, now writing occurs in chunks, significantly reducing memory usage

DOC: doc updates for multi-index & start/stop

0180e79

DOC: added whatsnew 0.10.1

c3e580e

DOC: minor RELEASE.rst addition

3d75a3e

DOC: docstring updates

88a06e2

DOC: RELEASE notes updates

91526a3

DOC: io.rst example for multi-index frame was propagating, making next…

2570a3b

… examples confusing

BUG: reworked versioning to only act on specific version

a780c4c

BUG: more robust to whitespace in Terms

73d7554

BUG: make Term more robust to whitespace and syntax

dcbc020

BUG: versioning issue bug!

81aaa7c

ENH: added column filtering via keyword 'columns' passed to select

04a1aa9

ENH: allow multiple table selection. retrieve multiple tables based o…

1c32ebf

…n the results from a selector table. this allows one to potentially put the data you really want to index in a single table, and your actual (wide) data in another to speed queries

BUG: renamed method select_multiple -> select_as_multiple

c314534

renamed keyword 'columns' to 'data_columns' when passed to 'append' (to avoid confusion with 'columns' keyword in select)

ENH: added append_to_multiple, to support multiple table creation

cbbae3d

removed paths from conf.py

228df0b

jreback added 12 commits December 28, 2012 09:43

DOC: minor doc updates/typos

aafe311

DOC: minor doc updates 2

47b0ad4

BUG: added datetime64 support in columns

3cdc0cd

BUG: updated tests for datetime64 detection in columns

6c2dd27

removed paths from conf.py

a130c62

BUG/TST: min_itemsize not working on data_columns, added more tests

1a3301c

BUG: performance issue with reconstituting string arrays

2e3a3c6

changed to use simpler cython routine to avoid copying

ENH: allow index=list of columns or True/False/None to guide index cr…

a602839

…eation at append time

BUG: minor change in way expectedrows works (better defaults)

6bac894

ENH: added unique method to store, for selecting unique values in an i…

e078ead

…ndexable or data column w/o selecting the entire table

CLN: removed keyword 'compression' from put (replaced by complib), to…

6c58bf7

… make nomenclature consistent for compression. doc updates for compression

BUG: updated with smaller legacy_0.10.h5 file

17b6c0d

wesm added a commit that referenced this pull request Dec 28, 2012

Merge PR #2561

48434d2

wesm closed this Dec 28, 2012
