ENH: PyTables Enhancements for future #2391

Closed
jreback opened this issue Nov 29, 2012 · 12 comments
Labels
Enhancement; IO Data (IO issues that don't fit into a more specific label); IO HDF5 (read_hdf, HDFStore)

Comments

@jreback
Contributor

jreback commented Nov 29, 2012

open (not in any particular order)

  1. add support for other dtypes in table columns (datetime,date,unicode)
  2. Implement variable length strings in a parallel VLArray (and synchronize): Support a VLStringCol PyTables/PyTables#198
  3. revisit Term syntax - can we do better / more readability?
    3a. implement or in Terms (maybe use pyparsing like syntax)
  4. implement WORMTable
  5. one big area is to test whether data columns really are slower; it thus may make sense to make data columns = True the default (but not necessarily index them). see https://groups.google.com/forum/m/?fromgroups#!topic/pydata/cmw1F3OFJSc - see the end of this post for some perf tests, so this is prob not a good idea after all
  6. add export function, to export to different PyTables formats(an easy to read table for R (partially done), and output a GenericTable)
  7. provide better access to columns that are data_columns (as we can directly select them) - see read_column, expand this to the entire table (if possible), allows one to avoid selecting all columns in a table (and then reindexing), this works if columns argument is provided to select or inferred from the where.
  8. add out-of-core computation support (see my comment about 1/2 down in pandas converts int32 to int64 #622), this is partially supported now that we have an iterator (ENH: support iteration on returned results in select and select_as_multiple in HDFStore #3078)
  9. add a method to create a table structure (create_table)?, w/o actually appending, so don't have to add parms in each call to append.
  10. Support a better mechanism for table splitting Splitter? that a user can specify how to split (rather than a dict); then store this object, so can automatically recreate the resulting table (enable for both Storer and Table objects)
  11. Optimize table appending, I think we can do better! (GH PERF: HDFStore table writing performance improvements #3537) makes some improvements
  12. allow itemsize='truncate' to allow subsequent appends to proceed with string truncation (on specific columns)
  13. allow where in select_column, return a properly indexed Series, add option to include the index (use_index=True?)
  14. Better deal with a very long list as input to a Term, by running multiple or sub-queries
  15. Add support for column oriented tables, dep is carray, http://carray.pytables.org/docs/manual/
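
For item 14, one possible shape for the sub-query idea is sketched below. This is only an illustration under stated assumptions: `batched`, `select_in_batches` and the `select_fn` callable are hypothetical names, not part of HDFStore, and the in-memory `rows` list stands in for a stored table.

```python
from itertools import islice


def batched(values, size):
    """Yield successive lists of at most `size` items from `values`."""
    it = iter(values)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def select_in_batches(select_fn, values, batch_size=1000):
    """Run `select_fn` once per batch and merge the results.

    `select_fn(batch)` is a hypothetical callable returning the rows that
    match any value in `batch`; merging the per-batch results behaves like
    OR-ing the sub-queries together.
    """
    results = []
    for batch in batched(values, batch_size):
        results.extend(select_fn(batch))
    return results


# Stand-in for a table on disk: (index, value) rows.
rows = [(i, i * i) for i in range(10)]


def select_fn(batch):
    wanted = set(batch)
    return [row for row in rows if row[0] in wanted]


selected = select_in_batches(select_fn, [1, 3, 5, 7, 9], batch_size=2)
```

Results come back in batch order rather than table order, and a value repeated across batches would return its rows twice, so a real version would probably need a final sort or de-duplication step.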

done

  1. DONE (GH Pytables support for hierarchical keys #2401): access store paths via path notation / dot notation (GH BUG: issue in HDFStore with too many selectors in a where #2755)
  2. DONE (GH ENH: ndim tables in HDFStore (allow indexables to be passed in) #2497): add to docs (GH Different HDFStores in multiple threads crashes Python #2397) - issues about reading/writing concurrently in threads/processes
    http://sourceforge.net/mailarchive/message.php?msg_id=30190886
  3. DONE (GH ENH: ndim tables in HDFStore (allow indexables to be passed in) #2497): support panelnd (GH Panelnd #2242)
  4. DONE (GH ENH: added support for data column queries #2561): Should DataFrames be automagically indexed on 'index' (prob yes), but then should have a flag in append/put, and enable passing of the indexing options
  5. DONE (GH ENH: ndim tables in HDFStore (allow indexables to be passed in) #2497): Check if create_table_index changes the current index if different options are passed
  6. DONE (GH ENH: added support for data column queries #2561): for writing add chunk keyword to select to provide generator like behavior - each call to return the next chunk of data
  7. DONE (GH ENH: added support for data column queries #2561): support multi indexes on tables
    5a. DONE real dtype integration is coming on PR ENH/BUG/DOC: allow propogation and coexistance of numeric dtypes #2708 (eg even though 0.10.1 will actually read/write float32 columns u can't really do much with them w/o having them upcasted) - in any event I think HDFStore will accommodate this already. but more testing needed
  8. DONE iterator support in select, http://stackoverflow.com/questions/14614512/merging-two-tables-with-millions-of-rows-in-python (GH ENH: support iteration on returned results in select and select_as_multiple in HDFStore #3078)
  9. DONE (GH ENH: HDFStore enhancements #3531) support timezones in datelike columns (index should be ok already) (scott?), (GH PyTables dates don't work when you switch to a different time zone #2852)
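
The chunked/iterator select in the done list is what makes the out-of-core item in the open list feasible: an aggregate can be updated one chunk at a time without holding the whole table in memory. A minimal sketch, assuming the chunks arrive from any iterator; `fake_chunks` is a hypothetical stand-in for the iterator a chunked select would return, and plain lists stand in for DataFrame chunks.

```python
def fake_chunks(values, chunksize):
    """Hypothetical stand-in for a chunked select: yield slices of `values`."""
    for start in range(0, len(values), chunksize):
        yield values[start:start + chunksize]


def running_mean(chunks):
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no rows to average")
    return total / count


mean = running_mean(fake_chunks(list(range(1, 101)), chunksize=10))
```

The same pattern extends to any reduction that can be expressed as a running state (sum, count, min/max); operations that need all rows at once, such as a global sort, do not fit it directly.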
@gerigk

gerigk commented Nov 29, 2012

what about allowing creation/access of groups by using "/" in the key.

i.e.,

store.put('some/path/to/df', df)

would create/access the groups some, path, to and finally df.

Right now I can only save the data on one level within an hdf5 file
although HDF5/PyTables supports access by file system like paths.
It would not break anything since the occurrence of a '/' raises an
exception right now.
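
To make the proposal concrete, here is a toy sketch of the key handling, not actual HDFStore code. `split_key` and `NestedStore` are hypothetical names, and nested dicts stand in for HDF5 groups.

```python
def split_key(key):
    """Split 'some/path/to/df' into (['some', 'path', 'to'], 'df')."""
    parts = [p for p in key.split("/") if p]
    if not parts:
        raise KeyError("empty key")
    return parts[:-1], parts[-1]


class NestedStore:
    """Toy in-memory store; nested dicts stand in for HDF5 groups."""

    def __init__(self):
        self._root = {}

    def put(self, key, value):
        groups, leaf = split_key(key)
        node = self._root
        for name in groups:
            # Create intermediate groups on demand.
            node = node.setdefault(name, {})
        node[leaf] = value

    def get(self, key):
        groups, leaf = split_key(key)
        node = self._root
        for name in groups:
            node = node[name]
        return node[leaf]


toy_store = NestedStore()
toy_store.put("some/path/to/df", [1, 2, 3])
```

Here `toy_store.get("some/path/to/df")` walks the groups some, path and to before returning the leaf, which is the behaviour described above.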

On Thu, Nov 29, 2012 at 6:20 PM, jreback wrote:

  1. add support for other dtypes in table columns
    (datetime64,datetime,date,unicode)

  2. support min_itemsize for table columns (currently supported only in
    indexers) also might be a better way of doing this (e.g. have the info
    attached to a dataframe, or support a global pandas option to provide a
    minimum)

  3. revisit Term syntax - can we do better / more readability?

  4. implement WORMTable



@jreback
Contributor Author

jreback commented Nov 29, 2012

good idea...shouldn't be too hard to implement

@scottkidder

Here are things that are most interesting/beneficial to my current workload:

Full Float32 support & full pandas dtype support
WORMTable (unsure of implementation or performance gains)
data_columns is very useful and I can do more testing to determine how fast/slow they are.
read_column would also be very useful in many instances.

I like the way Terms work. Is there support for ORing Terms or other logical operations in the Selection?
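
A rough illustration of what composable conditions could look like, which is not the existing Term API: `Cond` and `Cond.term` are hypothetical names, and rows are plain dicts rather than a table on disk.

```python
import operator

_OPS = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


class Cond:
    """Hypothetical composable condition; not the pandas Term class."""

    def __init__(self, predicate):
        self._predicate = predicate

    @classmethod
    def term(cls, field, op, value):
        compare = _OPS[op]
        return cls(lambda row: compare(row[field], value))

    def __or__(self, other):
        return Cond(lambda row: self(row) or other(row))

    def __and__(self, other):
        return Cond(lambda row: self(row) and other(row))

    def __call__(self, row):
        return self._predicate(row)


rows = [{"index": i, "A": i % 3} for i in range(6)]
cond = Cond.term("index", "<", 2) | Cond.term("A", "=", 2)
matched = [row["index"] for row in rows if cond(row)]
```

This evaluates row by row in Python; pushing an OR down into the on-disk query would be a separate problem.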

I can pick up work on any of these issues, but I would absolutely like to discuss some of the details first.

@jreback
Contributor Author

jreback commented Jan 30, 2013

Scott, send me an email and I'll reply offline so we can correspond
[email protected]

@alvorithm

Term language: perhaps it makes sense to piggyback on existing syntax. SQL comes to mind, but also XESAM (the whole of http://xesam.org is down at the moment, but one can get the gist of it here: http://banshee.fm/support/guide/searching/).

@alvorithm

It would be nice if attribute access (e.g. store.df) could be enabled for all the leaves that have suitable names. This might require a big API overhaul, though (store.df.append ...).
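
A minimal sketch of the idea, assuming a dict-like store; `DottedStore` is a hypothetical toy class, not HDFStore, and it only covers reading leaves by attribute.

```python
class DottedStore:
    """Toy key/value store with attribute-style access to its leaves."""

    def __init__(self):
        self._items = {}

    def __setitem__(self, key, value):
        self._items[key] = value

    def __getitem__(self, key):
        return self._items[key]

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so real
        # attributes and methods always take precedence over stored keys.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._items[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Expose identifier-like keys so they show up in tab completion.
        names = [k for k in self._items if k.isidentifier()]
        return sorted(set(super().__dir__()) | set(names))


toy = DottedStore()
toy["df"] = [1, 2, 3]
leaf = toy.df
```

Keys that are not valid identifiers, or that collide with a method name, would still need the bracket form.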

@jreback
Contributor Author

jreback commented Feb 7, 2013

see #2485, this is actually somewhat easy in HDFStore, the problem is that pandas in general doesn't propagate these attributes; you can easily store/retrieve attributes if you want on the nodes themselves

something like:

s = store.get_storer('df')
s.attrs['my_attribute'] = 1

@jreback
Contributor Author

jreback commented Feb 7, 2013

sorry...misunderstood your comment....(thought you meant saving attributes)

attribute access on the store is not a big deal, will add to the list

@alvorithm

Thank you for considering this, dotted access will save my pinky a lot of strain [''] (dead keys b/c need accents...).

Regarding attributes on DFs, this would actually preempt a number of cases for specialization of DataFrame (see recent MetaDataFrame PR #2695) and in particular perhaps support the addition of metadata that would facilitate automated merges (foreign keys...).

EDIT: there was a discussion about this topic in the mailing list

@jreback
Contributor Author

jreback commented Feb 7, 2013

see #2755 , was pretty easy to add dotted access, so i did!

@jreback
Contributor Author

jreback commented Mar 12, 2013

@scottkidder did you get a chance to look at issue 13. #2852?

@jreback
Contributor Author

jreback commented Jul 25, 2016

dated

@jreback jreback closed this as completed Jul 25, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Someday Jul 26, 2016

5 participants