Updating HDFStore in place #6857

Open
rockg opened this issue Apr 10, 2014 · 13 comments
Labels
Enhancement IO HDF5 read_hdf, HDFStore

Comments

@rockg
Contributor

rockg commented Apr 10, 2014

Currently I use the HDF5 interface to store timeseries and it works great for selection. However, the only way I see to update them is to select the existing stored timeseries and then merge the update with the existing data. This can obviously be expensive if the existing series is large. Is it possible to update a store by just passing in the new data? If so, is there an example somewhere?

@jreback
Contributor

jreback commented Apr 10, 2014

The modify methods are not implemented in HDFStore; see the PyTables ones here: http://pytables.github.io/usersguide/libref/structured_storage.html#table-methods-writing

It's not conceptually hard to do this, though there might be some details, as the data needs to be coerced in the same way as when writing. Note that the data has to be exactly the same dtype (otherwise PyTables will raise), as it is simply an overwrite.

If you have a small amount of data to change relative to a big set, then it does make sense.

I don't do this myself, as simply writing the data again makes for 'cleaner' stores (from a code perspective) and makes the store effectively read-only.

Want to take a stab?

@jreback jreback added this to the 0.15.0 milestone Apr 10, 2014
@rockg
Contributor Author

rockg commented Apr 10, 2014

Sure, I think this would be a great thing to have. Right now loading/rewriting 90k points to update 24 points doesn't make much sense.

@jreback
Contributor

jreback commented Apr 10, 2014

gr8!

I think maybe a signature something like:

store.modify(key, value, indexer) makes sense

value is a frame, the same length as the indexer
indexer is an index (or convertible to an index) of the ROWS (e.g. Int64Index([10,30,31]))
and you can use modify_coordinates(indexer, rows) directly (after you convert the value to the rows).

You get the indexer by effectively doing a select_as_coordinates(where), so I presume that's what
you'd do, e.g. select something, modify it, then write it back.

I would start simple and make the user pass the coordinates back in.
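
For reference, here is a minimal sketch of the underlying PyTables calls. It is not the proposed store.modify, which does not exist. It assumes a hand-built two-column table named ts as a stand-in for a stored series; the file name, column names, and values are all illustrative, and the layout that pandas itself writes is different.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical row layout standing in for a stored time series.
row_dtype = np.dtype([("index", "i8"), ("value", "f8")])
data = np.zeros(10, dtype=row_dtype)
data["index"] = np.arange(10)
data["value"] = np.arange(10, dtype="f8")

path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with tables.open_file(path, mode="w") as h5:
    table = h5.create_table("/", "ts", description=row_dtype)
    table.append(data)
    table.flush()

    # The "indexer": coordinates of the rows that match a condition.
    coords = table.get_where_list("(index >= 3) & (index < 6)", sort=True)

    # Read those rows, change the values, and overwrite them in place.
    # The rows passed back must keep exactly the table's dtype.
    rows = table.read_coordinates(coords)
    rows["value"] = [100.0, 101.0, 102.0]
    n_modified = table.modify_coordinates(coords, rows)
    table.flush()

    values = table.col("value")
```

Only the three selected rows are rewritten; the rest of the table is never loaded.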

@rockg
Contributor Author

rockg commented Apr 10, 2014

And I think it would be nice to have some logic here to append if the data doesn't exist, so that only one update method needs to be called and it will append or modify as appropriate.

@jreback
Contributor

jreback commented Apr 10, 2014

Make a modify method first; then we can think about adding the append/modify logic.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@spitz-dan-l

Hello, I'm wondering if there's been any progress on adding .modify() to HDFStore. It'd be such a boon for me. Thanks!

@rockg
Contributor Author

rockg commented Jun 18, 2015

Unfortunately there hasn't. I agree that it would be great to add. I have something started, but it's not even worth pushing anywhere at this point.

@jmerkow

jmerkow commented Apr 9, 2016

I would be interested in this too. But this seems dead.
Does anyone have a good work-around? Right now I believe you would need to delete the rows with store.remove, then add them back with store.append. Or rewrite the whole table (in which case you would probably just write a new file).
Is this feature available on any of the other IO options (possibly sql)?
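
A rough sketch of that remove-then-append work-around, assuming a store written in table format under a hypothetical key ts; the data, key, and file name are made up for illustration. The replaced rows land at the end of the table, so the result is sorted after reading.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: ten daily points, three of which get replaced.
index = pd.date_range("2014-01-01", periods=10, freq="D")
existing = pd.DataFrame({"value": range(10)}, index=index, dtype="float64")
update = pd.DataFrame({"value": [100.0, 101.0, 102.0]}, index=index[3:6])

path = os.path.join(tempfile.mkdtemp(), "store.h5")
with pd.HDFStore(path, mode="w") as store:
    # append() writes the queryable table format.
    store.append("ts", existing)

    # Delete the rows covered by the update, then append the new values.
    first, last = update.index[0], update.index[-1]
    store.remove("ts", where="(index >= first) & (index <= last)")
    store.append("ts", update)

    # Appended rows are stored last, so restore the time order on read.
    result = store.select("ts").sort_index()
```

This avoids loading the full series, but the deleted rows still leave the file fragmented until it is repacked.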

@jreback
Contributor

jreback commented Apr 9, 2016

See my comment above: you can implement the modify methods. Note this is not very efficient, but it will work. An HDF5 store works best when you only append; deleting or modifying calls for repacking the file (it isn't strictly required, but it is much more efficient that way).

@jmerkow

jmerkow commented Apr 9, 2016

Yes I saw that. It seems that HDF5 would not work well for large data flows where the data is modified frequently even if someone were to write store.modify.
Do any of the other IO options already have methods to modify the table? Would it be more appropriate to add this feature to SQL IO? If you can give me some pointers, I can see how plausible it would be to make this contribution.

P.S. I can take this to the mailing list if you prefer.

@jreback
Contributor

jreback commented Apr 9, 2016

No, this is specifically for HDF5. You generally don't want to modify data; you simply append. That's how it's designed, and it's what makes it performant. You can already do this with SQL if you really want.

@jmerkow

jmerkow commented Apr 9, 2016

You can already do this with SQL if you really want.

Can you point me to this?

@jreback
Contributor

jreback commented Apr 9, 2016

http://pandas.pydata.org/pandas-docs/stable/io.html#io-sql

you would simply do some kind of an update query.
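
As an illustration of such an update query, here is a small sketch using Python's built-in sqlite3 module rather than any pandas method. The table name ts, its columns, and the values are assumptions made up for the example.

```python
import sqlite3

# In-memory database with a hypothetical time series table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ts (stamp TEXT PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO ts VALUES (?, ?)",
    [(f"2014-01-{day:02d}", float(day)) for day in range(1, 11)],
)

# Update only the affected rows; nothing else is read or rewritten.
updates = [(100.0, "2014-01-04"), (101.0, "2014-01-05")]
conn.executemany("UPDATE ts SET value = ? WHERE stamp = ?", updates)
conn.commit()

rows = dict(conn.execute("SELECT stamp, value FROM ts ORDER BY stamp"))
conn.close()
```

The updated table can then be read back into a frame with pandas.read_sql.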

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022