BUG: read_hdf is not read-threadsafe #12236
@toobaz your solution 'works', but that is not a recommended way to do it at all. HDF5 by definition should only ever have a single writer. This helps with the SWMR case: https://www.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesSwmrDocs.html All that said, here's an example using a multi-proc threadpool; it needs testing with a regular threadpool as well; this currently segfaults.
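A minimal sketch of that kind of pattern (not the original example): several threads from a `multiprocessing.pool.ThreadPool` calling `pd.read_hdf` on the same file. The file path and keys are assumptions.

```python
import numpy as np
import pandas as pd
from multiprocessing.pool import ThreadPool

PATH = "store.h5"                        # hypothetical path
KEYS = ["df_0", "df_1", "df_2", "df_3"]  # hypothetical keys

def write_store():
    # Single writer, as recommended above.
    with pd.HDFStore(PATH, mode="w") as store:
        for key in KEYS:
            store.put(key, pd.DataFrame(np.random.randn(1000, 4)))

def read_group(key):
    # Concurrent calls like this are the pattern that reportedly segfaults.
    return pd.read_hdf(PATH, key)

if __name__ == "__main__":
    write_store()
    with ThreadPool(4) as pool:
        results = pool.map(read_group, KEYS)
```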
|
Yes this is a duplicate of #14692. I have a small contribution to this discussion:
It works without any problems in the code example above. It seems that as long as multiple connections are made to the same store and only one query is placed through each connection, there is no error. Looking at the example I provided in #14692 again, and also the original version, the fix may not involve placing locks on the store. |
I found another example of the issue. See #15274. Note that the segfault failure rate depends on whether or not the HDF file is compressed. |
I am running into this issue on both OS X and Linux (CentOS) while trying to parallelize using joblib, even though I am not doing any writing (the SWMR part). Is there a different way to get a read-only solution? Is this a pandas issue or an issue in the HDF5 C library? Thanks. In my application I am trying to read different groups on different processors, and I am running into this error.
The code which works when
The error message I get is:
|
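A hypothetical sketch of the pattern described in the comment above (reading different groups in parallel with joblib); the file path, group keys, and `n_jobs` are assumptions, not the original snippet.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

PATH = "data.h5"                           # hypothetical path
GROUPS = ["grp0", "grp1", "grp2", "grp3"]  # hypothetical group keys

def prepare():
    # Write each group once, from a single process.
    with pd.HDFStore(PATH, mode="w") as store:
        for key in GROUPS:
            store.put(key, pd.DataFrame(np.random.randn(10000, 5)))

def load(key):
    # Each worker reads a different group from the same file.
    return pd.read_hdf(PATH, key)

if __name__ == "__main__":
    prepare()
    frames = Parallel(n_jobs=4)(delayed(load)(key) for key in GROUPS)
```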
@jreback Interesting. I thought I had trouble with that and had hence moved to a different approach. Also, thank you for the pointer to the dask issues which discuss parallel write/save mechanisms. |
So I think I have just run into this issue, but in a slightly different use case. |
For the record, this is what I do. It is probably obvious, but just in case it could help anyone, here it is.

```python
import threading
from contextlib import contextmanager

import pandas as pd

HDF_LOCK = threading.Lock()
HDF_FILEPATH = '/my_file_path'

@contextmanager
def locked_store():
    # All access to the store goes through a single application-wide lock.
    with HDF_LOCK:
        with pd.HDFStore(HDF_FILEPATH) as store:
            yield store
```

Then

```python
def get(self, ts_id, t_start, t_end):
    # Method of a data-access class in the application.
    with locked_store() as store:
        where = 'index>=t_start and index<t_end'
        return store.select(ts_id, where=where)
```

If several files are used, a dict of locks (a `defaultdict`, for instance) can be used to avoid locking the whole application.
|
@lafrech you could even make HDF_FILEPATH just an argument to the context manager and make it more generic:

```python
@contextmanager
def locked_store(*args, lock=threading.Lock, **kwargs):
    with lock():
        with pd.HDFStore(*args, **kwargs) as store:
            yield store
```
|
@makmanalp, won't this create a new lock for each call (thus defeating the purpose of the lock)? We need to be sure that calls to the same file share the same lock and calls to different files don't. I don't see how this is achieved in your example. I assume I have to map the name, path or whatever identifier of the file to a lock instance. Or maybe I am missing something. |
Er, my mistake entirely - I was just trying to point out that one can make it more reusable. Should be something like:

```python
def make_locked_store(lock, filepath):
    @contextmanager
    def locked_store(*args, **kwargs):
        with lock:
            with pd.HDFStore(filepath, *args, **kwargs) as store:
                yield store
    return locked_store
```
|
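As a usage sketch (not from the thread), the per-file locking discussed above could build on `make_locked_store` with a `defaultdict` of locks; the paths and table key are hypothetical, and as the next comment reports, even per-file locks may not be safe in practice.

```python
import threading
from collections import defaultdict

# One lock per file path, created on first use.
LOCKS = defaultdict(threading.Lock)

def store_for(path):
    # Reuses make_locked_store from the snippet above.
    return make_locked_store(LOCKS[path], path)

# Calls on different files use different locks, so they do not block each other.
locked_a = store_for('/data/a.h5')
with locked_a() as store:
    df = store.select('my_table')
```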
I've been trying this and had issues. It looks like the whole HDF5 library is not thread-safe, and it fails even when accessing two different files concurrently. I didn't take the time to check that, so I may be totally wrong, but beware if you try it. I reverted my change, and I keep a single lock in the application rather than a lock per HDF5 file. Since HDF5 is hierarchical, I guess a lot of users put everything in a single file anyway. We ended up using lots of files mainly because HDF5 files tend to get corrupted, so when restoring daily backups we only lose data from a single file rather than from the whole database. |
I just got the very same problem with PyTables 3.4.4 and Pandas 0.24.1. |
Hi. In my project we have hundreds of HDF files; each file has many stores, and each store is about 300000x400 cells. These are loaded selectively by our application, both when developing locally and by our dev/test/prod instances. They are all stored in a Windows shared folder. Reading from a network share is slow, so it was a natural fix to have my code read different stores in parallel.
Someone suggested protecting the calls with a lock, along the lines of the sketch below.
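A hedged sketch of that kind of workaround (parallel tasks whose `pd.read_hdf` calls are serialized by a lock); the paths and store keys are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Hypothetical paths and store keys; in the scenario above they live on a network share.
TASKS = [("//share/data/file1.h5", "store_a"),
         ("//share/data/file2.h5", "store_b")]

HDF_LOCK = threading.Lock()

def read_store(path, key):
    # Serializing the pd.read_hdf call avoids the crashes, at the cost of
    # performing the actual HDF5 reads one at a time.
    with HDF_LOCK:
        return pd.read_hdf(path, key)

with ThreadPoolExecutor(max_workers=4) as pool:
    frames = list(pool.map(lambda task: read_store(*task), TASKS))
```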
Without the lock, the code crashes deep into the C code. Suggestions on how to fix this:
|
Is anyone working on this issue at the moment? |
At PyTables we're now working on a simple fix that might be released soon, see issue #776. |
* Bring I/O support to Dask for everything supported
* Resolves #940
* Refactor code to be modular so that Dask can use the implementation originally intended for Ray.
* Builds on #754
* Remove a large amount of duplicate logic in the Column Store family of readers
* Removes unnecessary classes and instead creates anonymous classes that mix in the necessary components of the readers and call `.read` on the anonymous class.
* An interesting performance issue came up with `HDFStore` and `read_hdf`
* Related to pandas-dev/pandas#12236
* With Ray, multiple workers can read hdf5 files, but it is about 4x slower than defaulting to pandas
* Dask cannot read the hdf5 files in parallel without segfaults
* Dask and Ray now support the same I/O
* Performance was tested and it was discovered that Dask can be improved by setting `n_workers` to the number of CPUs on the machine. A new issue was created to track the performance tuning: #954
I had the same issue: multithreaded reading or writing, whether on the same or different HDF5 files, throws random crashes. I was able to avoid it by adding a global lock. Hope this helps if you run into trouble. |
I agree, see https://stackoverflow.com/a/29014295/2858145 and #9641 (comment) |
xref: #2397
xref #14263 (example)
I think that we should protect the file open/closes with a lock to avoid this problem.
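As a rough illustration of that idea (an application-level sketch, not the actual pandas implementation), the open/close calls could be guarded by a shared lock; the comments above suggest this alone may not be enough, since reads themselves can crash.

```python
import threading

import pandas as pd

_OPEN_CLOSE_LOCK = threading.Lock()

class LockedHDFStore(pd.HDFStore):
    """Sketch: serialize file open/close across threads with one shared lock."""

    def open(self, **kwargs):
        with _OPEN_CLOSE_LOCK:
            return super().open(**kwargs)

    def close(self):
        with _OPEN_CLOSE_LOCK:
            return super().close()
```

Usage is the same as `pd.HDFStore`, e.g. `with LockedHDFStore('data.h5') as store: ...`.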