BUG: read_hdf is not read-threadsafe #12236


Open
jreback opened this issue Feb 5, 2016 · 19 comments
Labels
Bug IO HDF5 read_hdf, HDFStore Multithreading Parallelism in pandas

Comments

@jreback
Contributor

jreback commented Feb 5, 2016

xref: #2397
xref #14263 (example)

I think that we should protect the file open/closes with a lock to avoid this problem.
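The idea can be sketched as follows (a minimal, hypothetical sketch, not the eventual fix: the `opener`/`closer` callables stand in for the HDF5 open and close calls, e.g. `lambda: pd.HDFStore(path, mode='r')` and the store's `close`): only the open and close steps are serialized behind a module-level lock, while the reads themselves run concurrently.

```python
import threading
from contextlib import contextmanager

# Hypothetical sketch: serialize only open/close, not the reads themselves.
_OPEN_CLOSE_LOCK = threading.Lock()

@contextmanager
def guarded_handle(opener, closer):
    """Run `opener` and `closer` under a global lock; `opener` returns a handle.

    In the real case `opener` would be something like
    `lambda: pd.HDFStore(path, mode='r')` and `closer` the store's close().
    """
    with _OPEN_CLOSE_LOCK:
        handle = opener()
    try:
        yield handle
    finally:
        with _OPEN_CLOSE_LOCK:
            closer(handle)
```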

@jreback jreback added this to the Next Major Release milestone Feb 5, 2016
@toobaz
Member

toobaz commented Feb 14, 2016

I agree! Although my code assumes write-only access, and so would require some fixing and integration, do you think the approach makes sense?

@jreback
Contributor Author

jreback commented Feb 15, 2016

@toobaz your solution 'works', but it is not a recommended way to do this at all. HDF5 by definition should only ever have a single writer.

This helps with the SWMR case: https://www.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesSwmrDocs.html

All that said, read_hdf is not thread-safe for reading, I think because the file handle needs to be opened in the master thread.

Here's an example using a multiprocessing thread pool; it needs testing with a regular thread pool as well. This currently segfaults:

import numpy as np
import pandas as pd
from pandas.util import testing as tm
from multiprocessing.pool import ThreadPool

path = 'test.hdf'
num_rows = 100000
num_tasks = 4

def make_df(num_rows=10000):

    df = pd.DataFrame(np.random.rand(num_rows, 5), columns=list('abcde'))
    df['foo'] = 'foo'
    df['bar'] = 'bar'
    df['baz'] = 'baz'
    df['date'] = pd.date_range('20000101 09:00:00',
                               periods=num_rows,
                               freq='s')
    df['int'] = np.arange(num_rows, dtype='int64')
    return df

print("writing df")
df = make_df(num_rows=num_rows)
df.to_hdf(path, 'df', format='table')

# single threaded
print("reading df - single threaded")
result = pd.read_hdf(path, 'df')
tm.assert_frame_equal(result, df)

# multi-threaded
def reader(arg):
    start, nrows = arg

    return pd.read_hdf(path,
                       'df',
                       mode='r',
                       start=start,
                       stop=start+nrows)

tasks = [
    (num_rows * i // num_tasks,   # integer division so start/stop are ints
     num_rows // num_tasks) for i in range(num_tasks)
    ]

pool = ThreadPool(processes=num_tasks)

print("reading df - multi threaded")
results = pool.map(reader, tasks)
result = pd.concat(results)

tm.assert_frame_equal(result, df)

@dragonator4

Yes this is a duplicate of #14692. I have a small contribution to this discussion:

def reader(arg):
    start, nrows = arg
    with pd.HDFStore(path, 'r') as store:
        df = pd.read_hdf(store,
                         'df',
                         mode='r',
                         start=int(start),
                         stop=int(start+nrows))
    return df

This works without any problems in the code example above. It seems that as long as multiple connections are made to the same store, and only one query is placed through each connection, there is no error. Looking again at the example I provided in #14692, and also at the original version of reader given by @jreback above, multiple queries through the same connection cause all the problems.

The fix may not involve placing locks on the store. Perhaps the fix is to allow pd.read_hdf (or other variants of accessing data from a store in read mode) to make as many connections as required...

@jtf621

jtf621 commented Jan 31, 2017

I found another example of the issue. See #15274.

Note the segfault failure rate depends on the presence or absence of compression on the hdf file.

@rbiswas4

I am running into this issue on both OSX and Linux (CentOS) while trying to parallelize using joblib, even though I am not attempting any writes (no SWMR involved). Is there a different approach for a read-only workload? Is this a pandas issue or an HDF5 C-library issue? Thanks.

In my application I am trying to read different groups in different processes, and running into this error.
I am using:

pandas                    0.19.1
pytables                  3.4.2
joblib                    0.11

The code below works when n_jobs=1 and fails when n_jobs is 2 or more.

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

store = pd.HDFStore('ddf_lcs.hdf', mode='r')
params = pd.read_hdf('ddf_params.hdf')

tileIDs = params.tileID.dropna().unique().astype(np.int)

def runGroup(i, store=store, minion_params=params):
    print('group ', i)
    if i is np.nan:
        return
    low = i - 0.5
    high = i + 0.5
    print('i, low and high are ', i, low, high)
    key = str(int(i))
    lcdf = store.select(key)
    print('this group has {0} rows and {1} unique values of var'.format(len(lcdf), lcdf.snid.unique().size))

Parallel(n_jobs=1)(delayed(runGroup)(t) for t in tileIDs[:10])
store.close()

The error message I get is:

/usr/local/miniconda/lib/python2.7/site-packages/tables/group.py:1213: UserWarning: problems loading leaf ``/282748/table``::

  HDF5 error back trace

  File "H5Dio.c", line 173, in H5Dread
    can't read data
  File "H5Dio.c", line 554, in H5D__read
    can't read data
  File "H5Dchunk.c", line 1856, in H5D__chunk_read
    error looking up chunk address
  File "H5Dchunk.c", line 2441, in H5D__chunk_lookup
    can't query chunk address
  File "H5Dbtree.c", line 998, in H5D__btree_idx_get_addr
    can't get chunk info
  File "H5B.c", line 340, in H5B_find
    unable to load B-tree node
  File "H5AC.c", line 1262, in H5AC_protect
    H5C_protect() failed.
  File "H5C.c", line 3574, in H5C_protect
    can't load entry
  File "H5C.c", line 7954, in H5C_load_entry
    unable to load entry
  File "H5Bcache.c", line 143, in H5B__load
    wrong B-tree signature

End of HDF5 error back trace
...

@jreback
Contributor Author

jreback commented Apr 23, 2017

@rbiswas4
You can't pass the handles in like this. You can try to open inside the passed function (as read-only); that might work. This is a non-trivial problem (reading from HDF5 in a threadsafe manner via multiple processes). Have a look at how dask solved this (and is still dealing with it).

@rbiswas4

rbiswas4 commented Apr 23, 2017

you can try to open inside the passed function (as read-only)

@jreback Interesting. I thought I had trouble with that and had hence moved to HDFStore, but it looks like I should be able to pass the filename to the function and read it with pd.read_hdf(filename, key=key) inside the function. Some initial tests suggest this is indeed true, which would solve the problem of reading. If I am guessing correctly, the threadsafe mechanism is essentially a lock on the file-open process, and not on retrieval operations like get or select, which is what I thought HDFStore was for.

Also thank you for the pointer to the dask issues which discuss parallel write/save mechanisms.

@grantstephens

So I think I have just run into this issue, but in a slightly different use case.
I have multiple threads that each work with different HDF files via pd.HDFStore. Reading the files works fine, but writing to them fails with what looks like a compression-related error from PyTables:
if complib is not None and complib not in tables.filters.all_complibs: AttributeError: module 'tables' has no attribute 'filters'
I must stress that these are completely different files, but I am writing to them with compression from different threads. It may be that I am using the compression library from multiple threads at the same time; any hints or advice would be appreciated.

@lafrech

lafrech commented Dec 19, 2017

For the record, this is what I do. It is probably obvious, but just in case it could help anyone, here it is.

import threading
from contextlib import contextmanager

import pandas as pd

HDF_LOCK = threading.Lock()
HDF_FILEPATH = '/my_file_path'

@contextmanager
def locked_store():
    with HDF_LOCK:
        with pd.HDFStore(HDF_FILEPATH) as store:
            yield store

Then

    def get(self, ts_id, t_start, t_end):
        with locked_store() as store:
            where = 'index>=t_start and index<t_end'
            return store.select(ts_id, where=where)

If several files are used, a dict of locks can be used to avoid locking the whole application. It can be a defaultdict with a new Lock as default value. The key should be the absolute filepath with no symlink, to be sure a file can't be accessed with two different locks.

@makmanalp
Contributor

@lafrech you could even make HDF_FILEPATH just an argument to the context manager and make it more generic:

@contextmanager
def locked_store(*args, lock=threading.Lock, **kwargs):
    with lock():
        with pd.HDFStore(*args, **kwargs) as store:
            yield store

@lafrech

lafrech commented Mar 21, 2018

@makmanalp, won't this create a new lock for each call (thus defeating the purpose of the lock)?

We need to be sure that calls to the same file share the same lock and calls to different files don't.

I don't see how this is achieved in your example.

I assume I have to map the name, path or whatever identifier of the file to a lock instance.

Or maybe I am missing something.

@makmanalp
Contributor

Er, my mistake entirely - I was just trying to point out that one can make it more reusable. It should be something like:

def make_locked_store(lock, filepath):
    @contextmanager
    def locked_store(*args, **kwargs):
        with lock:
            with pd.HDFStore(filepath, *args, **kwargs) as store:
                yield store
    return locked_store

@lafrech

lafrech commented Oct 3, 2018

If several files are used, a dict of locks can be used to avoid locking the whole application.

⚠️ Warning ⚠️

I've been trying this and had issues. It looks like the whole HDF5 library is not thread-safe, and it fails even when accessing two different files concurrently. I didn't take the time to verify this, so I may be totally wrong, but beware if you try it.

I reverted my change, and I keep a single lock in the application rather than a lock per hdf5 file.

Since HDF5 is hierarchical, I guess a lot of users put everything in a single file anyway. We ended up using many files mainly because HDF5 files tend to get corrupted; when restoring daily backups, we then only lose data from a single file rather than from the whole base.

@schneiderfelipe

I just got the very same problem with PyTables 3.4.4 and Pandas 0.24.1.
Downgrading to the ones provided by Ubuntu (3.4.2-4 and 0.22.0-4, respectively) solved the issue.

@joaoe

joaoe commented May 13, 2019

Hi.

In my project we have hundreds of HDF files; each file has many stores, and each store holds 300000x400 cells. These are loaded selectively by our application, both when developing locally and by our dev/test/prod instances. They are all stored in a Windows shared folder. Reading from a network share is slow, so it was natural to read different stores in parallel.

My code

store = pd.HDFStore(fname, mode="r")
data = store[key]
store.close()

Someone suggested protecting the calls to HDFStore() and close() with a lock, but the problem persists.

EOFError: Traceback (most recent call last):
  File "C:\...\test.py", line 75, in main
    data = store[key]
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 483, in __getitem__
    return self.get(key)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 671, in get
    return self._read_group(group)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 1349, in _read_group
    return s.read(**kwargs)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 2895, in read
    ax = self.read_index('axis%d' % i, start=_start, stop=_stop)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 2493, in read_index
    _, index = self.read_index_node(getattr(self.group, key), **kwargs)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 2591, in read_index_node
    data = node[start:stop]
  File "c:\python27_64bit\lib\site-packages\tables\vlarray.py", line 675, in __getitem__
    return self.read(start, stop, step)
  File "c:\python27_64bit\lib\site-packages\tables\vlarray.py", line 815, in read
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "c:\python27_64bit\lib\site-packages\tables\atom.py", line 1228, in fromarray
    return six.moves.cPickle.loads(array.tostring())

Without the lock, the code crashes deep into the C code.

It is quite obvious that tables.file.open() is the problem here, as that function uses a global FileRegistry.

Suggestions how to fix:

  1. flock the file. Use a shared lock for read-only access and an exclusive lock for write access. This protects access across processes, though it can fail (the feature may not be available in the OS, or the file may sit on a remote share).
  2. Do not share file handles. Each open call should produce a new, independent file handle; that is how every other file I/O API works. FileRegistry can still keep track of file handles, but it should not share them, or close previous handles when opening new ones.
  3. Protect the file handles internally with a ReadWriteLock (which works like flock, but in memory: https://www.oreilly.com/library/view/python-cookbook/0596001673/ch06s04.html) so write operations don't clobber each other or disrupt read operations.
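Suggestion 3 can be sketched as a minimal in-memory read-write lock (a bare-bones variant of the Python Cookbook recipe linked above, not pandas or PyTables code; it has no writer priority, so writers can starve under a steady stream of readers):

```python
import threading

class ReadWriteLock:
    """Allow many concurrent readers or exactly one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:          # wait out any active writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:       # last reader wakes waiting writers
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```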

@ZanSara

ZanSara commented Oct 25, 2019

Is anyone working on this issue at the moment?
I've also hit it, and I am willing to do some work on it if no one else has started already.

@ZanSara

ZanSara commented Oct 29, 2019

At PyTables we're now working on a simple fix that might be released soon, see issue #776.

devin-petersohn added a commit to devin-petersohn/modin that referenced this issue Jan 3, 2020
* Resolves modin-project#940
* Refactor code to be modular so that Dask can use the implementation
  originally intended for Ray.
* Builds on modin-project#754
* Remove a large amount of duplicate logic in the Column Store family of
  readers
* Removes unnecessary classes and instead creates anonymous classes that
  mix-in the necessary components of the readers and call `.read` on the
  anonymous class.
* An interesting performance issue came up with `HDFStore` and the
  `read_hdf`
  * Related to pandas-dev/pandas#12236
  * With Ray, multiple workers can read hdf5 files, but it is about 4x
    slower than defaulting to pandas
  * Dask cannot read the hdf5 files in-parallel without seg-faults
* Dask and Ray now support the same I/O
* Performance was tested and it was discovered that Dask can be improved
  by setting `n_workers` to the number of CPUs on the machine. A new
  issue was created to track the performance tuning: modin-project#954
devin-petersohn added a commit to modin-project/modin that referenced this issue Jan 3, 2020
* Bring I/O support to Dask for everything supported

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@Han-aorweb

I had the same issue where multithreaded reading/writing of the same or different hdf5 files throws random crashes. I was able to avoid it by adding a global lock. Hope this helps if you have trouble.

@toobaz
Member

toobaz commented Mar 20, 2023

I had the same issue where multithreaded reading/writing of the same or different hdf5 files throws random crashes. I was able to avoid it by adding a global lock. Hope this helps if you have trouble.

I agree, see https://stackoverflow.com/a/29014295/2858145 and #9641 (comment)
