BUG: read_hdf is not read-threadsafe #12236


Open
jreback opened this issue Feb 5, 2016 · 19 comments
Labels
Bug IO HDF5 read_hdf, HDFStore Multithreading Parallelism in pandas

Comments

@jreback
Contributor

jreback commented Feb 5, 2016

xref: #2397
xref #14263 (example)

I think that we should protect the file open/closes with a lock to avoid this problem.
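The idea can be sketched as follows (a minimal, hypothetical sketch, not the eventual fix: the `opener`/`closer` callables stand in for the HDF5 open and close calls, e.g. `lambda: pd.HDFStore(path, mode='r')` and the store's `close`): only the open and close steps are serialized behind a module-level lock, while the reads themselves run concurrently.

```python
import threading
from contextlib import contextmanager

# Hypothetical sketch: serialize only open/close, not the reads themselves.
_OPEN_CLOSE_LOCK = threading.Lock()

@contextmanager
def guarded_handle(opener, closer):
    """Run `opener` and `closer` under a global lock; `opener` returns a handle.

    In the real case `opener` would be something like
    `lambda: pd.HDFStore(path, mode='r')` and `closer` the store's close().
    """
    with _OPEN_CLOSE_LOCK:
        handle = opener()
    try:
        yield handle
    finally:
        with _OPEN_CLOSE_LOCK:
            closer(handle)
```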

@jreback jreback added this to the Next Major Release milestone Feb 5, 2016
@toobaz
Member

toobaz commented Feb 14, 2016

I agree! Although my code assumes write-only access, and so would require some fixing and integration, do you think the approach makes sense?

@jreback
Contributor Author

jreback commented Feb 15, 2016

@toobaz your solution 'works', but it is not a recommended way to do this at all. HDF5 by definition should only ever have a single writer.

This helps with the SWMR case: https://www.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesSwmrDocs.html

All that said, read_hdf is not thread-safe for reading, I think because the file handle needs to be opened in the master thread.

Here's an example using a multiprocessing thread pool; it needs testing with a regular thread pool as well. This currently segfaults:

import numpy as np
import pandas as pd
from pandas.util import testing as tm
from multiprocessing.pool import ThreadPool

path = 'test.hdf'
num_rows = 100000
num_tasks = 4

def make_df(num_rows=10000):

    df = pd.DataFrame(np.random.rand(num_rows, 5), columns=list('abcde'))
    df['foo'] = 'foo'
    df['bar'] = 'bar'
    df['baz'] = 'baz'
    df['date'] = pd.date_range('20000101 09:00:00',
                               periods=num_rows,
                               freq='s')
    df['int'] = np.arange(num_rows, dtype='int64')
    return df

print("writing df")
df = make_df(num_rows=num_rows)
df.to_hdf(path, 'df', format='table')

# single threaded
print("reading df - single threaded")
result = pd.read_hdf(path, 'df')
tm.assert_frame_equal(result, df)

# multi-threaded
def reader(arg):
    start, nrows = arg

    return pd.read_hdf(path,
                       'df',
                       mode='r',
                       start=start,
                       stop=start+nrows)

tasks = [
    (num_rows * i // num_tasks,   # integer division so start/stop are ints
     num_rows // num_tasks) for i in range(num_tasks)
    ]

pool = ThreadPool(processes=num_tasks)

print("reading df - multi threaded")
results = pool.map(reader, tasks)
result = pd.concat(results)

tm.assert_frame_equal(result, df)

@dragonator4

Yes this is a duplicate of #14692. I have a small contribution to this discussion:

def reader(arg):
    start, nrows = arg
    with pd.HDFStore(path, 'r') as store:
        df = pd.read_hdf(store,
                         'df',
                         mode='r',
                         start=int(start),
                         stop=int(start+nrows))
    return df

This works without any problems in the code example above. It seems that as long as multiple connections are made to the same store, and only one query is placed through each connection, there is no error. Looking again at the example I provided in #14692, and also at the original version of reader given by @jreback above, multiple queries through the same connection cause all the problems.

The fix may not involve placing locks on the store. Perhaps the fix is to allow pd.read_hdf (or other variants of accessing data from a store in read mode) to make as many connections as required...

@jtf621

jtf621 commented Jan 31, 2017

I found another example of the issue. See #15274.

Note the segfault failure rate depends on the presence or absence of compression on the hdf file.

@rbiswas4

I am running into this issue on both OSX and Linux (CentOS) while trying to parallelize using joblib, even though I am not attempting any writes (no SWMR involved). Is there a different approach for a read-only workload? Is this a pandas issue or an HDF5 C-library issue? Thanks.

In my application I am trying to read different groups in different processes, and running into this error.
I am using:

pandas                    0.19.1
pytables                  3.4.2
joblib                    0.11

The code below works when n_jobs=1 and fails when n_jobs is 2 or more.

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

store = pd.HDFStore('ddf_lcs.hdf', mode='r')
params = pd.read_hdf('ddf_params.hdf')

tileIDs = params.tileID.dropna().unique().astype(np.int)

def runGroup(i, store=store, minion_params=params):
    print('group ', i)
    if i is np.nan:
        return
    low = i - 0.5
    high = i + 0.5
    print('i, low and high are ', i, low, high)
    key = str(int(i))
    lcdf = store.select(key)
    print('this group has {0} rows and {1} unique values of var'.format(len(lcdf), lcdf.snid.unique().size))

Parallel(n_jobs=1)(delayed(runGroup)(t) for t in tileIDs[:10])
store.close()

The error message I get is:

/usr/local/miniconda/lib/python2.7/site-packages/tables/group.py:1213: UserWarning: problems loading leaf ``/282748/table``::

  HDF5 error back trace

  File "H5Dio.c", line 173, in H5Dread
    can't read data
  File "H5Dio.c", line 554, in H5D__read
    can't read data
  File "H5Dchunk.c", line 1856, in H5D__chunk_read
    error looking up chunk address
  File "H5Dchunk.c", line 2441, in H5D__chunk_lookup
    can't query chunk address
  File "H5Dbtree.c", line 998, in H5D__btree_idx_get_addr
    can't get chunk info
  File "H5B.c", line 340, in H5B_find
    unable to load B-tree node
  File "H5AC.c", line 1262, in H5AC_protect
    H5C_protect() failed.
  File "H5C.c", line 3574, in H5C_protect
    can't load entry
  File "H5C.c", line 7954, in H5C_load_entry
    unable to load entry
  File "H5Bcache.c", line 143, in H5B__load
    wrong B-tree signature

End of HDF5 error back trace
...

@jreback
Contributor Author

jreback commented Apr 23, 2017

@rbiswas4
You can't pass the handles in like this. You can try to open inside the passed function (as read-only); that might work. This is a non-trivial problem (reading from HDF5 in a threadsafe manner via multiple processes). Have a look at how dask solved this (and is still dealing with it).

@rbiswas4

rbiswas4 commented Apr 23, 2017

you can try to open inside the passed function (as read-only)

@jreback Interesting. I thought I had trouble with that and had hence moved to HDFStore, but it looks like I should be able to pass the filename to the function and read it with pd.read_hdf(filename, key=key) inside the function. Some initial tests suggest this is indeed true, which would solve the problem of reading. If I am guessing correctly, the threadsafe mechanism is essentially a lock on the file-open process, and not on retrieval operations like get or select, which is what I thought HDFStore was for.

Also thank you for the pointer to the dask issues which discuss parallel write/save mechanisms.

@grantstephens

So I think I have just run into this issue, but in a slightly different use case.
I have multiple threads that each work with different HDF files via pd.HDFStore. Reading the files works fine, but writing to them fails with what looks like a compression-related error from PyTables:
if complib is not None and complib not in tables.filters.all_complibs: AttributeError: module 'tables' has no attribute 'filters'
I must stress that these are completely different files, but I am writing to them with compression from different threads. It may be that I am using the compression library from multiple threads at the same time; any hints or advice would be appreciated.

@lafrech

lafrech commented Dec 19, 2017

For the record, this is what I do. It is probably obvious, but just in case it could help anyone, here it is.

import threading
from contextlib import contextmanager

import pandas as pd

HDF_LOCK = threading.Lock()
HDF_FILEPATH = '/my_file_path'

@contextmanager
def locked_store():
    with HDF_LOCK:
        with pd.HDFStore(HDF_FILEPATH) as store:
            yield store

Then

    def get(self, ts_id, t_start, t_end):
        with locked_store() as store:
            where = 'index>=t_start and index<t_end'
            return store.select(ts_id, where=where)

If several files are used, a dict of locks can be used to avoid locking the whole application. It can be a defaultdict with a new Lock as default value. The key should be the absolute filepath with no symlink, to be sure a file can't be accessed with two different locks.

@makmanalp
Contributor

@lafrech you could even make HDF_FILEPATH just an argument to the context manager and make it more generic:

@contextmanager
def locked_store(*args, lock=threading.Lock, **kwargs):
    with lock():
        with pd.HDFStore(*args, **kwargs) as store:
            yield store

@lafrech

lafrech commented Mar 21, 2018

@makmanalp, won't this create a new lock for each call (thus defeating the purpose of the lock)?

We need to be sure that calls to the same file share the same lock and calls to different files don't.

I don't see how this is achieved in your example.

I assume I have to map the name, path or whatever identifier of the file to a lock instance.

Or maybe I am missing something.

@makmanalp
Contributor

Er, my mistake entirely - I was just trying to point out that one can make it more reusable. It should be something like:

def make_locked_store(lock, filepath):
    @contextmanager
    def locked_store(*args, **kwargs):
        with lock:
            with pd.HDFStore(filepath, *args, **kwargs) as store:
                yield store
    return locked_store

@lafrech

lafrech commented Oct 3, 2018

If several files are used, a dict of locks can be used to avoid locking the whole application.

⚠️ Warning ⚠️

I've been trying this and had issues. It looks like the whole HDF5 library is not thread-safe, and it fails even when accessing two different files concurrently. I didn't take the time to verify this, so I may be totally wrong, but beware if you try it.

I reverted my change, and I keep a single lock in the application rather than a lock per hdf5 file.

Since HDF5 is hierarchical, I guess a lot of users put everything in a single file anyway. We ended up using many files mainly because HDF5 files tend to get corrupted; when restoring daily backups, we then only lose data from a single file rather than from the whole base.

@schneiderfelipe

I just got the very same problem with PyTables 3.4.4 and Pandas 0.24.1.
Downgrading to the ones provided by Ubuntu (3.4.2-4 and 0.22.0-4, respectively) solved the issue.

@joaoe

joaoe commented May 13, 2019

Hi.

In my project we have hundreds of HDF files; each file has many stores, and each store holds 300000x400 cells. These are loaded selectively by our application, both when developing locally and by our dev/test/prod instances. They are all stored in a Windows shared folder. Reading from a network share is slow, so it was natural to read different stores in parallel.

My code

store = pd.HDFStore(fname, mode="r")
data = store[key]
store.close()

Someone suggested protecting the calls to HDFStore() and close() with a lock, but the problem persists.

EOFError: Traceback (most recent call last):
  File "C:\...\test.py", line 75, in main
    data = store[key]
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 483, in __getitem__
    return self.get(key)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 671, in get
    return self._read_group(group)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 1349, in _read_group
    return s.read(**kwargs)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 2895, in read
    ax = self.read_index('axis%d' % i, start=_start, stop=_stop)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 2493, in read_index
    _, index = self.read_index_node(getattr(self.group, key), **kwargs)
  File "c:\python27_64bit\lib\site-packages\pandas\io\pytables.py", line 2591, in read_index_node
    data = node[start:stop]
  File "c:\python27_64bit\lib\site-packages\tables\vlarray.py", line 675, in __getitem__
    return self.read(start, stop, step)
  File "c:\python27_64bit\lib\site-packages\tables\vlarray.py", line 815, in read
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "c:\python27_64bit\lib\site-packages\tables\atom.py", line 1228, in fromarray
    return six.moves.cPickle.loads(array.tostring())

Without the lock, the code crashes deep into the C code.

It is quite obvious that tables.file.open() is the problem here, as that function uses a global FileRegistry.

Suggestions how to fix:

  1. flock the file. Use a shared lock for read-only access and an exclusive lock for write access. This protects access across processes, though it can fail (the feature may not be available in the OS, or the file may sit on a remote share).
  2. Do not share file handles. Each open call should produce a new, independent file handle; that is how every other file I/O API works. FileRegistry can still keep track of file handles, but it should not share them, or close previous handles when opening new ones.
  3. Protect the file handles internally with a ReadWriteLock (which works like flock, but in memory: https://www.oreilly.com/library/view/python-cookbook/0596001673/ch06s04.html) so write operations don't clobber each other or disrupt read operations.
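Suggestion 3 can be sketched as a minimal in-memory read-write lock (a bare-bones variant of the Python Cookbook recipe linked above, not pandas or PyTables code; it has no writer priority, so writers can starve under a steady stream of readers):

```python
import threading

class ReadWriteLock:
    """Allow many concurrent readers or exactly one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:          # wait out any active writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:       # last reader wakes waiting writers
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```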

@ZanSara

ZanSara commented Oct 25, 2019

Is anyone working on this issue at the moment?
I've also hit it, and I am willing to do some work on it if no one else has started already.

@ZanSara

ZanSara commented Oct 29, 2019

At PyTables we're now working on a simple fix that might be released soon, see issue #776.

devin-petersohn added a commit to devin-petersohn/modin that referenced this issue Jan 3, 2020
* Resolves modin-project#940
* Refactor code to be modular so that Dask can use the implementation
  originally intended for Ray.
* Builds on modin-project#754
* Remove a large amount of duplicate logic in the Column Store family of
  readers
* Removes unnecessary classes and instead creates anonymous classes that
  mix-in the necessary components of the readers and call `.read` on the
  anonymous class.
* An interesting performance issue came up with `HDFStore` and the
  `read_hdf`
  * Related to pandas-dev/pandas#12236
  * With Ray, multiple workers can read hdf5 files, but it is about 4x
    slower than defaulting to pandas
  * Dask cannot read the hdf5 files in-parallel without seg-faults
* Dask and Ray now support the same I/O
* Performance was tested and it was discovered that Dask can be improved
  by setting `n_workers` to the number of CPUs on the machine. A new
  issue was created to track the performance tuning: modin-project#954
devin-petersohn added a commit to modin-project/modin that referenced this issue Jan 3, 2020
* Bring I/O support to Dask for everything supported

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@Han-aorweb

I had the same issue where multithreaded reading/writing of the same or different hdf5 files throws random crashes. I was able to avoid it by adding a global lock. Hope this helps if you have trouble.

@toobaz
Member

toobaz commented Mar 20, 2023

I had the same issue where multithreaded reading/writing of the same or different hdf5 files throws random crashes. I was able to avoid it by adding a global lock. Hope this helps if you have trouble.

I agree, see https://stackoverflow.com/a/29014295/2858145 and #9641 (comment)
