torch.utils.data.DataLoader errors when reading an h5 file with multiple workers; single-process works fine, parallel fails. #3415

Closed
flystarhe opened this issue Nov 1, 2017 · 20 comments


@flystarhe

torch.utils.data.DataLoader errors when reading an h5 file with multiple workers; single-process works fine, but parallel reading fails.

The code is as follows:

    print('==> Loading datasets')
    train_set = DatasetFromHdf5(opt.trainSet)
    training_data_loader = DataLoader(dataset=train_set, num_workers=opt.threads, batch_size=opt.batchSize, shuffle=True)

The error is as follows:

OSError: Traceback (most recent call last):
  File "/home/hejian/anaconda3/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 40, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/hejian/anaconda3/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 40, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/data2/slyx-sr/src/dataset.py", line 14, in __getitem__
    return torch.from_numpy(self.data[index,:,:,:]).float(), torch.from_numpy(self.target[index,:,:,:]).float()
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/_objects.c:2846)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/_objects.c:2804)
  File "/home/hejian/anaconda3/lib/python3.5/site-packages/h5py/_hl/dataset.py", line 494, in __getitem__
    self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/_objects.c:2846)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/_objects.c:2804)
  File "h5py/h5d.pyx", line 181, in h5py.h5d.DatasetID.read (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/h5d.c:3413)
  File "h5py/_proxy.pyx", line 130, in h5py._proxy.dset_rw (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/_proxy.c:2008)
  File "h5py/_proxy.pyx", line 84, in h5py._proxy.H5PY_H5Dread (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/_proxy.c:1656)
OSError: Can't read data (Wrong b-tree signature)
@flystarhe
Author

class DatasetFromHdf5(data.Dataset):
    def __init__(self, file_path):
        super(DatasetFromHdf5, self).__init__()
        hf = h5py.File(file_path)
        self.data = hf.get('data')
        self.target = hf.get('label')

    def __getitem__(self, index):
        return torch.from_numpy(self.data[index,:,:,:]).float(), torch.from_numpy(self.target[index,:,:,:]).float()

    def __len__(self):
        return self.data.shape[0]

@fmassa
Member

fmassa commented Nov 1, 2017

Maybe hdf5 is not thread safe? Does it work without threads?

@soumith
Member

soumith commented Nov 1, 2017

This is an HDF5 issue. The problem is that concurrent HDF5 reads aren't safe:
pandas-dev/pandas#12236
pandas-dev/pandas#14692

To actually allow concurrent reads of a file you have to use the SWMR (single-writer/multiple-reader) feature of HDF5: https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesSwmrDocs.html

@soumith soumith closed this as completed Nov 1, 2017
@soumith
Member

soumith commented Nov 1, 2017

Actually this thread gives proper workarounds as well: https://stackoverflow.com/questions/34906652/does-hdf5-support-concurrent-reads-or-writes-to-different-files

I think that if you use Python 3 and add the following lines at the top of your main script (not the dataset), before you import h5py, it will be fixed:

import torch.multiprocessing as mp
mp.set_start_method('spawn')
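A minimal sketch of how the main script might be laid out, assuming the dataset class lives in a separate dataset.py module (the module name, file path, and loader settings below are placeholders); keeping set_start_method under the __main__ guard prevents it from running again when spawned workers re-import the script:

    import torch.multiprocessing as mp
    from torch.utils.data import DataLoader
    from dataset import DatasetFromHdf5   # hypothetical module containing the dataset class

    if __name__ == '__main__':
        # 'spawn' workers re-import this script under a different __name__,
        # so the start method is only set once, in the parent process.
        mp.set_start_method('spawn')

        # Note: with 'spawn' the dataset is pickled to each worker, so it should
        # not hold an already-open h5py file handle.
        train_set = DatasetFromHdf5('train.h5')
        loader = DataLoader(train_set, num_workers=4, batch_size=64, shuffle=True)
        for inputs, targets in loader:
            pass  # training step would go here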

@flystarhe
Author

@soumith thanks

@Vandermode

Vandermode commented Dec 8, 2017

@soumith when I add

import torch.multiprocessing as mp
mp.set_start_method('spawn')

at the top of my main script, another error occurred:

Traceback (most recent call last):
  File "utility/dataset.py", line 237, in <module>
    data_iter = iter(train_loader)
  File "/home/kaixuan/anaconda3/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 310, in __iter__
    return DataLoaderIter(self)
  File "/home/kaixuan/anaconda3/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 167, in __init__
    w.start()
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/context.py", line 274, in _Popen
    return Popen(process_obj)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/popen_spawn_posix.py", line 33, in __init__
    super().__init__(process_obj)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/popen_spawn_posix.py", line 48, in _launch
    reduction.dump(process_obj, fp)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/reduction.py", line 59, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'get_train_valid_loader.<locals>.<lambda>'

any idea? thx

Well, this error can be fixed by applying DataLoader directly to the dataset rather than wrapping it in an additional get_train_valid_loader function. But that still doesn't solve the original problem; in my situation it returns another error:

RuntimeError: context has already been set
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/spawn.py", line 106, in spawn_main
    exitcode = _main(fd)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/spawn.py", line 115, in _main
    prepare(preparation_data)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/spawn.py", line 226, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/spawn.py", line 278, in _fixup_main_from_path
    run_name="__mp_main__")
  File "/home/kaixuan/anaconda3/lib/python3.5/runpy.py", line 254, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/kaixuan/anaconda3/lib/python3.5/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/kaixuan/anaconda3/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/kaixuan/DATA/Papers/Code/utility/dataset.py", line 7, in <module>
    mp.set_start_method('spawn')
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/context.py", line 231, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

Another odd thing is that the process does not return, as if it were stuck.
Also note that when I delete the two added lines and set num_workers to 1, it returns the correct result.

@zhbbupt

zhbbupt commented Dec 23, 2017

Have you solved this problem? @flystarhe

@flystarhe
Author

@zhbbupt No, I gave up and just run it with a single process.

@ooteki

ooteki commented May 31, 2018

Same problem here: reading works fine with a single process, but fails with multiple workers.

@Jongchan

Jongchan commented Jul 2, 2018

I think you can bypass the runtime error with an exception handler:

    try:
        set_start_method('spawn')
    except RuntimeError:
        pass

It's mentioned in #3492 (comment).

I am not 100% sure that it will work, but you can try. It's just three more lines of code.

@RizhaoCai

Same problem: reading works fine with a single process, but fails with multiple workers.

Have you solved this problem? Does that mean an HDF5 dataset can only be read with a single process?

@lumaku

lumaku commented Feb 1, 2019

@RizhaoCai You can read an HDF5 file with multiple workers by using the SWMR feature in newer h5py versions.
There is a way to compile the HDF5 library to be thread-safe, but the h5py version from my distribution was not built that way. Instead of opening the file in the parent process, I opened the file in each worker for each read (which has a certain overhead, but worked for me), for example:

with h5py.File(file_name, 'r', libver='latest', swmr=True) as f:
    assert f.swmr_mode
    x = np.array( f[dset_name] )
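
A minimal sketch of a Dataset built around this per-read open; the file path and the 'data'/'label' dataset names are placeholders, and swmr=True assumes the file was written with libver='latest' (drop it otherwise):

    import h5py
    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class SwmrHdf5Dataset(Dataset):
        """Opens the HDF5 file on every read, so no open handle is shared across workers."""
        def __init__(self, file_path):
            self.file_path = file_path
            # Open once just to record the length, then close immediately.
            with h5py.File(file_path, 'r', libver='latest', swmr=True) as f:
                self.length = f['data'].shape[0]

        def __getitem__(self, index):
            with h5py.File(self.file_path, 'r', libver='latest', swmr=True) as f:
                x = torch.from_numpy(np.asarray(f['data'][index])).float()
                y = torch.from_numpy(np.asarray(f['label'][index])).float()
            return x, y

        def __len__(self):
            return self.length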

@RizhaoCai

RizhaoCai commented Feb 20, 2019

@RizhaoCai You can read a HDF5-file with multithreading using the SWMR feature in the newer h5py library version.
There is a way to compile the hdf5 library to be thread-safe, but the h5py-version I got from my distribution was not compiled this way. Instead of opening the file in the parent process, I opened the file in each worker for each read (which has a certain overhead, but worked for me), for example:

with h5py.File(file_name, 'r', libver='latest', swmr=True) as f:
    assert f.swmr_mode
    x = np.array( f[dset_name] )

Thanks! However, I added this into my code:

    from torch.utils.data import Dataset, DataLoader
    import h5py as h5  # added for completeness; the original snippet uses the `h5` alias

    class H5Dataset(Dataset):
        def __init__(self, h5db_path):
            h5db = h5.File(h5db_path, "r", libver='latest', swmr=True)
            self.X = h5db["faces"]
            self.labels = h5db["labels"]

    db_path = "train.h5"
    h5db = H5Dataset(db_path)
    # h5.File(h5db_path, "r", libver='latest', swmr=True) is opened inside H5Dataset
    data_op = DataLoader(h5db, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)

I still got the error:

    OSError: Can't read data (wrong B-tree signature)

If I add the code below at the top:

    import torch.multiprocessing as mp
    mp.set_start_method('spawn')

a new error occurs:

    TypeError: can't pickle _thread._local objects

Any ideas?

@RizhaoCai

RizhaoCai commented Feb 20, 2019

(Quoting @Vandermode's earlier comment about adding mp.set_start_method('spawn'), the "Can't pickle local object" error, and the "context has already been set" error.)

I encountered the same problem. Did you solve it? I mean, did you get it to work with num_workers > 1?

@Mengman

Mengman commented Oct 12, 2019

I got this working in my code.
My h5py version is 2.10.0.
Just enable SWMR mode:

    h5py.File(file_path, 'r', libver='latest', swmr=True)

and do not set the torch multiprocessing start method to 'spawn', i.e. do not add:

    import torch.multiprocessing as mp
    mp.set_start_method('spawn')

@turtleizzy

(Quoting @Mengman's SWMR suggestion above.)

I couldn't get DataLoader to work for num_workers>1 even with this trick.

@kl456123

Forget the complicated workarounds; the simplest approach is to add a lock with multiprocessing.Lock.
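
A minimal sketch of that lock-based approach, assuming the default fork start method on Linux (a multiprocessing.Lock cannot be pickled, so this would not work with 'spawn'); the 'data' and 'label' dataset names are placeholders:

    import h5py
    import torch
    from torch.utils.data import Dataset
    from multiprocessing import Lock

    class LockedHdf5Dataset(Dataset):
        """Serializes all HDF5 reads through a single lock shared by the forked workers."""
        def __init__(self, file_path):
            self.lock = Lock()               # inherited by the forked worker processes
            self.hf = h5py.File(file_path, 'r')
            self.data = self.hf['data']      # placeholder dataset names
            self.target = self.hf['label']

        def __getitem__(self, index):
            with self.lock:                  # only one worker reads from the file at a time
                x = torch.from_numpy(self.data[index]).float()
                y = torch.from_numpy(self.target[index]).float()
            return x, y

        def __len__(self):
            return self.data.shape[0]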

@jshi31

jshi31 commented Aug 31, 2020

Do not call h5py.File(file_path, 'r') in the __init__ function; open the file in __getitem__ instead, and check whether it has already been opened. For example:

    def __getitem__(self, item):
        if self.env is None:
            self.env = h5py.File(self.hf_path, 'r')

This way each worker process opens its own handle, and the HDF5 file is not read through a handle shared across processes.
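
Put together, a minimal sketch of that lazy-open pattern (the 'data' and 'label' dataset names and the length bookkeeping are placeholders, not from the original snippet):

    import h5py
    import torch
    from torch.utils.data import Dataset

    class LazyHdf5Dataset(Dataset):
        """Defers opening the file until the first read inside each worker process."""
        def __init__(self, file_path):
            self.file_path = file_path
            self.hf = None
            # Read the length once in the parent, then close so no open handle is inherited.
            with h5py.File(file_path, 'r') as f:
                self.length = f['data'].shape[0]

        def __getitem__(self, index):
            if self.hf is None:              # first call inside this worker process
                self.hf = h5py.File(self.file_path, 'r')
            x = torch.from_numpy(self.hf['data'][index]).float()
            y = torch.from_numpy(self.hf['label'][index]).float()
            return x, y

        def __len__(self):
            return self.length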

@thinkerww

Has anyone solved this? I ran into the same problem too.

@HarveyYan

HarveyYan commented Jul 20, 2021

Just want to add one small point here... When you do what @lumaku suggested (which really works), i.e. open the HDF5 file in each worker process, make sure that you don't have any open HDF5 handle in the parent process (or anywhere else in the program); otherwise it will still throw errors such as "Can't read data (inflate() failed)".
