torch.utils.data.DataLoader errors when reading an h5 file with multiple workers; single-process works fine, parallel fails. #3415

Closed
flystarhe opened this issue Nov 1, 2017 · 20 comments


@flystarhe

torch.utils.data.DataLoader errors when reading an h5 file with multiple workers; single-process works fine, but parallel reading fails.

The code is as follows:

    print('==> Loading datasets')
    train_set = DatasetFromHdf5(opt.trainSet)
    training_data_loader = DataLoader(dataset=train_set, num_workers=opt.threads, batch_size=opt.batchSize, shuffle=True)

The error is as follows:

OSError: Traceback (most recent call last):
  File "/home/hejian/anaconda3/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 40, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/hejian/anaconda3/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 40, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/data2/slyx-sr/src/dataset.py", line 14, in __getitem__
    return torch.from_numpy(self.data[index,:,:,:]).float(), torch.from_numpy(self.target[index,:,:,:]).float()
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/_objects.c:2846)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/_objects.c:2804)
  File "/home/hejian/anaconda3/lib/python3.5/site-packages/h5py/_hl/dataset.py", line 494, in __getitem__
    self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/_objects.c:2846)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/_objects.c:2804)
  File "h5py/h5d.pyx", line 181, in h5py.h5d.DatasetID.read (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/h5d.c:3413)
  File "h5py/_proxy.pyx", line 130, in h5py._proxy.dset_rw (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/_proxy.c:2008)
  File "h5py/_proxy.pyx", line 84, in h5py._proxy.H5PY_H5Dread (/home/ilan/minonda/conda-bld/h5py_1496916508360/work/h5py/_proxy.c:1656)
OSError: Can't read data (Wrong b-tree signature)
@flystarhe
Author

class DatasetFromHdf5(data.Dataset):
    def __init__(self, file_path):
        super(DatasetFromHdf5, self).__init__()
        hf = h5py.File(file_path)
        self.data = hf.get('data')
        self.target = hf.get('label')

    def __getitem__(self, index):
        return torch.from_numpy(self.data[index,:,:,:]).float(), torch.from_numpy(self.target[index,:,:,:]).float()

    def __len__(self):
        return self.data.shape[0]

@fmassa
Member

fmassa commented Nov 1, 2017

Maybe hdf5 is not thread safe? Does it work without threads?

@soumith
Member

soumith commented Nov 1, 2017

This is an HDF5 issue. The problem is that concurrent HDF5 reads aren't safe:
pandas-dev/pandas#12236
pandas-dev/pandas#14692

To actually allow concurrent reads of a file you have to use the SWMR (single-writer/multiple-reader) feature of HDF5: https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesSwmrDocs.html

@soumith soumith closed this as completed Nov 1, 2017
@soumith
Member

soumith commented Nov 1, 2017

Actually this thread gives proper workarounds as well: https://stackoverflow.com/questions/34906652/does-hdf5-support-concurrent-reads-or-writes-to-different-files

I think that if you use Python 3 and add the following lines at the top of your main script (not the dataset), before you import h5py, it will be fixed:

import torch.multiprocessing as mp
mp.set_start_method('spawn')
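A minimal sketch of how the main script might be laid out, assuming the dataset class lives in a separate dataset.py module (the module name, file path, and loader settings below are placeholders); keeping set_start_method under the __main__ guard prevents it from running again when spawned workers re-import the script:

    import torch.multiprocessing as mp
    from torch.utils.data import DataLoader
    from dataset import DatasetFromHdf5   # hypothetical module containing the dataset class

    if __name__ == '__main__':
        # 'spawn' workers re-import this script under a different __name__,
        # so the start method is only set once, in the parent process.
        mp.set_start_method('spawn')

        # Note: with 'spawn' the dataset is pickled to each worker, so it should
        # not hold an already-open h5py file handle.
        train_set = DatasetFromHdf5('train.h5')
        loader = DataLoader(train_set, num_workers=4, batch_size=64, shuffle=True)
        for inputs, targets in loader:
            pass  # training step would go here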

@flystarhe
Author

@soumith thanks

@Vandermode

Vandermode commented Dec 8, 2017

@soumith when I add

import torch.multiprocessing as mp
mp.set_start_method('spawn')

at the top of my main script, another error occurred:

Traceback (most recent call last):
  File "utility/dataset.py", line 237, in <module>
    data_iter = iter(train_loader)
  File "/home/kaixuan/anaconda3/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 310, in __iter__
    return DataLoaderIter(self)
  File "/home/kaixuan/anaconda3/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 167, in __init__
    w.start()
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/context.py", line 274, in _Popen
    return Popen(process_obj)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/popen_spawn_posix.py", line 33, in __init__
    super().__init__(process_obj)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/popen_spawn_posix.py", line 48, in _launch
    reduction.dump(process_obj, fp)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/reduction.py", line 59, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'get_train_valid_loader.<locals>.<lambda>'

any idea? thx

Well, this error can be fixed by applying DataLoader directly to the dataset rather than wrapping it in an additional get_train_valid_loader function. But that still doesn't solve the original problem; in my situation it returns another error:

RuntimeError: context has already been set
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/spawn.py", line 106, in spawn_main
    exitcode = _main(fd)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/spawn.py", line 115, in _main
    prepare(preparation_data)
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/spawn.py", line 226, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/spawn.py", line 278, in _fixup_main_from_path
    run_name="__mp_main__")
  File "/home/kaixuan/anaconda3/lib/python3.5/runpy.py", line 254, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/kaixuan/anaconda3/lib/python3.5/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/kaixuan/anaconda3/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/media/kaixuan/DATA/Papers/Code/utility/dataset.py", line 7, in <module>
    mp.set_start_method('spawn')
  File "/home/kaixuan/anaconda3/lib/python3.5/multiprocessing/context.py", line 231, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

Another odd thing is that the process does not return, as if it were stuck.
Also note that when I delete the two added lines and set num_workers to 1, it returns the correct result.

@zhbbupt

zhbbupt commented Dec 23, 2017

Have you solved this problem? @flystarhe

@flystarhe
Author

@zhbbupt No, I gave up and just run it with a single process.

@ooteki

ooteki commented May 31, 2018

Same problem here: reading works fine with a single process, but fails with multiple workers.

@Jongchan

Jongchan commented Jul 2, 2018

I think you can bypass the runtime error with an exception handler:

    try:
        set_start_method('spawn')
    except RuntimeError:
        pass

It's mentioned in #3492 (comment).

I am not 100% sure that it will work, but you can try. It's just three more lines of code.

@RizhaoCai

Same problem: reading works fine with a single process, but fails with multiple workers.

Have you solved this problem? Does that mean an HDF5 dataset can only be read with a single process?

@lumaku

lumaku commented Feb 1, 2019

@RizhaoCai You can read an HDF5 file with multiple workers by using the SWMR feature in newer h5py versions.
There is a way to compile the HDF5 library to be thread-safe, but the h5py version from my distribution was not built that way. Instead of opening the file in the parent process, I opened the file in each worker for each read (which has a certain overhead, but worked for me), for example:

with h5py.File(file_name, 'r', libver='latest', swmr=True) as f:
    assert f.swmr_mode
    x = np.array( f[dset_name] )
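
A minimal sketch of a Dataset built around this per-read open; the file path and the 'data'/'label' dataset names are placeholders, and swmr=True assumes the file was written with libver='latest' (drop it otherwise):

    import h5py
    import numpy as np
    import torch
    from torch.utils.data import Dataset

    class SwmrHdf5Dataset(Dataset):
        """Opens the HDF5 file on every read, so no open handle is shared across workers."""
        def __init__(self, file_path):
            self.file_path = file_path
            # Open once just to record the length, then close immediately.
            with h5py.File(file_path, 'r', libver='latest', swmr=True) as f:
                self.length = f['data'].shape[0]

        def __getitem__(self, index):
            with h5py.File(self.file_path, 'r', libver='latest', swmr=True) as f:
                x = torch.from_numpy(np.asarray(f['data'][index])).float()
                y = torch.from_numpy(np.asarray(f['label'][index])).float()
            return x, y

        def __len__(self):
            return self.length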

@RizhaoCai

RizhaoCai commented Feb 20, 2019

@RizhaoCai You can read a HDF5-file with multithreading using the SWMR feature in the newer h5py library version.
There is a way to compile the hdf5 library to be thread-safe, but the h5py-version I got from my distribution was not compiled this way. Instead of opening the file in the parent process, I opened the file in each worker for each read (which has a certain overhead, but worked for me), for example:

with h5py.File(file_name, 'r', libver='latest', swmr=True) as f:
    assert f.swmr_mode
    x = np.array( f[dset_name] )

Thanks! However, I added this into my code:

    from torch.utils.data import Dataset, DataLoader
    import h5py as h5  # added for completeness; the original snippet uses the `h5` alias

    class H5Dataset(Dataset):
        def __init__(self, h5db_path):
            h5db = h5.File(h5db_path, "r", libver='latest', swmr=True)
            self.X = h5db["faces"]
            self.labels = h5db["labels"]

    db_path = "train.h5"
    h5db = H5Dataset(db_path)
    # h5.File(h5db_path, "r", libver='latest', swmr=True) is opened inside H5Dataset
    data_op = DataLoader(h5db, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)

I still got the error:

    OSError: Can't read data (wrong B-tree signature)

If I add the code below at the top:

    import torch.multiprocessing as mp
    mp.set_start_method('spawn')

a new error occurs:

    TypeError: can't pickle _thread._local objects

Any ideas?

@RizhaoCai

RizhaoCai commented Feb 20, 2019

(Quoting @Vandermode's earlier comment about adding mp.set_start_method('spawn'), the "Can't pickle local object" error, and the "context has already been set" error.)

I encountered the same problem. Did you solve it? I mean, did you get it to work with num_workers > 1?

@Mengman

Mengman commented Oct 12, 2019

I got this working in my code.
My h5py version is 2.10.0.
Just enable SWMR mode:

    h5py.File(file_path, 'r', libver='latest', swmr=True)

and do not set the torch multiprocessing start method to 'spawn', i.e. do not add:

    import torch.multiprocessing as mp
    mp.set_start_method('spawn')

@turtleizzy

(Quoting @Mengman's SWMR suggestion above.)

I couldn't get DataLoader to work for num_workers>1 even with this trick.

@kl456123

Forget the complicated workarounds; the simplest approach is to add a lock with multiprocessing.Lock.
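
A minimal sketch of that lock-based approach, assuming the default fork start method on Linux (a multiprocessing.Lock cannot be pickled, so this would not work with 'spawn'); the 'data' and 'label' dataset names are placeholders:

    import h5py
    import torch
    from torch.utils.data import Dataset
    from multiprocessing import Lock

    class LockedHdf5Dataset(Dataset):
        """Serializes all HDF5 reads through a single lock shared by the forked workers."""
        def __init__(self, file_path):
            self.lock = Lock()               # inherited by the forked worker processes
            self.hf = h5py.File(file_path, 'r')
            self.data = self.hf['data']      # placeholder dataset names
            self.target = self.hf['label']

        def __getitem__(self, index):
            with self.lock:                  # only one worker reads from the file at a time
                x = torch.from_numpy(self.data[index]).float()
                y = torch.from_numpy(self.target[index]).float()
            return x, y

        def __len__(self):
            return self.data.shape[0]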

@jshi31

jshi31 commented Aug 31, 2020

Do not call h5py.File(file_path, 'r') in the __init__ function; open the file in __getitem__ instead, and check whether it has already been opened. For example:

    def __getitem__(self, item):
        if self.env is None:
            self.env = h5py.File(self.hf_path, 'r')

This way each worker process opens its own handle, and the HDF5 file is not read through a handle shared across processes.
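
Put together, a minimal sketch of that lazy-open pattern (the 'data' and 'label' dataset names and the length bookkeeping are placeholders, not from the original snippet):

    import h5py
    import torch
    from torch.utils.data import Dataset

    class LazyHdf5Dataset(Dataset):
        """Defers opening the file until the first read inside each worker process."""
        def __init__(self, file_path):
            self.file_path = file_path
            self.hf = None
            # Read the length once in the parent, then close so no open handle is inherited.
            with h5py.File(file_path, 'r') as f:
                self.length = f['data'].shape[0]

        def __getitem__(self, index):
            if self.hf is None:              # first call inside this worker process
                self.hf = h5py.File(self.file_path, 'r')
            x = torch.from_numpy(self.hf['data'][index]).float()
            y = torch.from_numpy(self.hf['label'][index]).float()
            return x, y

        def __len__(self):
            return self.length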

@thinkerww

Has anyone solved this? I ran into the same problem too.

@HarveyYan

HarveyYan commented Jul 20, 2021

Just want to add one small point here... When you do what @lumaku suggested (which really works), i.e. open the HDF5 file in each worker process, make sure that you don't have any open HDF5 handle in the parent process (or anywhere else in the program); otherwise it will still throw errors such as "Can't read data (inflate() failed)".
