PyTorch: increasing --shm-size to allow multiprocessing data loaders #937

Closed
ksanjeevan opened this issue Jul 15, 2019 · 23 comments

@ksanjeevan

ksanjeevan commented Jul 15, 2019

Reference: MLFW-2582

System Information

  • Framework: PyTorch
  • Framework Version: 1.1.0
  • Python Version: py3
  • CPU or GPU: Both
  • Python SDK Version: 1.32.0
  • Are you using a custom image: Yes, N/A

Describe the problem

When using a data loader with multiprocessing in PyTorch (set num_workers > 0), the following error comes up:

algo-1-xxsq9_1  |   File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 519, in _try_get_batch
algo-1-xxsq9_1  |     raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
algo-1-xxsq9_1  | RuntimeError: DataLoader worker (pid(s) 76) exited unexpectedly
tmpitprkm4o_algo-1-xxsq9_1 exited with code 1

This is because /dev/shm is only 64M by default. The solution seems to be simply passing --shm-size with a higher value to docker run, but when using SageMaker that option isn't available.

I've extended the PyTorch container and added a bunch of custom packages and settings in the Dockerfile with no problem, but I can't set runtime flags/args meant to be passed to docker run. (Note: docker build has a --shm-size arg, but that is NOT related to the /dev/shm size of the final container.)

This becomes a huge bottleneck since training is very slow with num_workers = 0. Can't we just increase the default shared memory? Or provide an easier way for users to set it using the SageMaker SDK?

  • Exact command to reproduce:
    Any example using DataLoader with num_workers > 0; a minimal sketch is shown below.
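
For concreteness, here is a minimal, self-contained sketch of the kind of script that triggers the failure when /dev/shm is small; the dataset and tensor sizes are illustrative, not taken from the original report:

import torch
from torch.utils.data import Dataset, DataLoader

class RandomImages(Dataset):
    """Yields moderately large tensors so worker processes need shared memory to pass batches back."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return torch.randn(3, 512, 512)

# num_workers > 0 makes workers hand batches to the main process through /dev/shm;
# with the 64M default this typically dies with "DataLoader worker ... exited unexpectedly".
loader = DataLoader(RandomImages(), batch_size=32, num_workers=4)

for batch in loader:
    pass
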
@ChoiByungWook
Contributor

Hello @ksanjeevan,

Thanks for bringing this issue to our attention.

Let me reach out to the team that owns the training platform.

@ishaaq
Contributor

ishaaq commented Sep 5, 2019

Are you using local mode training?

If you're not using local mode then, in fact, Docker containers running in SageMaker training do NOT use the 64MB default shm-size; we adjust it depending on the instance type.

You didn't mention which instance type you're using; which is it? And how much shared memory is your algorithm meant to use at most?

@ksanjeevan
Author

Yeah, this was using local mode; remote seems to be fine.

@laurenyu
Contributor

laurenyu commented Sep 5, 2019

thanks for the clarification! So it seems that we would need to find a way to expose shm-size as an option that would then get written into the docker-compose.yml file that is used for local mode. I'll open up an item in our internal backlog, which we're always reprioritizing based on feedback.

some potentially helpful links for anyone wanting to take a stab at this:

@ksanjeevan
Author

Hi @laurenyu, yes thanks! In an ideal world, we could pass some kind of run_hyperparameters dictionary so that we can add any flag for docker run, but having shm-size is good enough.

@bill10

bill10 commented Feb 19, 2020

Just wondering if this issue has been resolved. This feature would really make debugging much easier.

@austinmw

austinmw commented Mar 24, 2020

This is how you can monkey-patch the SageMaker SDK to enable multiprocessing in local mode:

Example SageMaker SDK location:
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/image.py:

def _create_docker_host(self, host, environment, optml_subdirs, command, volumes):
    """
    Args:
        host:
        environment:
        optml_subdirs:
        command:
        volumes:
    """
    optml_volumes = self._build_optml_volumes(host, optml_subdirs)
    optml_volumes.extend(volumes)

    # Added block: size /dev/shm to 95% of total system memory
    from psutil import virtual_memory
    mem_total_gb = virtual_memory().total / (1024 ** 3)
    shm_size = '{}gb'.format(int(mem_total_gb * 0.95))

    host_config = {
        "image": self.image,
        "stdin_open": True,
        "tty": True,
        "volumes": [v.map for v in optml_volumes],
        "environment": environment,
        "command": command,
        "networks": {"sagemaker-local": {"aliases": [host]}},
        "shm_size": shm_size # Added line
    }
...

@laurenyu any chance we could get a PR with an option like this?
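
If editing the installed image.py isn't convenient, roughly the same effect can be had by patching the method at runtime, in the same process, before calling the local-mode estimator's fit(). A rough sketch, assuming the private _SageMakerContainer._create_docker_host internals shown above (private APIs like this can change between SDK versions, and the 8gb value is only an example):

import sagemaker.local.image as sm_image

# Keep a handle on the original (private) method.
_original_create_docker_host = sm_image._SageMakerContainer._create_docker_host

def _create_docker_host_with_shm(self, *args, **kwargs):
    # Build the normal host config, then raise the shared memory size
    # that local mode writes into its generated docker-compose.yml.
    config = _original_create_docker_host(self, *args, **kwargs)
    config["shm_size"] = "8gb"  # pick a value appropriate for your machine's RAM
    return config

sm_image._SageMakerContainer._create_docker_host = _create_docker_host_with_shm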

@VladVin

VladVin commented Nov 2, 2020

Any updates? The feature would indeed be useful for debugging locally.

@VladVin

VladVin commented Nov 3, 2020

As a workaround, I found a solution: change the default parameters of the Docker daemon. This may be suitable for those who have the rights to change them:

  1. Open this file in your editor:
     sudo vim /etc/docker/daemon.json
  2. Add the option "default-shm-size": "13G", as mentioned in the Docker docs. You can specify another value; I just set 13G because I have 16GB of RAM on my server.
  3. Restart the Docker daemon:
     sudo systemctl restart docker
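
For reference, after step 2 the /etc/docker/daemon.json would contain something like the following (13G is just the value from the example above; choose what fits your machine's RAM):

{
    "default-shm-size": "13G"
}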

@ivan-saptech

(Quoting @VladVin's Docker daemon workaround above.)

Thanks! This is the only valid solution to this issue in my case. Finally found your solution after a few hours of searching.

@arulbharathi

(Quoting @ivan-saptech's reply to @VladVin's workaround above.)

@ivan-saptech will we be able to deploy the same into SageMaker, and will the SageMaker endpoint get the same shm-size configuration?

CC - @VladVin

@lou-k

lou-k commented May 19, 2021

This is happening in sagemaker studio as well. Is there a way to adjust the studio settings?

@jacquesboitreaud

This is also a bottleneck in SageMaker processing jobs, where we cannot pass arguments to the docker run command. Any way to fix this?

@NeilJonkers
Contributor

For processing jobs, there is currently no workaround to set --shm-size.

@bertocast

Any updates on this?

@lou-k

lou-k commented Mar 21, 2022

I raised the issue with the SageMaker Studio team nearly a year ago, but they have not taken any action.

@titanium-cranium

It's impossible to train popular YOLOv5 models in Studio with such low shared memory (64M). Launching a single Notebook instance makes it possible, as shared memory on the instance I spun up (ml.g4dn.4xlarge) was 32G. If your premier service, Studio, is less capable than the older Notebooks, this should be called out and/or prioritised. I spent quite some time trying to work out how I could use Studio before I found this post and simply dumped it in favour of a Notebook.

@knikure
Contributor

knikure commented Oct 3, 2022

Thank you for reaching out! The latest SageMaker Studio release now offers increased shared memory on instances running Studio Apps. The memory size scales in proportion to the size of the instance being used. Please update your Studio Apps as described here, to see these improvements.

@knikure knikure closed this as completed Oct 3, 2022
@Ajay-Reva

Ajay-Reva commented Oct 11, 2022

@knikure What should I do when running from a notebook? I'm facing the same issue with my processing job when running it from a notebook.

@Ajay-Reva

I'm running this on AWS studio with a notebook and my own processing container.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.6/queue.py", line 173, in get
    self.not_empty.wait(remaining)
  File "/opt/conda/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 193) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 388, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1003, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 960, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 821, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 193) exited unexpectedly

@thvasilo

Hello, is it possible to set the size of /dev/shm now for SageMaker training jobs? We're running into the same issue.

@ntw-au

ntw-au commented Feb 16, 2024

Hello, is it possible to set the size of /dev/shm now for SageMaker training jobs? We're running into the same issue.

I'm also keen to have either higher or configurable shared memory limits. We are using SageMaker Training with PyTorch in a custom container and use shared memory to cache the whole dataset and therefore avoid returning to S3 every epoch. This works on ml.p4d.24xlarge but not ml.g5.48xlarge, where the process runs out of shared memory. I really want to have the whole of system RAM available for caching, like I would when running outside a container.

@gui-miotto

I solved my shared memory issues by setting pin_memory to True in all dataloaders (train, valid, test, etc.)
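
For anyone trying that suggestion, a minimal sketch of what it looks like (the in-memory datasets here are placeholders for your own train/valid/test datasets; pin_memory copies batches into page-locked host memory for faster host-to-GPU transfer, so whether it also relieves /dev/shm pressure may depend on your setup):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder datasets purely for illustration; substitute your real ones.
train_ds = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
valid_ds = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))

# pin_memory=True keeps each returned batch in page-locked (pinned) host memory.
train_loader = DataLoader(train_ds, batch_size=32, num_workers=4, pin_memory=True)
valid_loader = DataLoader(valid_ds, batch_size=32, num_workers=4, pin_memory=True)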
