PyTorch: increasing --shm-size to allow multiprocessing data loaders #937

Closed
ksanjeevan opened this issue Jul 15, 2019 · 23 comments

@ksanjeevan

ksanjeevan commented Jul 15, 2019

Reference: MLFW-2582

System Information

  • Framework: PyTorch
  • Framework Version: 1.1.0
  • Python Version: py3
  • CPU or GPU: Both
  • Python SDK Version: 1.32.0
  • Are you using a custom image: Yes, N/A

Describe the problem

When using a data loader with multiprocessing in PyTorch (set num_workers > 0), the following error comes up:

algo-1-xxsq9_1  |   File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 519, in _try_get_batch
algo-1-xxsq9_1  |     raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
algo-1-xxsq9_1  | RuntimeError: DataLoader worker (pid(s) 76) exited unexpectedly
tmpitprkm4o_algo-1-xxsq9_1 exited with code 1

This is because /dev/shm is only 64M by default. The solution seems to be simply passing --shm-size with a higher value to docker run, but when using SageMaker that option isn't available.

I've extended the PyTorch container and added a bunch of custom packages and settings in the Dockerfile with no problem, but I can't set runtime flags/args meant to be passed to docker run. (Note: docker build has a --shm-size arg, but that is NOT related to the /dev/shm size of the final container.)

This becomes a huge bottleneck since training is very slow with num_workers = 0. Can't we just increase the default shared memory? Or provide an easier way for users to set it using the SageMaker SDK?

  • Exact command to reproduce:
    Any example using DataLoader with num_workers > 0; a minimal sketch is shown below.
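
For concreteness, here is a minimal, self-contained sketch of the kind of script that triggers the failure when /dev/shm is small; the dataset and tensor sizes are illustrative, not taken from the original report:

import torch
from torch.utils.data import Dataset, DataLoader

class RandomImages(Dataset):
    """Yields moderately large tensors so worker processes need shared memory to pass batches back."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return torch.randn(3, 512, 512)

# num_workers > 0 makes workers hand batches to the main process through /dev/shm;
# with the 64M default this typically dies with "DataLoader worker ... exited unexpectedly".
loader = DataLoader(RandomImages(), batch_size=32, num_workers=4)

for batch in loader:
    pass
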
@ChoiByungWook
Contributor

Hello @ksanjeevan,

Thanks for bringing this issue to our attention.

Let me reach out to the team that owns the training platform.

@ishaaq
Contributor

ishaaq commented Sep 5, 2019

Are you using local mode training?

If you're not using local mode then, in fact, Docker containers running in SageMaker training do NOT use the 64MB default shm-size; we adjust it depending on the instance type.

You didn't mention which instance type you're using; which is it? And how much shared memory is your algorithm meant to use at most?

@ksanjeevan
Author

Yeah, this was using local mode; remote seems to be fine.

@laurenyu
Contributor

laurenyu commented Sep 5, 2019

thanks for the clarification! So it seems that we would need to find a way to expose shm-size as an option that would then get written into the docker-compose.yml file that is used for local mode. I'll open up an item in our internal backlog, which we're always reprioritizing based on feedback.

some potentially helpful links for anyone wanting to take a stab at this:

@ksanjeevan
Author

Hi @laurenyu, yes thanks! In an ideal world, we could pass some kind of run_hyperparameters dictionary so that we can add any flag for docker run, but having shm-size is good enough.

@bill10

bill10 commented Feb 19, 2020

Just wondering if this issue has been resolved. This feature would really make debugging much easier.

@austinmw

austinmw commented Mar 24, 2020

This is how you can monkey-patch the SageMaker SDK to enable multiprocessing in local mode:

Example SageMaker SDK location:
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/image.py:

def _create_docker_host(self, host, environment, optml_subdirs, command, volumes):
    """
    Args:
        host:
        environment:
        optml_subdirs:
        command:
        volumes:
    """
    optml_volumes = self._build_optml_volumes(host, optml_subdirs)
    optml_volumes.extend(volumes)

    # Added block: size /dev/shm to 95% of total system memory
    from psutil import virtual_memory
    mem_total_gb = virtual_memory().total / (1024 ** 3)
    shm_size = '{}gb'.format(int(mem_total_gb * 0.95))

    host_config = {
        "image": self.image,
        "stdin_open": True,
        "tty": True,
        "volumes": [v.map for v in optml_volumes],
        "environment": environment,
        "command": command,
        "networks": {"sagemaker-local": {"aliases": [host]}},
        "shm_size": shm_size # Added line
    }
...

@laurenyu any chance we could get a PR with an option like this?
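
If editing the installed image.py isn't convenient, roughly the same effect can be had by patching the method at runtime, in the same process, before calling the local-mode estimator's fit(). A rough sketch, assuming the private _SageMakerContainer._create_docker_host internals shown above (private APIs like this can change between SDK versions, and the 8gb value is only an example):

import sagemaker.local.image as sm_image

# Keep a handle on the original (private) method.
_original_create_docker_host = sm_image._SageMakerContainer._create_docker_host

def _create_docker_host_with_shm(self, *args, **kwargs):
    # Build the normal host config, then raise the shared memory size
    # that local mode writes into its generated docker-compose.yml.
    config = _original_create_docker_host(self, *args, **kwargs)
    config["shm_size"] = "8gb"  # pick a value appropriate for your machine's RAM
    return config

sm_image._SageMakerContainer._create_docker_host = _create_docker_host_with_shm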

@VladVin

VladVin commented Nov 2, 2020

Any updates? The feature would indeed be useful for debugging locally.

@VladVin

VladVin commented Nov 3, 2020

As a workaround, I found a solution: change the default parameters of the Docker daemon. This may be suitable for those who have the rights to change them:

  1. Open this file in your editor:
     sudo vim /etc/docker/daemon.json
  2. Add the option "default-shm-size": "13G", as mentioned in the Docker docs. You can specify another value; I just set 13G because I have 16GB of RAM on my server.
  3. Restart the Docker daemon:
     sudo systemctl restart docker
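
For reference, after step 2 the /etc/docker/daemon.json would contain something like the following (13G is just the value from the example above; choose what fits your machine's RAM):

{
    "default-shm-size": "13G"
}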

@ivan-saptech

(Quoting @VladVin's Docker daemon workaround above.)

Thanks! This is the only valid solution to this issue in my case. Finally found your solution after a few hours of searching.

@arulbharathi

(Quoting @ivan-saptech's reply to @VladVin's workaround above.)

@ivan-saptech will we be able to deploy the same into SageMaker, and will the SageMaker endpoint get the same shm-size configuration?

CC - @VladVin

@lou-k

lou-k commented May 19, 2021

This is happening in sagemaker studio as well. Is there a way to adjust the studio settings?

@jacquesboitreaud

This is also a bottleneck in SageMaker processing jobs, where we cannot pass arguments to the docker run command. Any way to fix this?

@NeilJonkers
Contributor

For processing jobs, there is currently no workaround to set --shm-size.

@bertocast

Any updates on this?

@lou-k

lou-k commented Mar 21, 2022

I raised the issue with the SageMaker Studio team nearly a year ago, but they have not taken any action.

@titanium-cranium

It's impossible to train popular YOLOv5 models in Studio with such low shared memory (64M). Launching a single Notebook instance makes it possible, as shared memory on the instance I spun up (ml.g4dn.4xlarge) was 32G. If your premier service, Studio, is less capable than the older Notebooks, this should be called out and/or prioritised. I spent quite some time trying to work out how I could use Studio before I found this post and simply dumped it in favour of a Notebook.

@knikure
Contributor

knikure commented Oct 3, 2022

Thank you for reaching out! The latest SageMaker Studio release now offers increased shared memory on instances running Studio Apps. The memory size scales in proportion to the size of the instance being used. Please update your Studio Apps as described here, to see these improvements.

@knikure knikure closed this as completed Oct 3, 2022
@Ajay-Reva

Ajay-Reva commented Oct 11, 2022

@knikure What should I do when running from a notebook? I'm facing the same issue with my processing job when running it from a notebook.

@Ajay-Reva

I'm running this on AWS studio with a notebook and my own processing container.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.6/queue.py", line 173, in get
    self.not_empty.wait(remaining)
  File "/opt/conda/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 193) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 388, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1003, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 960, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 821, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 193) exited unexpectedly

@thvasilo

Hello, is it possible to set the size of /dev/shm now for SageMaker training jobs? We're running into the same issue.

@ntw-au

ntw-au commented Feb 16, 2024

Hello, is it possible to set the size of /dev/shm now for SageMaker training jobs? We're running into the same issue.

I'm also keen to have either higher or configurable shared memory limits. We are using SageMaker Training with PyTorch in a custom container and use shared memory to cache the whole dataset and therefore avoid returning to S3 every epoch. This works on ml.p4d.24xlarge but not ml.g5.48xlarge, where the process runs out of shared memory. I really want to have the whole of system RAM available for caching, like I would when running outside a container.

@gui-miotto

I solved my shared memory issues by setting pin_memory to True in all dataloaders (train, valid, test, etc.)
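
For anyone trying that suggestion, a minimal sketch of what it looks like (the in-memory datasets here are placeholders for your own train/valid/test datasets; pin_memory copies batches into page-locked host memory for faster host-to-GPU transfer, so whether it also relieves /dev/shm pressure may depend on your setup):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder datasets purely for illustration; substitute your real ones.
train_ds = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
valid_ds = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))

# pin_memory=True keeps each returned batch in page-locked (pinned) host memory.
train_loader = DataLoader(train_ds, batch_size=32, num_workers=4, pin_memory=True)
valid_loader = DataLoader(valid_ds, batch_size=32, num_workers=4, pin_memory=True)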
