PyTorch: increasing --shm-size to allow multiprocessing data loaders #937
Comments
Hello @ksanjeevan, Thanks for bringing this issue to our attention. Let me reach out to the team that owns the training platform. |
Are you using local mode training? If you're not using local mode, then in fact Docker containers running in SageMaker training do NOT use the 64MB default shm-size - we adjust it depending on the instance type. You didn't mention the instance type you're using - which is it? And how much shared memory is your algorithm meant to use? |
Yeah this was using local mode, using remote seems to be fine. |
Thanks for the clarification! So it seems that we would need to find a way to expose shm-size as an option that would then get written into the docker-compose.yml file that is used for local mode. I'll open up an item in our internal backlog, which we're always reprioritizing based on feedback. Some potentially helpful links for anyone wanting to take a stab at this:
|
Hi @laurenyu, yes thanks! In an ideal world, we could pass some kind of |
just wondering if this issue is resolved. This feature will really make debugging much easier. |
This is how you can monkey patch the sagemaker SDK to enable multiprocessing in local mode. Example SageMaker SDK location:

```python
def _create_docker_host(self, host, environment, optml_subdirs, command, volumes):
    """
    Args:
        host:
        environment:
        optml_subdirs:
        command:
        volumes:
    """
    optml_volumes = self._build_optml_volumes(host, optml_subdirs)
    optml_volumes.extend(volumes)

    # Added block: give the container ~95% of total system memory as /dev/shm
    from psutil import virtual_memory
    mem = virtual_memory()  # mem.total is in bytes
    shm_size = '{}gb'.format(max(1, int(mem.total * 0.95 // (1024 ** 3))))

    host_config = {
        "image": self.image,
        "stdin_open": True,
        "tty": True,
        "volumes": [v.map for v in optml_volumes],
        "environment": environment,
        "command": command,
        "networks": {"sagemaker-local": {"aliases": [host]}},
        "shm_size": shm_size,  # Added line
    }
    ...
```

@laurenyu any chance we could get a PR with an option like this? |
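For anyone wanting to try this, a minimal sketch of how such a patch could be wired in before launching a local-mode job - assuming the private module path and class name from the SDK version discussed here (sagemaker.local.image._SageMakerContainer), which may differ in newer releases:

```python
# Hedged sketch: swap in the modified method at runtime, before constructing the
# estimator. _SageMakerContainer is a private class and its API may change between
# sagemaker SDK versions; _create_docker_host is the patched function shown above.
import sagemaker.local.image as sm_image

sm_image._SageMakerContainer._create_docker_host = _create_docker_host
```

After this, an estimator launched with instance_type='local' should pick up the larger shm_size when the docker-compose host config is generated.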
Any updates? The feature would indeed be useful for debugging locally |
As a workaround I found a solution to change the default parameters of the Docker daemon. May be suitable for those who have rights to change them:
|
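A hedged sketch of the daemon-level workaround described above, assuming it refers to Docker's default-shm-size setting in /etc/docker/daemon.json (the 4G value is an arbitrary example, not the commenter's actual setting):

```json
{
  "default-shm-size": "4G"
}
```

This is followed by restarting the daemon (e.g. sudo systemctl restart docker). It raises the default for every container the daemon starts, including the ones local mode launches, as long as they don't set shm_size explicitly.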
Thanks! This is the only valid solution to this issue in my case. Finally found your solution after a few hours of searching. |
@ivan-saptech will we be able to deploy the same into SageMaker, so that the SageMaker endpoint gets the same shm-size configuration? CC - @VladVin |
This is happening in sagemaker studio as well. Is there a way to adjust the studio settings? |
This is also a bottleneck in the sagemaker processing jobs, where we cannot pass arguments to the |
For processing jobs there exists no workaround to set |
Any updates on this? |
I raised this issue with the SageMaker Studio team nearly a year ago, but they have not taken any action. |
It is impossible to train popular YOLOv5 models in Studio with such low shared memory (64M). Launching a single Notebook instance makes it possible, as shared memory on the instance I spun up (ml.g4dn.4xlarge) was 32G. If your premier service, Studio, is less capable than the older Notebooks, this should be called out and/or prioritised. I spent quite some time trying to work out how I could use Studio before I found this post and simply dumped it in favour of a Notebook. |
Thank you for reaching out! The latest SageMaker Studio release now offers increased shared memory on instances running Studio Apps. The memory size scales in proportion to the size of the instance being used. Please update your Studio Apps as described here, to see these improvements. |
@knikure What do I do when I run from the notebook? Because I'm facing the same issue with my processing job when running from the notebook. |
I'm running this on AWS studio with a notebook and my own processing container.
|
Hello, is it possible to set the size of |
I'm also keen to have either higher or configurable shared memory limits. We are using SageMaker Training with PyTorch in a custom container and use shared memory to cache the whole dataset and therefore avoid returning to S3 every epoch. This works on |
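A minimal sketch of the shared-memory caching pattern described above, assuming a File-mode training channel at the standard /opt/ml/input/data/training path (this is not the commenter's actual code; adjust the channel name to your job, and note the dataset must fit in /dev/shm):

```python
import os
import shutil

# Illustrative sketch: copy the training channel into /dev/shm once at startup so
# every epoch reads from RAM-backed storage instead of going back to S3 or disk.
SHM_CACHE = "/dev/shm/dataset"

def cache_dataset(channel_dir="/opt/ml/input/data/training"):
    """Copy the downloaded channel into shared memory if not already cached."""
    if not os.path.isdir(SHM_CACHE):
        shutil.copytree(channel_dir, SHM_CACHE)
    return SHM_CACHE
```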
I solved my shared memory issues by setting |
Reference: MLFW-2582
System Information
Describe the problem
When using a data loader with multiprocessing in PyTorch (set `num_workers > 0`), the following error comes up:

This is because `/dev/shm` is at only 64M by default. The solution to this seems to be simply passing `--shm-size` with a higher value to `docker run`, but if one is using `sagemaker` that option isn't there. I've extended the PyTorch container and added a bunch of custom packages and settings in the Dockerfile with no problem, but can't set runtime flags/args meant to be passed to `docker run`. (Note: `docker build` has a `--shm-size` arg, but that is NOT related to the `/dev/shm` size of the final container.)

This becomes a huge bottleneck since training is very slow with `num_workers = 0`. Can't we just increase the default shared memory? Or provide an easier way for users to set it using the sagemaker sdk?

Any example using `DataLoader` with `num_workers > 0`.
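Since the report points at "any example using DataLoader with num_workers > 0", here is a minimal illustrative sketch (not taken from the original report) of a loader that exercises shared memory; with /dev/shm capped at 64M this kind of setup typically fails with a bus error or an "unable to write to file </torch_...>" RuntimeError from a worker:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class RandomImages(Dataset):
    """Generates image-sized tensors on the fly; worker processes return batches
    to the main process through shared memory (/dev/shm)."""
    def __len__(self):
        return 1_000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224), 0

if __name__ == "__main__":
    # num_workers > 0 turns on multiprocessing; this is the setting that hits the
    # 64MB /dev/shm limit in the default container configuration.
    loader = DataLoader(RandomImages(), batch_size=64, num_workers=4)
    for images, labels in loader:
        pass
```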