fix(remount): relocate libraries along with their symlinks #255
Thank you for this contribution!
I validated this fix on a Fedora 40 system with the NVIDIA runtime etc. installed (so no `/usr/lib/x86_64-linux-gnu`), using both Docker (v27.0.2) and K3s (v1.29.6). (Edit: for posterity, also verified working on an AL2 EKS cluster.)
In both cases, I was able to successfully build the image `nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2`, run `nvidia-smi`, and run `/tmp/vectorAdd`.
The only things I would like to see changed are:
- More commenting for future readers/code-spelunkers
- Remove references to PPC arch; I don't know if we will ever support this.
There may also be a similar workaround needed for AMD/Vulkan cards, but this can be tested separately.
👍 👍
(cherry picked from commit 46a78fb)
Is there any way to get this working without requiring privileged mode? I’ve been experimenting with the envbuilder build process using NVIDIA GPU images on EKS, and so far the only reliable way I’ve found to make the build succeed is by setting the container as privileged. Unfortunately, this introduces a significant issue: the container ends up seeing all GPUs on the host, even when only a single GPU is requested. This breaks GPU isolation and becomes a real problem in multi-user environments where proper resource separation is critical.
Has anyone found a way to:
- get the build to succeed without privileged mode, or
- preserve GPU isolation (only the requested GPU visible) when privileged mode is required?
Any insights would be greatly appreciated!
This PR adds:
After that the container should behave like a regular container created by the NVIDIA container runtime. Of course/unfortunately, the process of mounting/unmounting requires GPU containers to run with privileges:
https://www.man7.org/linux/man-pages/man2/mount.2.html
https://www.man7.org/linux/man-pages/man2/umount.2.html
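Purely for illustration (this is not code from this PR, and the paths are assumptions), the remount step comes down to bind mounts along these lines, which is what requires CAP_SYS_ADMIN:

```go
// Minimal sketch, not the PR's actual code: the kind of bind mount the
// remount step performs. mount(2) requires CAP_SYS_ADMIN, which is why
// GPU containers currently have to run privileged. Paths are illustrative.
package main

import (
	"log"
	"syscall"
)

func main() {
	// Bind-mount the relocated NVIDIA libraries back over their original
	// location inside the new root filesystem (hypothetical paths).
	src := "/.envbuilder/nvidia/lib"
	dst := "/usr/lib/x86_64-linux-gnu"
	if err := syscall.Mount(src, dst, "", syscall.MS_BIND, ""); err != nil {
		log.Fatalf("bind mount failed (missing CAP_SYS_ADMIN?): %v", err)
	}
}
```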
The logic is not generalized to any symlinks or any directories, it only aims at providing compatibility with the NVIDIA container runtime for now.
More context can be found in this comment: #143 (comment)
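To make the symlink part concrete: the point of relocating symlinks along with the libraries is that chains like `libcuda.so -> libcuda.so.1 -> libcuda.so.<version>` must keep resolving after the files are moved. Below is a minimal, hypothetical Go sketch (illustrative paths and names, not the actual envbuilder implementation) that copies a directory while recreating symlinks instead of dereferencing them:

```go
// relocate.go: hypothetical sketch of moving a library directory while
// preserving its symlink structure. Not the envbuilder implementation.
package main

import (
	"io"
	"log"
	"os"
	"path/filepath"
)

// relocateWithSymlinks copies the entries of src into dst. Symlinks are
// recreated with their original (typically relative) targets rather than
// being dereferenced into duplicate regular files.
func relocateWithSymlinks(src, dst string) error {
	if err := os.MkdirAll(dst, 0o755); err != nil {
		return err
	}
	entries, err := os.ReadDir(src)
	if err != nil {
		return err
	}
	for _, e := range entries {
		from := filepath.Join(src, e.Name())
		to := filepath.Join(dst, e.Name())
		info, err := os.Lstat(from)
		if err != nil {
			return err
		}
		switch {
		case info.Mode()&os.ModeSymlink != 0:
			// Recreate the link as a link, pointing at the same target.
			target, err := os.Readlink(from)
			if err != nil {
				return err
			}
			if err := os.Symlink(target, to); err != nil && !os.IsExist(err) {
				return err
			}
		case info.IsDir():
			// Subdirectories are out of scope for this sketch.
			continue
		default:
			if err := copyFile(from, to, info.Mode()); err != nil {
				return err
			}
		}
	}
	return nil
}

func copyFile(from, to string, mode os.FileMode) error {
	in, err := os.Open(from)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.OpenFile(to, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, mode.Perm())
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, in)
	return err
}

func main() {
	// Hypothetical example: mirror the runtime-injected libraries before
	// the build filesystem is remounted.
	if err := relocateWithSymlinks("/usr/lib/x86_64-linux-gnu", "/.envbuilder/nvidia/lib"); err != nil {
		log.Fatal(err)
	}
}
```

The key detail is using `os.Lstat` and `os.Readlink` so symlinks survive the move as symlinks; copying with `os.Stat`/`io.Copy` alone would flatten `libcuda.so` and `libcuda.so.1` into full duplicates of the versioned library.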
Tested with the following images:
docker.io/library/debian:bookworm
docker.io/library/fedora:40
nvcr.io/nvidia/pytorch:24.05-py3
Closes #143