Investigate GPU support #143
Comments
@BrunoQuaresma We may need to write a mini-RFC describing the status quo. |
After talking to @mtojek, I think I have a good plan:
|
@bpmct I tried to use envbuilder in a GPU environment and it worked as expected. Here is how I did it:
This is the output:
@bpmct do you think we can get more details from the user? |
I am closing this for now until we have more context from the user. |
Try using the NVIDIA k8s device plugin (DaemonSet) and not an NVIDIA container image, e.g. https://github.com/NVIDIA/k8s-device-plugin. This is the recommended approach when using GPU-enabled node pools on Azure Linux. |
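For reference, a minimal sketch of what deploying the device plugin DaemonSet looks like (the release tag and manifest path below are assumptions; see the linked README for the current install instructions):

```bash
# Deploy the NVIDIA device plugin as a DaemonSet (release tag/path assumed;
# follow the k8s-device-plugin README for the current instructions).
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# The GPU nodes should then advertise the nvidia.com/gpu resource.
kubectl describe nodes | grep -A3 'nvidia.com/gpu'
```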
@marrotte FYI while we tested using GKE, the cluster we tested on does use the device plugin. However, it appears to be a customized version for GKE COS, and I will freely admit that cluster is a bit old. What Kubernetes version are you seeing issues with on AKS?
This container image is the one NVIDIA recommends to test GPU support (cf. https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#running-gpu-jobs). What image would you recommend instead as a test? I note that the Azure docs you linked reference a separate MNIST test image. |
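For anyone who wants to run that check themselves, the "Running GPU Jobs" smoke test from that README boils down to something like the pod below (the sample image tag is an assumption; use whichever tag the README currently pins):

```bash
# Sketch of the README's GPU smoke test: a pod that runs the CUDA vectorAdd
# sample against a single requested GPU. The image tag below is an assumption.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF

# The CUDA vectorAdd sample prints "Test PASSED" on success.
kubectl logs gpu-pod
```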
@johnstcn I'm seeing the issue on:
|
@BrunoQuaresma
|
Oh, I fixed it by
|
@BrunoQuaresma I don't think I can test that, as my envbuilder fails to build. I believe @nikawang applied that fix either to a running envbuilder container or to the running container built by envbuilder. I did try applying @nikawang's fix to the AKS/K8s GPU node, in case that was where it was applied, and it had no effect. |
@marrotte could you please share step-by-step instructions for setting up a similar k8s cluster, or a Terraform file I can just run? |
The easiest way to get the right environment to reproduce this is likely the gpu-operator for Kubernetes or the NVIDIA Container Toolkit for Docker. I believe #183 (and #249) can provide a workaround here by temporarily remounting the paths out of the way instead of trying to ignore them in Kaniko, although note that mount/umount require privileges.
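If it helps, a rough sketch of both reproduction routes (chart/repo names and image tags are assumptions on my part; check the NVIDIA docs for current values):

```bash
# Option A: Kubernetes via the NVIDIA gpu-operator (manages driver, device
# plugin, and container toolkit on the nodes). Chart name/flags assumed.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Option B: plain Docker with the NVIDIA Container Toolkit installed on the
# host; the classic smoke test is running nvidia-smi in a CUDA base image.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```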
The runtime mounts libraries with symlinks:
The symlinks must also be preserved. The location in the … I am using this quick-and-dirty script afterward to get things working:

remount_and_resymlink.sh:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Debian/Ubuntu multiarch directory where the dynamic linker looks for libraries.
TARGET=/usr/lib/x86_64-linux-gnu

# Derive the driver version from the firmware directory name.
FIRMWARES=(/lib/firmware/nvidia/*)
VERSION="${FIRMWARES[0]}"
VERSION="${VERSION##*/}"

# For every driver library the NVIDIA runtime mounted under /usr/lib64,
# bind-mount it into the multiarch directory and drop the original mount.
mount | awk '/\/usr\/lib64/{print $3}' | while read -r path; do
  lib="${path##*/}"
  mkdir -p "${TARGET}"
  touch "${TARGET}/${lib}"
  mount --bind "${path}" "${TARGET}/${lib}"
  umount "${path}"

  # SONAME suffix for the symlink; empty means no symlink for this library.
  n=""
  case "${lib}" in
    libnvidia-pkcs11.so.*) ;;
    libnvidia-pkcs11-openssl3.so.*) ;;
    libnvidia-nvvm.so.*)
      n=4
      ;;
    *)
      n=1
      ;;
  esac
  if [[ -n "${n:-}" ]]; then
    # e.g. libnvidia-ml.so.<version> -> libnvidia-ml.so.1
    ln -s "${lib}" "${TARGET}/${lib%"${VERSION}"}${n}"
  fi
  if [[ "${lib}" == "libcuda.so."* ]]; then
    # libcuda.so additionally needs the bare, unversioned symlink.
    ln -s "${lib%"${VERSION}"}${n}" "${TARGET}/${lib%".${VERSION}"}"
  fi
done
```

This is the logic the runtime uses to pick the library directory: https://github.com/NVIDIA/libnvidia-container/blob/v1.15.0/src/nvc_container.c#L151-L188
And this looks like the libraries it can potentially mount: https://github.com/NVIDIA/libnvidia-container/blob/v1.15.0/src/nvc_info.c#L75-L139
Once that is all in place, the … In an image with …
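A quick sanity check after running the script, before trying a real workload (paths match the TARGET used above):

```bash
# The re-created symlinks should now exist in the multiarch directory...
ls -l /usr/lib/x86_64-linux-gnu/libcuda* /usr/lib/x86_64-linux-gnu/libnvidia-ml*

# ...and the driver userspace tools should see the GPU.
nvidia-smi
```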
The manifest I am using at the moment:

pod.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: envbuilder
spec:
  containers:
    - name: envbuilder
      image: ghcr.io/coder/envbuilder-preview
      env:
        - name: FALLBACK_IMAGE
          value: debian
        - name: INIT_SCRIPT
          value: sh -c 'while :; do sleep 86400; done'
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
      resources:
        limits:
          nvidia.com/gpu: "1"
      securityContext:
        privileged: true
```

If you are on GCP/GKE, the above should be valid for Ubuntu nodes (I think, I am not testing there). |
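For completeness, applying the manifest and checking the GPU from inside the workspace looks roughly like this (pod name matches the manifest above):

```bash
# Create the pod and wait for envbuilder to finish building the workspace.
kubectl apply -f pod.yaml
kubectl wait --for=condition=Ready pod/envbuilder --timeout=10m

# Confirm the driver is usable inside the built environment.
kubectl exec -it envbuilder -- nvidia-smi
```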
I've been experimenting with the envbuilder build process using NVIDIA GPU images on EKS, and so far the only way I can get it to succeed is by running the container as privileged. However, this introduces a significant issue: the container ends up seeing all GPUs on the host, even when only one is requested. This breaks GPU isolation and is especially problematic in multi-user environments where resource separation is critical. Here’s the …

I suspect that enabling … Has anyone found a way to: …

Any insights would be greatly appreciated! |
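For what it's worth, a quick way to show the isolation gap is to compare the requested limit with what the container actually enumerates (assuming the default device-plugin setup, which passes the allocation via NVIDIA_VISIBLE_DEVICES):

```bash
# Run inside a pod that requested nvidia.com/gpu: "1". With privileged: true,
# the container typically still enumerates every GPU on the node.
nvidia-smi -L | wc -l                     # GPUs visible to the container
ls /dev/nvidia[0-9]* 2>/dev/null          # GPU device nodes exposed
echo "${NVIDIA_VISIBLE_DEVICES:-unset}"   # allocation requested by the device plugin
```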
Some users will want to mount a GPU to an envbuilder-backed workspace. Can we investigate in which scenarios (if any) this works today and if/how we can patch upstream Kaniko to improve the experience?