Investigate GPU support #143
Comments
@BrunoQuaresma We may need to write a mini-RFC describing the status quo. |
After talking to @mtojek, I think I have a good plan:
|
@bpmct I tried to use envbuilder in a GPU environment and it worked as expected. Here is how I did it:
This is the output:
@bpmct do you think we can get more details from the user? |
I am closing this for now until we have more context from the user. |
Try using the NVIDIA k8s device plugin (DaemonSet) and not an NVIDIA container image, e.g. https://github.com/NVIDIA/k8s-device-plugin. This is the recommended approach when using GPU-enabled node pools on Azure Linux. |
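For reference, a minimal sketch of what deploying the device plugin DaemonSet looks like (the release tag and manifest path below are assumptions; see the linked README for the current install instructions):

```bash
# Deploy the NVIDIA device plugin as a DaemonSet (release tag/path assumed;
# follow the k8s-device-plugin README for the current instructions).
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# The GPU nodes should then advertise the nvidia.com/gpu resource.
kubectl describe nodes | grep -A3 'nvidia.com/gpu'
```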
@marrotte FYI while we tested using GKE, the cluster we tested on does use the device plugin. However, it appears to be a customized version for GKE COS, and I will freely admit that cluster is a bit old. What Kubernetes version are you seeing issues with on AKS?
This container image is the one NVIDIA recommends to test GPU support (cf. https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#running-gpu-jobs). What image would you recommend instead as a test? I note that the Azure docs you linked reference a separate MNIST test image. |
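For anyone who wants to run that check themselves, the "Running GPU Jobs" smoke test from that README boils down to something like the pod below (the sample image tag is an assumption; use whichever tag the README currently pins):

```bash
# Sketch of the README's GPU smoke test: a pod that runs the CUDA vectorAdd
# sample against a single requested GPU. The image tag below is an assumption.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF

# The CUDA vectorAdd sample prints "Test PASSED" on success.
kubectl logs gpu-pod
```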
@johnstcn I'm seeing the issue on:
|
@BrunoQuaresma
|
Oh, I fixed it by
|
@BrunoQuaresma I don't think I can test that, as my envbuilder fails to build. I believe @nikawang applied that fix either to a running envbuilder container or to the running container built by envbuilder. I did try applying @nikawang's fix to the AKS/K8s GPU node, in case that was where it was applied, and it had no effect. |
@marrotte could you please share step-by-step instructions for setting up a similar k8s cluster, or a Terraform file I can just run? |
The easiest way to get the right environment to reproduce this is likely the gpu-operator for Kubernetes or the NVIDIA Container Toolkit for Docker. I believe #183 (and #249) can provide a workaround here by temporarily remounting the paths out of the way instead of trying to ignore them in Kaniko, although note that mount/umount require privileges.
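If it helps, a rough sketch of both reproduction routes (chart/repo names and image tags are assumptions on my part; check the NVIDIA docs for current values):

```bash
# Option A: Kubernetes via the NVIDIA gpu-operator (manages driver, device
# plugin, and container toolkit on the nodes). Chart name/flags assumed.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Option B: plain Docker with the NVIDIA Container Toolkit installed on the
# host; the classic smoke test is running nvidia-smi in a CUDA base image.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```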
The runtime mounts libraries with symlinks:
The symlinks must also be preserved. The location in the … I am using this quick-and-dirty script afterward to get things working:

remount_and_resymlink.sh:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Debian/Ubuntu multiarch directory where the dynamic linker looks for libraries.
TARGET=/usr/lib/x86_64-linux-gnu

# Derive the driver version from the firmware directory name.
FIRMWARES=(/lib/firmware/nvidia/*)
VERSION="${FIRMWARES[0]}"
VERSION="${VERSION##*/}"

# For every driver library the NVIDIA runtime mounted under /usr/lib64,
# bind-mount it into the multiarch directory and drop the original mount.
mount | awk '/\/usr\/lib64/{print $3}' | while read -r path; do
  lib="${path##*/}"
  mkdir -p "${TARGET}"
  touch "${TARGET}/${lib}"
  mount --bind "${path}" "${TARGET}/${lib}"
  umount "${path}"

  # SONAME suffix for the symlink; empty means no symlink for this library.
  n=""
  case "${lib}" in
    libnvidia-pkcs11.so.*) ;;
    libnvidia-pkcs11-openssl3.so.*) ;;
    libnvidia-nvvm.so.*)
      n=4
      ;;
    *)
      n=1
      ;;
  esac
  if [[ -n "${n:-}" ]]; then
    # e.g. libnvidia-ml.so.<version> -> libnvidia-ml.so.1
    ln -s "${lib}" "${TARGET}/${lib%"${VERSION}"}${n}"
  fi
  if [[ "${lib}" == "libcuda.so."* ]]; then
    # libcuda.so additionally needs the bare, unversioned symlink.
    ln -s "${lib%"${VERSION}"}${n}" "${TARGET}/${lib%".${VERSION}"}"
  fi
done
```

This is the logic the runtime uses to pick the library directory: https://github.com/NVIDIA/libnvidia-container/blob/v1.15.0/src/nvc_container.c#L151-L188
And this looks like the libraries it can potentially mount: https://github.com/NVIDIA/libnvidia-container/blob/v1.15.0/src/nvc_info.c#L75-L139
Once that is all in place, the … In an image with …
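A quick sanity check after running the script, before trying a real workload (paths match the TARGET used above):

```bash
# The re-created symlinks should now exist in the multiarch directory...
ls -l /usr/lib/x86_64-linux-gnu/libcuda* /usr/lib/x86_64-linux-gnu/libnvidia-ml*

# ...and the driver userspace tools should see the GPU.
nvidia-smi
```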
The manifest I am using at the moment:

pod.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: envbuilder
spec:
  containers:
    - name: envbuilder
      image: ghcr.io/coder/envbuilder-preview
      env:
        - name: FALLBACK_IMAGE
          value: debian
        - name: INIT_SCRIPT
          value: sh -c 'while :; do sleep 86400; done'
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
      resources:
        limits:
          nvidia.com/gpu: "1"
      securityContext:
        privileged: true
```

If you are on GCP/GKE, the above should be valid for Ubuntu nodes (I think, I am not testing there). |
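For completeness, applying the manifest and checking the GPU from inside the workspace looks roughly like this (pod name matches the manifest above):

```bash
# Create the pod and wait for envbuilder to finish building the workspace.
kubectl apply -f pod.yaml
kubectl wait --for=condition=Ready pod/envbuilder --timeout=10m

# Confirm the driver is usable inside the built environment.
kubectl exec -it envbuilder -- nvidia-smi
```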
I've been experimenting with the envbuilder build process using NVIDIA GPU images on EKS, and so far the only way I can get it to succeed is by running the container as privileged. However, this introduces a significant issue: the container ends up seeing all GPUs on the host, even when only one is requested. This breaks GPU isolation and is especially problematic in multi-user environments where resource separation is critical. Here’s the …

I suspect that enabling … Has anyone found a way to: …

Any insights would be greatly appreciated! |
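For what it's worth, a quick way to show the isolation gap is to compare the requested limit with what the container actually enumerates (assuming the default device-plugin setup, which passes the allocation via NVIDIA_VISIBLE_DEVICES):

```bash
# Run inside a pod that requested nvidia.com/gpu: "1". With privileged: true,
# the container typically still enumerates every GPU on the node.
nvidia-smi -L | wc -l                     # GPUs visible to the container
ls /dev/nvidia[0-9]* 2>/dev/null          # GPU device nodes exposed
echo "${NVIDIA_VISIBLE_DEVICES:-unset}"   # allocation requested by the device plugin
```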
Some users will want to mount a GPU to an envbuilder-backed workspace. Can we investigate in which scenarios (if any) this works today and if/how we can patch upstream Kaniko to improve the experience?