
Commit 05d7a27

WIP - Configure the vllm deployment with best practices for startup
We want to recommend best practices for deployments of model servers under an InferencePool. Use the need to gracefully drain without client-visible errors during rollouts ("hitless" updates) as the motivation to annotate the YAML with strong opinions on best practices.
1 parent 03d8584 commit 05d7a27

1 file changed: +72 -6 lines changed


config/manifests/vllm/gpu-deployment.yaml

Lines changed: 72 additions & 6 deletions
@@ -46,26 +46,92 @@ spec:
         - containerPort: 8000
           name: http
           protocol: TCP
+        lifecycle:
+          preStop:
+            # vLLM stops accepting connections when it receives SIGTERM, so we need to sleep
+            # to give upstream gateways a chance to take us out of rotation. The time we wait
+            # depends on how long it takes for all upstreams to completely remove us from
+            # rotation. Older or simpler load balancers might take upwards of 30s, but we expect
+            # our deployment to run behind a modern gateway like Envoy, which is designed to
+            # probe for readiness aggressively.
+            sleep:
+              # Upstream gateway health probes should be set to a low period, such as 5s,
+              # and the tighter we can make that bound, the faster we release
+              # accelerators during controlled shutdowns.
+              seconds: 7
         livenessProbe:
-          failureThreshold: 240
           httpGet:
             path: /health
             port: http
             scheme: HTTP
-          initialDelaySeconds: 5
-          periodSeconds: 5
+          # vLLM's health check is simple, so we can probe it more aggressively. Liveness
+          # check endpoints should always be suitable for aggressive probing.
+          periodSeconds: 1
           successThreshold: 1
+          # vLLM has a very simple health implementation, which means that any failure is
+          # likely significant. However, any liveness-triggered restart requires the very
+          # large core model to be reloaded, so we should bias towards making sure the
+          # server is definitely unhealthy rather than restarting immediately. Use 5 failed
+          # attempts as evidence of a serious problem.
+          failureThreshold: 5
           timeoutSeconds: 1
         readinessProbe:
-          failureThreshold: 600
           httpGet:
             path: /health
             port: http
             scheme: HTTP
-          initialDelaySeconds: 5
-          periodSeconds: 5
+          # vLLM's health check is simple, so we can probe it more aggressively. Readiness
+          # check endpoints should always be suitable for aggressive probing, but may be
+          # slightly more expensive than liveness probes.
+          periodSeconds: 1
           successThreshold: 1
+          # vLLM has a very simple health implementation, which means that any failure is
+          # likely significant.
+          failureThreshold: 1
           timeoutSeconds: 1
+        # We set a startup probe so that we don't begin directing traffic to this instance
+        # until the model is loaded.
+        startupProbe:
+          # The failure threshold marks the point at which we believe startup will not happen
+          # at all, and is set to the maximum possible time we believe loading a model will
+          # take. In our default configuration we are downloading a model from HuggingFace,
+          # which may take a long time, and then the model must load into the accelerator.
+          # We choose 10 minutes as a reasonable maximum startup time before giving up and
+          # attempting to restart the pod.
+          #
+          # IMPORTANT: If the core model takes more than 10 minutes to load, pods will crash
+          # loop forever. Be sure to set this appropriately.
+          failureThreshold: 600
+          # Set the initial delay low so that if the base model changes to something smaller
+          # or an optimization is deployed, we don't wait unnecessarily.
+          initialDelaySeconds: 2
+          # Because a startup probe stops running once it succeeds, we can probe aggressively
+          # even for a moderately complex startup check - this is a very important workload.
+          periodSeconds: 1
+          exec:
+            # Verify that our core model is loaded before we consider startup successful.
+            # /health starts returning true very early in vLLM startup, but we want to
+            # consider ourselves started up only once the model has been loaded.
+            #
+            # vLLM should implement a readiness check that is only true once the model
+            # can begin serving; then this can be switched to an httpGet probe.
+            command:
+            - /bin/bash
+            - -c
+            - |
+              set -eu
+              if ! models="$( curl -q http://0.0.0.0:8000/v1/models )"; then
+                echo "server not responding"
+                exit 1
+              fi
+              # grep exits non-zero when the model id is absent from the response.
+              if ! echo "${models}" | grep -q "$1"; then
+                echo "model not found"
+                exit 1
+              fi
+              echo "ok"
+            - ''
+            - '"id":"meta-llama/Llama-2-7b-hf"'
         resources:
           limits:
             nvidia.com/gpu: 1
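
One closely related setting that the preStop sleep depends on: Kubernetes runs the preStop hook inside the pod's termination grace period, so terminationGracePeriodSeconds must cover the 7s sleep plus however long we are willing to let in-flight generations finish before the kubelet sends SIGKILL. A minimal sketch of the pod spec field; the 30s value is the Kubernetes default and is shown purely for illustration, not as part of this commit:

    spec:
      # Must exceed the preStop sleep (7s) plus the longest in-flight request we
      # want to let drain before the kubelet force-kills the container. 30s is
      # the Kubernetes default; tune it for long-running generations.
      terminationGracePeriodSeconds: 30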

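The exec-based startup probe is a stopgap: as the inline comment notes, once vLLM exposes a readiness endpoint that only reports success after the model is loaded, the check can be switched to an httpGet probe. A sketch of what that could look like; the /ready path is hypothetical and stands in for whatever endpoint vLLM eventually provides:

    startupProbe:
      httpGet:
        # Hypothetical model-aware readiness path - vLLM does not ship one in this
        # configuration; substitute the real endpoint once it exists.
        path: /ready
        port: http
        scheme: HTTP
      failureThreshold: 600
      initialDelaySeconds: 2
      periodSeconds: 1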