
Commit 05d7a27

WIP - Configure the vllm deployment with best practices for startup
We want to recommend best practices for deployments of model servers under an InferencePool. Use the need to gracefully drain without client-visible errors during rollouts ("hitless" updates) as the motivation to annotate the YAML with strong opinions on best practices.
1 parent 03d8584 commit 05d7a27

1 file changed: +72 -6 lines changed


config/manifests/vllm/gpu-deployment.yaml

Lines changed: 72 additions & 6 deletions
@@ -46,26 +46,92 @@ spec:
         - containerPort: 8000
           name: http
           protocol: TCP
+        lifecycle:
+          preStop:
+            # vLLM stops accepting connections when it receives SIGTERM, so we need to sleep
+            # to give upstream gateways a chance to take us out of rotation. The time we wait
+            # depends on how long it takes for all upstreams to completely remove us from
+            # rotation. Older or simpler load balancers might take upwards of 30s, but we expect
+            # our deployment to run behind a modern gateway like Envoy, which is designed to
+            # probe for readiness aggressively.
+            sleep:
+              # Upstream gateway health probes should be set to a low period, such as 5s,
+              # and the tighter we can make that bound, the faster we release
+              # accelerators during controlled shutdowns.
+              seconds: 7
         livenessProbe:
-          failureThreshold: 240
           httpGet:
             path: /health
             port: http
             scheme: HTTP
-          initialDelaySeconds: 5
-          periodSeconds: 5
+          # vLLM's health check is simple, so we can probe it more aggressively. Liveness
+          # check endpoints should always be suitable for aggressive probing.
+          periodSeconds: 1
           successThreshold: 1
+          # vLLM has a very simple health implementation, which means that any failure is
+          # likely significant. However, any liveness-triggered restart requires the very
+          # large core model to be reloaded, so we should bias towards making sure the
+          # server is definitely unhealthy rather than restarting immediately. Use 5 failed
+          # attempts as evidence of a serious problem.
+          failureThreshold: 5
           timeoutSeconds: 1
         readinessProbe:
-          failureThreshold: 600
           httpGet:
             path: /health
             port: http
             scheme: HTTP
-          initialDelaySeconds: 5
-          periodSeconds: 5
+          # vLLM's health check is simple, so we can probe it more aggressively. Readiness
+          # check endpoints should always be suitable for aggressive probing, but may be
+          # slightly more expensive than liveness probes.
+          periodSeconds: 1
           successThreshold: 1
+          # vLLM has a very simple health implementation, which means that any failure is
+          # likely significant.
+          failureThreshold: 1
           timeoutSeconds: 1
+        # We set a startup probe so that we don't begin directing traffic to this instance
+        # until the model is loaded.
+        startupProbe:
+          # The failure threshold marks the point at which we believe startup will not happen
+          # at all, and is set to the maximum possible time we believe loading a model will
+          # take. In our default configuration we are downloading a model from HuggingFace,
+          # which may take a long time, and then the model must load into the accelerator.
+          # We choose 10 minutes as a reasonable maximum startup time before giving up and
+          # attempting to restart the pod.
+          #
+          # IMPORTANT: If the core model takes more than 10 minutes to load, pods will crash
+          # loop forever. Be sure to set this appropriately.
+          failureThreshold: 600
+          # Set the initial delay low so that if the base model changes to something smaller
+          # or an optimization is deployed, we don't wait unnecessarily.
+          initialDelaySeconds: 2
+          # Because a startup probe stops running once it succeeds, we can probe aggressively
+          # even for a moderately complex startup check - this is a very important workload.
+          periodSeconds: 1
+          exec:
+            # Verify that our core model is loaded before we consider startup successful.
+            # /health starts returning true very early in vLLM startup, but we want to
+            # consider ourselves started up only once the model has been loaded.
+            #
+            # vLLM should implement a readiness check that is only true once the model
+            # can begin serving; then this can be switched to an httpGet probe.
+            command:
+            - /bin/bash
+            - -c
+            - |
+              set -eu
+              if ! models="$( curl -q http://0.0.0.0:8000/v1/models )"; then
+                echo "server not responding"
+                exit 1
+              fi
+              # grep exits non-zero when the model id is absent from the response.
+              if ! echo "${models}" | grep -q "$1"; then
+                echo "model not found"
+                exit 1
+              fi
+              echo "ok"
+            - ''
+            - '"id":"meta-llama/Llama-2-7b-hf"'
         resources:
           limits:
             nvidia.com/gpu: 1
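
One closely related setting that the preStop sleep depends on: Kubernetes runs the preStop hook inside the pod's termination grace period, so terminationGracePeriodSeconds must cover the 7s sleep plus however long we are willing to let in-flight generations finish before the kubelet sends SIGKILL. A minimal sketch of the pod spec field; the 30s value is the Kubernetes default and is shown purely for illustration, not as part of this commit:

    spec:
      # Must exceed the preStop sleep (7s) plus the longest in-flight request we
      # want to let drain before the kubelet force-kills the container. 30s is
      # the Kubernetes default; tune it for long-running generations.
      terminationGracePeriodSeconds: 30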

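The exec-based startup probe is a stopgap: as the inline comment notes, once vLLM exposes a readiness endpoint that only reports success after the model is loaded, the check can be switched to an httpGet probe. A sketch of what that could look like; the /ready path is hypothetical and stands in for whatever endpoint vLLM eventually provides:

    startupProbe:
      httpGet:
        # Hypothetical model-aware readiness path - vLLM does not ship one in this
        # configuration; substitute the real endpoint once it exists.
        path: /ready
        port: http
        scheme: HTTP
      failureThreshold: 600
      initialDelaySeconds: 2
      periodSeconds: 1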