
Commit 92431f5

liu-cong and ahg-g authored
Document model server compatibility and config options (#537)
* Document model server compatibility and config options

* Update config/charts/inferencepool/README.md

---------

Co-authored-by: Abdullah Gharaibeh <[email protected]>
1 parent 1ba13f3 commit 92431f5


6 files changed: +64 -4 lines changed


config/charts/inferencepool/README.md

+13 -1

@@ -2,7 +2,6 @@
 
 A chart to deploy an InferencePool and a corresponding EndpointPicker (epp) deployment.
 
-
 ## Install
 
 To install an InferencePool named `vllm-llama3-8b-instruct` that selects from endpoints with label `app: vllm-llama3-8b-instruct` and listening on port `8000`, you can run the following command:

@@ -23,6 +22,18 @@ $ helm install vllm-llama3-8b-instruct \
 
 Note that the provider name is needed to deploy provider-specific resources. If no provider is specified, then only the InferencePool object and the EPP are deployed.
 
+### Install for Triton TensorRT-LLM
+
+Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install for Triton TensorRT-LLM, e.g.,
+
+```txt
+$ helm install triton-llama3-8b-instruct \
+  --set inferencePool.modelServers.matchLabels.app=triton-llama3-8b-instruct \
+  --set inferencePool.modelServerType=triton-tensorrt-llm \
+  --set provider.name=[none|gke] \
+  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
+```
+
 ## Uninstall
 
 Run the following command to uninstall the chart:

@@ -38,6 +49,7 @@ The following table list the configurable parameters of the chart.
 | **Parameter Name** | **Description** |
 |--------------------|-----------------|
 | `inferencePool.targetPortNumber` | Target port number for the vllm backends, will be used to scrape metrics by the inference extension. Defaults to 8000. |
+| `inferencePool.modelServerType` | Type of the model servers in the pool, valid options are [vllm, triton-tensorrt-llm], default is vllm. |
 | `inferencePool.modelServers.matchLabels` | Label selector to match vllm backends managed by the inference pool. |
 | `inferenceExtension.replicas` | Number of replicas for the endpoint picker extension service. Defaults to `1`. |
 | `inferenceExtension.image.name` | Name of the container image used for the endpoint picker. |
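A quick way to see what `inferencePool.modelServerType` actually changes is to render the chart locally and look for the Triton-specific metric flags in the EPP container args. This is a hypothetical verification step, not part of the chart docs; it assumes Helm 3.8+ (for OCI support) and reuses the example release name and label from above:

```txt
# Render the chart without installing it, then filter for the metric flag names
$ helm template triton-llama3-8b-instruct \
  --set inferencePool.modelServers.matchLabels.app=triton-llama3-8b-instruct \
  --set inferencePool.modelServerType=triton-tensorrt-llm \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0 \
  | grep -A 1 Metric
```

If `inferencePool.modelServerType` is omitted, it falls back to `vllm` and no Triton flags appear in the rendered output.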

config/charts/inferencepool/templates/epp-deployment.yaml

+8 -1

@@ -35,6 +35,14 @@ spec:
         - "9003"
         - -metricsPort
         - "9090"
+        {{- if eq (.Values.inferencePool.modelServerType | default "vllm") "triton-tensorrt-llm" }}
+        - -totalQueuedRequestsMetric
+        - "nv_trt_llm_request_metrics{request_type=waiting}"
+        - -kvCacheUsagePercentageMetric
+        - "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
+        - -loraInfoMetric
+        - "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet.
+        {{- end }}
         ports:
         - name: grpc
           containerPort: 9002

@@ -54,4 +62,3 @@ spec:
             service: inference-extension
          initialDelaySeconds: 5
          periodSeconds: 10
-

config/charts/inferencepool/values.yaml

+1

@@ -9,6 +9,7 @@ inferenceExtension:
 
 inferencePool:
   targetPortNumber: 8000
+  modelServerType: vllm # vllm, triton-tensorrt-llm
   # modelServers: # REQUIRED
   #   matchLabels:
   #     app: vllm-llama3-8b-instruct
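Instead of repeating `--set` flags, the same chart options can be kept in a values file and passed with `helm install -f`. The sketch below is only an example; the file name and the `app` label are illustrative, while the keys are the ones defined in this values.yaml:

```yaml
# values-triton.yaml (example name), used as: helm install ... -f values-triton.yaml
inferencePool:
  targetPortNumber: 8000                 # port the inference extension scrapes for metrics
  modelServerType: triton-tensorrt-llm   # vllm (default) or triton-tensorrt-llm
  modelServers:
    matchLabels:
      app: triton-llama3-8b-instruct     # must match the model server pod labels
```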

mkdocs.yml

+3 -1

@@ -54,7 +54,9 @@ nav:
     API Overview: concepts/api-overview.md
     Conformance: concepts/conformance.md
     Roles and Personas: concepts/roles-and-personas.md
-  - Implementations: implementations.md
+  - Implementations:
+    - Gateways: implementations/gateways.md
+    - Model Servers: implementations/model-servers.md
   - FAQ: faq.md
   - Guides:
     - User Guides:

site-src/implementations.md renamed to site-src/implementations/gateways.md

+1 -1

@@ -1,4 +1,4 @@
-# Implementations
+# Gateway Implementations
 
 This project has several implementations that are planned or in progress:

site-src/implementations/model-servers.md

+38

@@ -0,0 +1,38 @@
+# Supported Model Servers
+
+Any model server that conforms to the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol) is supported by the inference extension.
+
+## Compatible Model Server Versions
+
+| Model Server | Version | Commit | Notes |
+| --- | --- | --- | --- |
+| vLLM V0 | v0.6.4 and above | [commit 0ad216f](https://github.com/vllm-project/vllm/commit/0ad216f5750742115c686723bf38698372d483fd) | |
+| vLLM V1 | v0.8.0 and above | [commit bc32bc7](https://github.com/vllm-project/vllm/commit/bc32bc73aad076849ac88565cff745b01b17d89c) | |
+| Triton (TensorRT-LLM) | [25.03](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-03.html#rel-25-03) and above | [commit 15cb989](https://github.com/triton-inference-server/tensorrtllm_backend/commit/15cb989b00523d8e92dce5165b9b9846c047a70d) | The LoRA affinity feature is not available because the required LoRA metrics have not been implemented in Triton yet. |
+
+## vLLM
+
+vLLM is configured as the default in the [endpoint picker extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp). No further configuration is required.
+
+## Triton with TensorRT-LLM Backend
+
+Triton-specific metric names need to be specified when starting the EPP.
+
+### Option 1: Use Helm
+
+Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install the [`inferencepool` via helm](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/charts/inferencepool). See the [`inferencepool` helm guide](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/charts/inferencepool/README.md) for more details.
+
+### Option 2: Edit the EPP deployment YAML
+
+Add the following to the `args` of the [EPP deployment](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/manifests/inferencepool-resources.yaml#L32):
+
+```
+- -totalQueuedRequestsMetric
+- "nv_trt_llm_request_metrics{request_type=waiting}"
+- -kvCacheUsagePercentageMetric
+- "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
+- -loraInfoMetric
+- "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet.
+```
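For Option 2, the snippet above ends up inside the EPP container's `args` list. Below is a minimal sketch of where it lands; the container name and any flags other than `-metricsPort` and the Triton metric flags documented here are illustrative, not copied from the manifest:

```yaml
# Abbreviated EPP Deployment spec -- a sketch only, not the full manifest.
spec:
  template:
    spec:
      containers:
      - name: epp                      # illustrative container name
        args:
        # ...existing EPP flags...
        - -metricsPort
        - "9090"
        # Triton-specific metric names from this commit:
        - -totalQueuedRequestsMetric
        - "nv_trt_llm_request_metrics{request_type=waiting}"
        - -kvCacheUsagePercentageMetric
        - "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
        - -loraInfoMetric
        - ""                           # empty value disables LoRA metric scraping (not supported by Triton yet)
```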
