Document model server compatibility and config options (#537)
* Document model server compatibility and config options
* Update config/charts/inferencepool/README.md
---------
Co-authored-by: Abdullah Gharaibeh <[email protected]>
config/charts/inferencepool/README.md (+13 -1)
A chart to deploy an InferencePool and a corresponding EndpointPicker (epp) deployment.
## Install
To install an InferencePool named `vllm-llama3-8b-instruct` that selects endpoints with the label `app: vllm-llama3-8b-instruct` listening on port `8000`, you can run the following command:
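A minimal sketch of that command, assuming the chart is pulled from the project's OCI registry (the release name and chart reference are assumptions; point them at wherever the `inferencepool` chart is published for your release):

```bash
# Sketch: the chart reference is an assumption; adjust to your environment.
helm install vllm-llama3-8b-instruct \
  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
  --set inferencePool.targetPortNumber=8000 \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool
```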
Note that the provider name is needed to deploy provider-specific resources. If no provider is specified, then only the InferencePool object and the EPP are deployed.
### Install for Triton TensorRT-LLM
Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install for Triton TensorRT-LLM, e.g.,
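A sketch of such an install (the release name and chart reference are assumptions; reuse whatever chart source you used above):

```bash
# Sketch: same chart install as above, switching the pool to Triton TensorRT-LLM backends.
helm install triton-llama3-8b-instruct \
  --set inferencePool.modelServerType=triton-tensorrt-llm \
  --set inferencePool.modelServers.matchLabels.app=triton-llama3-8b-instruct \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool
```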
| Parameter Name | Description |
| --- | --- |
| `inferencePool.targetPortNumber` | Target port number for the vllm backends; used by the inference extension to scrape metrics. Defaults to `8000`. |
| `inferencePool.modelServerType` | Type of the model servers in the pool. Valid options are `vllm` and `triton-tensorrt-llm`; defaults to `vllm`. |
| `inferencePool.modelServers.matchLabels` | Label selector to match vllm backends managed by the inference pool. |
| `inferenceExtension.replicas` | Number of replicas for the endpoint picker extension service. Defaults to `1`. |
| `inferenceExtension.image.name` | Name of the container image used for the endpoint picker. |
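The parameter paths above map directly onto chart values, so they can also be set from an override file; a hypothetical example using only the parameters listed (the image name shown is a placeholder):

```yaml
# values.override.yaml -- hypothetical overrides built from the parameters above.
inferencePool:
  targetPortNumber: 8000
  modelServerType: vllm            # or triton-tensorrt-llm
  modelServers:
    matchLabels:
      app: vllm-llama3-8b-instruct
inferenceExtension:
  replicas: 1
  image:
    name: epp                      # placeholder; use the endpoint picker image for your release
```

Pass it to the install with `helm install ... -f values.override.yaml`.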
Any model server that conforms to the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol) is supported by the inference extension.
| Model Server | Version | Commit | Notes |
| --- | --- | --- | --- |
| vLLM V0 | v0.6.4 and above | [commit 0ad216f](https://github.com/vllm-project/vllm/commit/0ad216f5750742115c686723bf38698372d483fd) | |
| vLLM V1 | v0.8.0 and above | [commit bc32bc7](https://github.com/vllm-project/vllm/commit/bc32bc73aad076849ac88565cff745b01b17d89c) | |
| Triton (TensorRT-LLM) | [25.03](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-03.html#rel-25-03) and above | [commit 15cb989](https://github.com/triton-inference-server/tensorrtllm_backend/commit/15cb989b00523d8e92dce5165b9b9846c047a70d) | LoRA affinity feature is not available as the required LoRA metrics haven't been implemented in Triton yet. |
## vLLM
vLLM is configured as the default in the [endpoint picker extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp). No further configuration is required.
## Triton with TensorRT-LLM Backend
Triton-specific metric names need to be specified when starting the EPP.
### Option 1: Use Helm
Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install the [`inferencepool` via helm](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/charts/inferencepool). See the [`inferencepool` helm guide](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/charts/inferencepool/README.md) for more details.
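A sketch of the install, assuming the same chart source as the helm guide (the release name is illustrative):

```bash
# Sketch: only the modelServerType flag differs from a default vLLM install.
helm install triton-llama3-8b-instruct \
  --set inferencePool.modelServerType=triton-tensorrt-llm \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool
```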
### Option 2: Edit EPP deployment yaml
Add the following to the `args` of the [EPP deployment](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/manifests/inferencepool-resources.yaml#L32):
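The exact arguments are not reproduced in this diff; a sketch of what the additions look like, assuming the EPP's `-totalQueuedRequestsMetric`, `-kvCacheUsagePercentageMetric`, and `-loraInfoMetric` flags and Triton's TensorRT-LLM metric names (verify both against your EPP and Triton versions):

```yaml
# Sketch only: flag names and Triton metric names are assumptions to verify.
args:
  # ... existing EPP args ...
  - -totalQueuedRequestsMetric
  - "nv_trt_llm_request_metrics{request_type=waiting}"
  - -kvCacheUsagePercentageMetric
  - "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
  # Leave the LoRA metric empty: Triton does not expose the required LoRA metrics yet.
  - -loraInfoMetric
  - ""
```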