# Getting started with Gateway API Inference Extension

This quickstart guide is intended for engineers familiar with k8s and model servers (vLLM in this instance). The goal of this guide is to get a first, single InferencePool up and running!

### Requirements
 - Envoy Gateway [v1.2.1](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher
 - A cluster with:
   - Support for Services of type `LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running; see the quick check after this list.) For example, with Kind,
     you can follow [these steps](https://kind.sigs.k8s.io/docs/user/loadbalancer).
   - 3 GPUs to run the sample model server. Adjust the number of replicas in `./manifests/vllm/deployment.yaml` as needed.

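A quick check that Envoy Gateway is up and running (a sketch; it assumes Envoy Gateway was installed into its default `envoy-gateway-system` namespace):

```bash
# The Envoy Gateway controller should be deployed and Available before you start.
kubectl get deployment envoy-gateway -n envoy-gateway-system
kubectl get pods -n envoy-gateway-system
```
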
### Steps

1. **Deploy Sample Model Server**

   Create a Hugging Face secret to download the model [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf). Ensure that the token grants access to this model.
   Deploy a sample vLLM Deployment configured with the protocol needed to work with the LLM Instance Gateway.
   ```bash
   kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Llama2
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/vllm/deployment.yaml
   ```
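
   The vLLM pods can take several minutes to pull the image and load the model. Before moving on, you can confirm they are Running and Ready:

   ```bash
   # Watch the sample model server pods until they report Ready.
   kubectl get pods -w
   ```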

1. **Install the Inference Extension CRDs:**

   ```sh
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.1.0/manifests.yaml
   ```
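
   To confirm the CRDs are registered with the API server (a quick check; the grep is just a convenience filter):

   ```sh
   kubectl get crd | grep -i inference
   ```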

1. **Deploy InferenceModel**

   Deploy the sample InferenceModel which is configured to load balance traffic between the `tweet-summary-0` and `tweet-summary-1`
   [LoRA adapters](https://docs.vllm.ai/en/latest/features/lora.html) of the sample model server.
   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/inferencemodel.yaml
   ```
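
   You can verify that the resource was created (the `inferencemodels` resource name comes from the CRDs installed earlier):

   ```bash
   kubectl get inferencemodels
   ```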

1. **Update Envoy Gateway Config to enable Patch Policy**

   Our custom LLM Gateway ext-proc is patched into the existing Envoy Gateway via `EnvoyPatchPolicy`. To enable this feature, extend the Envoy Gateway config map by running:
   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/enable_patch_policy.yaml
   kubectl rollout restart deployment envoy-gateway -n envoy-gateway-system
   ```
   Additionally, if you would like to enable the admin interface, uncomment the admin lines in that manifest and apply it again.
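
   You can confirm the restarted deployment has rolled out before continuing:

   ```bash
   kubectl rollout status deployment/envoy-gateway -n envoy-gateway-system
   ```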

1. **Deploy Gateway**

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/gateway.yaml
   ```
   > **_NOTE:_** This file couples together the gateway infra and the HTTPRoute infra for a convenient, quick startup. Creating additional/different InferencePools on the same gateway will require an additional set of: `Backend`, `HTTPRoute`, the resources included in the `./manifests/gateway/ext-proc.yaml` file, and an additional `./manifests/gateway/patch_policy.yaml` file. ***Should you choose to experiment, familiarity with xDS and Envoy is very useful.***

   Confirm that the Gateway was assigned an IP address and reports a `Programmed=True` status:
   ```bash
   $ kubectl get gateway inference-gateway
   NAME                CLASS               ADDRESS         PROGRAMMED   AGE
   inference-gateway   inference-gateway   <MY_ADDRESS>    True         22s
   ```

1. **Deploy the Inference Extension and InferencePool**

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/ext_proc.yaml
   ```
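
   You can confirm the InferencePool was created (the ext-proc Deployment it references is defined in the same manifest, so a generic listing avoids guessing its exact name):

   ```bash
   kubectl get inferencepools
   kubectl get deployments
   ```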

1. **Deploy Envoy Gateway Custom Policies**

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/extension_policy.yaml
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/patch_policy.yaml
   ```
   > **_NOTE:_** These policies are also per InferencePool, and will need to be configured to support any new pool should you wish to experiment further.

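   These are Envoy Gateway resources, so you can check whether the gateway accepted them (a sketch, assuming the manifests create `EnvoyExtensionPolicy` and `EnvoyPatchPolicy` resources as their names suggest; look for an `Accepted` condition in each resource's status):

   ```bash
   # describe prints the status conditions reported by Envoy Gateway.
   kubectl describe envoyextensionpolicies
   kubectl describe envoypatchpolicies
   ```
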
1. **OPTIONALLY**: Apply Traffic Policy

   For high-traffic benchmarking, you can apply this manifest to avoid defaults that can cause timeouts/errors.

   ```bash
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/traffic_policy.yaml
   ```

1. **Try it out**

   Wait until the gateway is ready.

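   A sketch of one way to wait, using `kubectl wait` on the `Programmed` condition checked in the earlier step:

   ```bash
   # Block until the Gateway reports Programmed=True (or give up after 5 minutes).
   kubectl wait gateway/inference-gateway --for=condition=Programmed --timeout=300s
   ```

   Then send a sample completion request:
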
   ```bash
   IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
   PORT=8081

   curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
   "model": "tweet-summary",
   "prompt": "Write as if you were a critic: San Francisco",
   "max_tokens": 100,
   "temperature": 0
   }'
   ```