* Slight cleanup of some of our READMEs
* Testing site build issue
* Adding a note that Envoy Gateway must be up and running before using anything that depends on it
* Feedback fixes
* Restructuring and feedback comments
* Removing `make install`
**README.md** (+3, -15)
````diff
@@ -8,25 +8,13 @@ This extension is intended to provide value to multiplexed LLM services on a sha
 
 This project is currently in development.
 
-For more rapid testing, our PoC is in the `./examples/` dir.
-
-
 ## Getting Started
 
-**Install the CRDs into the cluster:**
-
-```sh
-make install
-```
-
-**Delete the APIs (CRDs) from the cluster:**
+Follow this [README](./pkg/README.md) to get the inference-extension up and running on your cluster!
 
-```sh
-make uninstall
-```
+## Website
 
-**Deploying the ext-proc image**
-Refer to this [README](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/README.md) on how to deploy the Ext-Proc image.
+Detailed documentation is available on our website: https://gateway-api-inference-extension.sigs.k8s.io/
````
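For context on what the removed targets did: in kubebuilder-style projects, `make install`/`make uninstall` conventionally pipe the CRD kustomization into kubectl. A minimal sketch of that convention (assuming the standard `config/crd` layout, which this repo may not use; its Makefile is the source of truth):

```bash
# Conventional kubebuilder recipes for the removed targets (assumed layout:
# CRDs under config/crd); check the repo's Makefile for the real commands.
kustomize build config/crd | kubectl apply -f -    # what `make install` typically runs
kustomize build config/crd | kubectl delete -f -   # what `make uninstall` typically runs
```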
The second file in the diff (the quickstart guide, presumably the `./pkg/README.md` that the new Getting Started link points at) changes as follows:

````diff
 This quickstart guide is intended for engineers familiar with k8s and model servers (vLLM in this instance). The goal of this guide is to get a first, single InferencePool up and running!
+
 ### Requirements
-The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher.
+- Envoy Gateway [v1.2.1](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher
+- A cluster that has built-in support for `ServiceType=LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running)
+  - For example, with Kind, you can follow these steps: https://kind.sigs.k8s.io/docs/user/loadbalancer
 
 ### Steps
 
````
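A quick way to satisfy and validate both requirements at once is to install Envoy Gateway and wait for it to come up; a sketch following the Envoy Gateway install docs linked above (verify the chart version against those docs):

```bash
# Install Envoy Gateway (version per the install docs linked in the diff
# above) and wait until its controller is Available; if it never becomes
# Ready, nothing below that depends on Envoy Gateway will work.
helm install eg oci://docker.io/envoyproxy/gateway-helm --version v1.2.1 \
  --namespace envoy-gateway-system --create-namespace
kubectl wait --timeout=5m --namespace envoy-gateway-system \
  deployment/envoy-gateway --condition=Available
```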
````diff
@@ -11,30 +15,40 @@ The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.
 Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway.
 ```bash
 kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Llama2
 ```
 
 1. **Update Envoy Gateway Config to enable Patch Policy**
 
    Our custom LLM Gateway ext-proc is patched into the existing envoy gateway via `EnvoyPatchPolicy`. To enable this feature, we must extend the Envoy Gateway config map. To do this, simply run:
 
    Additionally, if you would like to enable the admin interface, you can uncomment the admin lines and run this again.
 
 1. **Deploy Gateway**
 
    ```bash
-   kubectl apply -f ./manifests/gateway.yaml
+   kubectl apply -f ./manifests/gateway/gateway.yaml
    ```
+   > **_NOTE:_** This file couples together the gateway infra and the HTTPRoute infra for a convenient, quick startup. Creating additional/different InferencePools on the same gateway will require an additional set of: `Backend`, `HTTPRoute`, the resources included in the `./manifests/gateway/ext-proc.yaml` file, and an additional `./manifests/gateway/patch_policy.yaml` file. ***Should you choose to experiment, familiarity with xDS and Envoy is very useful.***
 
 1. **Deploy Ext-Proc**
````
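The `EnvoyPatchPolicy` feature that the patch-policy step relies on is disabled by default in Envoy Gateway, and the config-map extension the step refers to boils down to setting one flag. A minimal sketch (resource and field names follow the Envoy Gateway API; the actual manifest shipped in this repo's `./manifests/gateway/` directory may differ):

```bash
# Extend the Envoy Gateway config map so EnvoyPatchPolicy resources are
# honored; without this flag the ext-proc patch is ignored.
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-gateway-config
  namespace: envoy-gateway-system
data:
  envoy-gateway.yaml: |
    apiVersion: gateway.envoyproxy.io/v1alpha1
    kind: EnvoyGateway
    provider:
      type: Kubernetes
    gateway:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    extensionApis:
      enableEnvoyPatchPolicy: true
EOF
# Restart the controller so it picks up the new config.
kubectl rollout restart -n envoy-gateway-system deployment/envoy-gateway
```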
````diff
@@ -45,8 +59,17 @@ The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.
@@ -63,10 +86,4 @@ The current manifests rely on Envoy Gateway [v1.2.1](https://gateway.envoyproxy.
   "max_tokens": 100,
   "temperature": 0
 }'
 ```
-
-## Scheduling Package in Ext Proc
-The scheduling package implements request scheduling algorithms for load balancing requests across backend pods in an inference gateway. The scheduler ensures efficient resource utilization while maintaining low latency and prioritizing critical requests. It applies a series of filters based on metrics and heuristics to select the best pod for a given request.
````
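The `"max_tokens"`/`"temperature"` lines above are the tail of the quickstart's completion request; the beginning of the command is not captured in this diff. A hypothetical request of the same shape (the gateway address, port, and model name here are placeholders, not values from the repo):

```bash
# Hypothetical completion request matching the JSON tail shown above;
# replace $GATEWAY_IP, the port, and the model name with your own values.
curl -i "$GATEWAY_IP:8081/v1/completions" \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "your-model-name",
    "prompt": "San Francisco is a",
    "max_tokens": 100,
    "temperature": 0
  }'
```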