This project offers tools for AI Inference, enabling developers to build [Inference Gateways].
Gateway API Inference Extension optimizes self-hosting Generative Models on Kubernetes.
This is achieved by leveraging Envoy's [External Processing] (ext-proc) to extend any gateway that supports both ext-proc and [Gateway API] into an **[inference gateway]**.

[Inference Gateways]:#concepts-and-definitions
[Inference Gateway]:#concepts-and-definitions
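For a concrete picture, here is a minimal sketch of the wiring: a standard Gateway API `HTTPRoute` sends matching traffic to an `InferencePool` (defined in the next section) rather than a plain `Service`. The resource names and the API group/version are illustrative assumptions based on the alpha APIs and may differ between releases:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway        # an existing ext-proc-capable Gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool          # backend is a pool of model servers,
      name: vllm-llama3-pool       # not a plain Kubernetes Service
```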
## Concepts and Definitions

The following are terms specific to this project:

- **Inference Gateway (IGW)**: A proxy/load-balancer which has been coupled with an `Endpoint Picker`. It provides optimized routing and load balancing for workloads. It simplifies the deployment, management, and observability of AI inference workloads.
- **Inference Scheduler**: An extendable component that makes decisions about which endpoint is optimal (best cost / best performance) for an inference request based on `Metrics and Capabilities` from [Model Serving](/docs/proposals/003-model-server-protocol/README.md).
- **Metrics and Capabilities**: Data provided by model serving platforms about performance, availability and capabilities to optimize routing. Includes things like [Prefix Cache] status or [LoRA Adapters] availability.
- **Endpoint Picker (EPP)**: An implementation of an `Inference Scheduler` with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP is available [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).
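To sketch how these pieces connect, an `InferencePool` can select the model server pods and attach the Endpoint Picker through `extensionRef`. The field names assume the v1alpha2 API, and the pool, selector, and Service names are illustrative:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-pool
spec:
  targetPortNumber: 8000       # port the model servers listen on
  selector:
    app: vllm-llama3           # labels of the model server pods
  extensionRef:
    name: vllm-llama3-epp      # Service fronting the Endpoint Picker (ext-proc)
```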

The following are key industry terms that are important to understand for this project:

- **Model**: A generative AI model that has learned patterns from data and is used for inference.
- **Accelerator**: Specialized hardware, such as Graphics Processing Units (GPUs), that can be attached to Kubernetes nodes to speed up computations, particularly for training and inference tasks.
This extension upgrades an [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter)-capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an **[inference gateway]** - supporting inference platform teams self-hosting Generative Models (with a current focus on large language models) on Kubernetes. This integration makes it easy to expose and control access to your local [OpenAI-compatible chat completion endpoints](https://platform.openai.com/docs/api-reference/chat) to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher-level **AI Gateway** like LiteLLM, Solo AI Gateway, or Apigee.
The Inference Gateway:
* Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extensible request scheduling algorithm that is kv-cache and request cost aware, avoiding evictions or queueing as load increases
* Provides [Kubernetes-native declarative APIs](https://gateway-api-inference-extension.sigs.k8s.io/concepts/api-overview/) to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades (see the sketch below)
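
As a rough sketch of those declarative APIs, an `InferenceModel` (assuming the v1alpha2 API; names and weights are illustrative) maps the model name a client requests onto weighted LoRA adapter versions, which is how incremental rollouts and A/B splits are expressed:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot           # model name clients use in the request body
  criticality: Critical
  poolRef:
    name: vllm-llama3-pool     # InferencePool serving this model
  targetModels:                # weighted split across LoRA adapter versions
  - name: chatbot-lora-v1
    weight: 90
  - name: chatbot-lora-v2
    weight: 10
```

Shifting weight between `chatbot-lora-v1` and `chatbot-lora-v2` rolls a new adapter out incrementally without clients changing the model name they request.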
As this pattern stabilizes, we expect a wide set of Gateway API [implementations](https://gateway-api.sigs.k8s.io/implementations/) to support this project.
### Endpoint Picker
As part of this project, we've built the Endpoint Picker: a pluggable & extensible ext-proc deployment that implements [this architecture](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).
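
A minimal sketch of running the Endpoint Picker next to a pool might look like the following; the image path, tag, and gRPC port are assumptions here, so consult the project's release manifests for the real values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-epp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-epp
  template:
    metadata:
      labels:
        app: vllm-llama3-epp
    spec:
      containers:
      - name: epp
        # image path and tag are illustrative; see the release manifests
        image: registry.k8s.io/gateway-api-inference-extension/epp:main
        ports:
        - containerPort: 9002   # assumed ext-proc gRPC port
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-epp         # referenced by InferencePool.spec.extensionRef
spec:
  selector:
    app: vllm-llama3-epp
  ports:
  - port: 9002
    targetPort: 9002
```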