
Commit c2e3fa9

Updating Readme (#831)
1 parent 409fc3f

2 files changed (+27, -27 lines)


README.md

Lines changed: 25 additions & 23 deletions
@@ -2,15 +2,33 @@
 [![Go Reference](https://pkg.go.dev/badge/sigs.k8s.io/gateway-api-inference-extension.svg)](https://pkg.go.dev/sigs.k8s.io/gateway-api-inference-extension)
 [![License](https://img.shields.io/github/license/kubernetes-sigs/gateway-api-inference-extension)](/LICENSE)
 
-# Gateway API Inference Extension (GIE)
+# Gateway API Inference Extension
 
-This project offers tools for AI Inference, enabling developers to build [Inference Gateways].
+Gateway API Inference Extension optimizes self-hosting Generative Models on Kubernetes.
+This is achieved by leveraging Envoy's [External Processing] (ext-proc) to extend any gateway that supports both ext-proc and [Gateway API] into an **[inference gateway]**.
 
-[Inference Gateways]:#concepts-and-definitions
+
+[Inference Gateway]:#concepts-and-definitions
 
 ## Concepts and Definitions
 
-The following are some key industry terms that are important to understand for
+The following are terms specific to this project:
+
+- **Inference Gateway (IGW)**: A proxy/load-balancer which has been coupled with an
+  `Endpoint Picker`. It provides optimized routing and load balancing for
+  serving Kubernetes self-hosted generative Artificial Intelligence (AI)
+  workloads. It simplifies the deployment, management, and observability of AI
+  inference workloads.
+- **Inference Scheduler**: An extendable component that makes decisions about which endpoint is optimal (best cost /
+  best performance) for an inference request based on `Metrics and Capabilities`
+  from [Model Serving](/docs/proposals/003-model-server-protocol/README.md).
+- **Metrics and Capabilities**: Data provided by model serving platforms about
+  performance, availability and capabilities to optimize routing. Includes
+  things like [Prefix Cache] status or [LoRA Adapters] availability.
+- **Endpoint Picker (EPP)**: An implementation of an `Inference Scheduler` with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).
+
+
+The following are key industry terms that are important to understand for
 this project:
 
 - **Model**: A generative AI model that has learned patterns from data and is
@@ -26,36 +44,20 @@ this project:
 (GPUs) that can be attached to Kubernetes nodes to speed up computations,
 particularly for training and inference tasks.
 
-And the following are more specific terms to this project:
-
-- **Scheduler**: Makes decisions about which endpoint is optimal (best cost /
-  best performance) for an inference request based on `Metrics and Capabilities`
-  from [Model Serving](/docs/proposals/003-model-server-protocol/README.md).
-- **Metrics and Capabilities**: Data provided by model serving platforms about
-  performance, availability and capabilities to optimize routing. Includes
-  things like [Prefix Cache] status or [LoRA Adapters] availability.
-- **Endpoint Selector**: A `Scheduler` combined with `Metrics and Capabilities`
-  systems is often referred to together as an [Endpoint Selection Extension]
-  (this is also sometimes referred to as an "endpoint picker", or "EPP").
-- **Inference Gateway**: A proxy/load-balancer which has been coupled with a
-  `Endpoint Selector`. It provides optimized routing and load balancing for
-  serving Kubernetes self-hosted generative Artificial Intelligence (AI)
-  workloads. It simplifies the deployment, management, and observability of AI
-  inference workloads.
 
 For deeper insights and more advanced concepts, refer to our [proposals](/docs/proposals).
 
 [Inference]:https://www.digitalocean.com/community/tutorials/llm-inference-optimization
 [Gateway API]:https://github.com/kubernetes-sigs/gateway-api
 [Prefix Cache]:https://docs.vllm.ai/en/stable/design/v1/prefix_caching.html
 [LoRA Adapters]:https://docs.vllm.ai/en/stable/features/lora.html
-[Endpoint Selection Extension]:https://gateway-api-inference-extension.sigs.k8s.io/#endpoint-selection-extension
+[External Processing]:https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter
 
 ## Technical Overview
 
-This extension upgrades an [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter)-capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an **inference gateway** - supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose and control access to your local [OpenAI-compatible chat completion endpoints](https://platform.openai.com/docs/api-reference/chat) to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level **AI Gateway** like LiteLLM, Solo AI Gateway, or Apigee.
+This extension upgrades an [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an **[inference gateway]** - supporting inference platform teams self-hosting Generative Models (with a current focus on large language models) on Kubernetes. This integration makes it easy to expose and control access to your local [OpenAI-compatible chat completion endpoints](https://platform.openai.com/docs/api-reference/chat) to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level **AI Gateway** like LiteLLM, Solo AI Gateway, or Apigee.
 
-The inference gateway:
+The Inference Gateway:
 
 * Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extensible request scheduling algorithm that is kv-cache and request cost aware, avoiding evictions or queueing as load increases
 * Provides [Kubernetes-native declarative APIs](https://gateway-api-inference-extension.sigs.k8s.io/concepts/api-overview/) to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades
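
To make the first bullet concrete, here is a minimal, self-contained sketch (in Go, the project's language) of the kind of cost-aware endpoint choice an inference scheduler makes. It is an illustration only, not the project's scheduler: the `EndpointMetrics` fields, the weights, and the adapter name are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// EndpointMetrics is a hypothetical snapshot of the signals a model server
// can expose: queue depth, KV-cache utilization, and loaded LoRA adapters.
type EndpointMetrics struct {
	Address        string
	QueueDepth     int     // requests waiting on the model server
	KVCacheUsage   float64 // 0.0 - 1.0
	ActiveAdapters map[string]bool
}

// score returns a lower-is-better cost for routing a request to this endpoint.
// The weights are arbitrary and only illustrate the cost/performance trade-off.
func score(m EndpointMetrics, adapter string) float64 {
	cost := float64(m.QueueDepth) + 10*m.KVCacheUsage
	if adapter != "" && !m.ActiveAdapters[adapter] {
		cost += 5 // penalize endpoints that would have to load the adapter first
	}
	return cost
}

// pickEndpoint mimics the scheduler's job: choose the endpoint with the best score.
func pickEndpoint(endpoints []EndpointMetrics, adapter string) (string, error) {
	best, bestCost := "", math.Inf(1)
	for _, m := range endpoints {
		if c := score(m, adapter); c < bestCost {
			best, bestCost = m.Address, c
		}
	}
	if best == "" {
		return "", fmt.Errorf("no endpoints available")
	}
	return best, nil
}

func main() {
	endpoints := []EndpointMetrics{
		{Address: "10.0.0.1:8000", QueueDepth: 4, KVCacheUsage: 0.9},
		{Address: "10.0.0.2:8000", QueueDepth: 1, KVCacheUsage: 0.3,
			ActiveAdapters: map[string]bool{"sql-lora-v2": true}},
	}
	ep, _ := pickEndpoint(endpoints, "sql-lora-v2")
	fmt.Println("routing to", ep) // 10.0.0.2:8000
}
```

In the actual Endpoint Picker this decision is driven by the metrics defined in the [Model Serving](/docs/proposals/003-model-server-protocol/README.md) protocol and is pluggable, as described in the EPP architecture proposal linked above.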

site-src/index.md

Lines changed: 2 additions & 4 deletions
@@ -44,11 +44,9 @@ implementations](https://gateway-api.sigs.k8s.io/implementations/). As this
 pattern stabilizes, we expect a wide set of these implementations to support
 this project.
 
-### Endpoint Selection Extension
+### Endpoint Picker
 
-As part of this project, we're building an initial reference extension. Over
-time, we hope to see a wide variety of extensions emerge that follow this
-pattern and provide a wide range of choices.
+As part of this project, we've built the Endpoint Picker: a pluggable and extensible ext-proc deployment that implements [this architecture](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).
 
 ### Model Server Frameworks
 