[](https://pkg.go.dev/sigs.k8s.io/gateway-api-inference-extension)
[](/LICENSE)

# Gateway API Inference Extension (GIE)

This project offers tools for AI Inference, enabling developers to build [Inference Gateways].

[Inference Gateways]:#concepts-and-definitions

## Concepts and Definitions

The following are some key industry terms that are important to understand for
this project:

- **Model**: A generative AI model that has learned patterns from data and is
  used for inference. Models vary in size and architecture, from smaller
  domain-specific models to massive multi-billion-parameter neural networks that
  are optimized for diverse language tasks.
- **Inference**: The process of running a generative AI model, such as a large
  language model or diffusion model, to generate text, embeddings, or other
  outputs from input data.
- **Model server**: A service (in our case, containerized) responsible for
  receiving inference requests and returning predictions from a model.
- **Accelerator**: Specialized hardware, such as Graphics Processing Units
  (GPUs), that can be attached to Kubernetes nodes to speed up computations,
  particularly for training and inference tasks.

And the following are terms more specific to this project:

- **Scheduler**: Makes decisions about which endpoint is optimal (best cost /
  best performance) for an inference request, based on `Metrics and Capabilities`
  from [Model Serving](/docs/proposals/003-model-server-protocol/README.md).
- **Metrics and Capabilities**: Data provided by model serving platforms about
  performance, availability, and capabilities, used to optimize routing. This
  includes things like [Prefix Cache] status or [LoRA Adapters] availability.
- **Endpoint Selector**: A `Scheduler` combined with `Metrics and Capabilities`
  systems, together often referred to as an [Endpoint Selection Extension]
  (also sometimes called an "endpoint picker", or "EPP").
- **Inference Gateway**: A proxy/load-balancer coupled with an
  `Endpoint Selector`. It provides optimized routing and load balancing for
  serving Kubernetes self-hosted generative Artificial Intelligence (AI)
  workloads, and simplifies the deployment, management, and observability of AI
  inference workloads.

For deeper insights and more advanced concepts, refer to our [proposals](/docs/proposals).

[Inference]:https://www.digitalocean.com/community/tutorials/llm-inference-optimization
[Gateway API]:https://github.com/kubernetes-sigs/gateway-api
[Prefix Cache]:https://docs.vllm.ai/en/stable/design/v1/prefix_caching.html
[LoRA Adapters]:https://docs.vllm.ai/en/stable/features/lora.html
[Endpoint Selection Extension]:https://gateway-api-inference-extension.sigs.k8s.io/#endpoint-selection-extension

## Technical Overview

This extension upgrades an [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter)-capable proxy or gateway, such as Envoy Gateway, kGateway, or the GKE Gateway, into an **inference gateway**, supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose and control access to your local [OpenAI-compatible chat completion endpoints](https://platform.openai.com/docs/api-reference/chat) for other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher-level **AI Gateway** like LiteLLM, Solo AI Gateway, or Apigee.