Commit 2d325e8: Update README.md
Parent: 3f1851a

1 file changed: +2, -2 lines

README.md

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
 # Gateway API Inference Extension
 
-The Gateway API Inference Extension - also known as an inference gateway - improves the tail latency and throughput of OpenAI completion requests when load balancing a group of LLM servers on Kubernetes with kv-cache awareness. It provides Kubernetes-native declarative APIs to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades. By adding operational guardrails like priority and fairness to different client model names, the inference gateway allows a platform team to safely serve many different GenAI workloads on the same pool of shared foundation model servers for higher utilization and fewer required accelerators.
+The Gateway API Inference Extension - also known as an inference gateway - improves the tail latency and throughput of LLM completion requests in the OpenAI protocol against Kubernetes-hosted model servers. It provides Kubernetes-native declarative APIs to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades. By adding operational guardrails like priority and fairness to different client model names, the inference gateway allows a platform team to safely serve many different GenAI workloads on the same pool of shared foundation model servers for higher utilization and fewer required accelerators.
 
-The inference gateway is intended for inference platform teams serving self-hosted large language models on Kubernetes. It requires a version of vLLM that supports the necessary metrics to predict traffic. It extends a cluster-local gateway supporting [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) such as Envoy Gateway, kGateway, or the GKE Gateway. The HttpRoute that accepts OpenAI-compatible requests and serves model responses can then be configured as a model provider underneath a higher level AI-Gateway like LiteLLM, Solo AI Gateway, or Apigee, allowing you to integrate local serving with model-as-a-service consumption.
+The inference gateway is intended for inference platform teams serving self-hosted large language models on Kubernetes. It requires a version of vLLM that supports the necessary metrics to predict traffic. It extends a cluster-local gateway supporting [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) - such as Envoy Gateway, kGateway, or the GKE Gateway - with a request scheduling algorithm that is both kv-cache and request weight and priority aware, avoiding evictions or queueing when model servers are highly loaded. The HttpRoute that accepts OpenAI-compatible requests and serves model responses can then be configured as a model provider underneath a higher level AI-Gateway like LiteLLM, Solo AI Gateway, or Apigee, allowing you to integrate local serving with model-as-a-service consumption.
 
 See our website at https://gateway-api-inference-extension.sigs.k8s.io/ for detailed API documentation.
 
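The first paragraph of the updated README describes routing client model names to use-case specific LoRA adapters, weighted A/B traffic splitting, and per-workload priority. A minimal sketch of how that is expressed declaratively is shown below, using the project's InferenceModel resource. The model and adapter names, pool name, and weights are hypothetical, and the API group, version, and field names follow the project's alpha API and may differ in the release you install; see the website linked in the diff for the authoritative reference.

```yaml
# Illustrative sketch only (hypothetical names; alpha API fields may differ by release).
# Routes the client-facing model name "tweet-summarizer" onto two LoRA adapters served
# by the same InferencePool, splitting traffic 90/10 for an incremental adapter rollout.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: tweet-summarizer
spec:
  modelName: tweet-summarizer        # model name clients send in OpenAI-protocol requests
  criticality: Critical              # priority guardrail relative to other workloads on the pool
  poolRef:
    name: vllm-llama3-8b-instruct    # shared pool of base model servers
  targetModels:
  - name: tweet-summarizer-lora-v1   # current adapter version
    weight: 90
  - name: tweet-summarizer-lora-v2   # candidate adapter version under A/B test
    weight: 10
```

Adjusting the weights over time gives the incremental adapter rollout described above, while criticality expresses the priority guardrail relative to other workloads sharing the pool.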

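The second paragraph describes extending an ext-proc capable, cluster-local gateway and exposing an HttpRoute that accepts OpenAI-compatible requests. A rough sketch of that wiring, under the same caveats (hypothetical names, labels, and ports; alpha API fields that may change), pairs an InferencePool selecting the vLLM pods and referencing the endpoint-picker extension with an HTTPRoute whose backendRef points at that pool:

```yaml
# Illustrative sketch only (hypothetical names; alpha API fields may differ by release).
# An InferencePool selecting the vLLM pods and pointing at the ext-proc endpoint picker,
# plus an HTTPRoute on the cluster-local gateway that sends OpenAI-protocol traffic to it.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  selector:
    app: vllm-llama3-8b-instruct       # labels on the model server pods
  targetPortNumber: 8000               # port the vLLM servers listen on
  extensionRef:
    name: vllm-llama3-8b-instruct-epp  # ext-proc endpoint-picker service
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway            # Envoy Gateway, kGateway, or GKE Gateway instance
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
```

The hostname served by this route can then be registered as an OpenAI-compatible model provider in a higher-level AI gateway such as LiteLLM, as the paragraph above notes.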