Commit 2d325e8: Update README.md
Parent: 3f1851a

1 file changed: +2, -2 lines

README.md

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
 # Gateway API Inference Extension
 
-The Gateway API Inference Extension - also known as an inference gateway - improves the tail latency and throughput of OpenAI completion requests when load balancing a group of LLM servers on Kubernetes with kv-cache awareness. It provides Kubernetes-native declarative APIs to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades. By adding operational guardrails like priority and fairness to different client model names, the inference gateway allows a platform team to safely serve many different GenAI workloads on the same pool of shared foundation model servers for higher utilization and fewer required accelerators.
+The Gateway API Inference Extension - also known as an inference gateway - improves the tail latency and throughput of LLM completion requests in the OpenAI protocol against Kubernetes-hosted model servers. It provides Kubernetes-native declarative APIs to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades. By adding operational guardrails like priority and fairness to different client model names, the inference gateway allows a platform team to safely serve many different GenAI workloads on the same pool of shared foundation model servers for higher utilization and fewer required accelerators.
 
-The inference gateway is intended for inference platform teams serving self-hosted large language models on Kubernetes. It requires a version of vLLM that supports the necessary metrics to predict traffic. It extends a cluster-local gateway supporting [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) such as Envoy Gateway, kGateway, or the GKE Gateway. The HttpRoute that accepts OpenAI-compatible requests and serves model responses can then be configured as a model provider underneath a higher level AI-Gateway like LiteLLM, Solo AI Gateway, or Apigee, allowing you to integrate local serving with model-as-a-service consumption.
+The inference gateway is intended for inference platform teams serving self-hosted large language models on Kubernetes. It requires a version of vLLM that supports the necessary metrics to predict traffic. It extends a cluster-local gateway supporting [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) - such as Envoy Gateway, kGateway, or the GKE Gateway - with a request scheduling algorithm that is both kv-cache and request weight and priority aware, avoiding evictions or queueing when model servers are highly loaded. The HttpRoute that accepts OpenAI-compatible requests and serves model responses can then be configured as a model provider underneath a higher level AI-Gateway like LiteLLM, Solo AI Gateway, or Apigee, allowing you to integrate local serving with model-as-a-service consumption.
 
 See our website at https://gateway-api-inference-extension.sigs.k8s.io/ for detailed API documentation.
 
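The first paragraph of the updated README describes routing client model names to use-case specific LoRA adapters, weighted A/B traffic splitting, and per-workload priority. A minimal sketch of how that is expressed declaratively is shown below, using the project's InferenceModel resource. The model and adapter names, pool name, and weights are hypothetical, and the API group, version, and field names follow the project's alpha API and may differ in the release you install; see the website linked in the diff for the authoritative reference.

```yaml
# Illustrative sketch only (hypothetical names; alpha API fields may differ by release).
# Routes the client-facing model name "tweet-summarizer" onto two LoRA adapters served
# by the same InferencePool, splitting traffic 90/10 for an incremental adapter rollout.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: tweet-summarizer
spec:
  modelName: tweet-summarizer        # model name clients send in OpenAI-protocol requests
  criticality: Critical              # priority guardrail relative to other workloads on the pool
  poolRef:
    name: vllm-llama3-8b-instruct    # shared pool of base model servers
  targetModels:
  - name: tweet-summarizer-lora-v1   # current adapter version
    weight: 90
  - name: tweet-summarizer-lora-v2   # candidate adapter version under A/B test
    weight: 10
```

Adjusting the weights over time gives the incremental adapter rollout described above, while criticality expresses the priority guardrail relative to other workloads sharing the pool.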

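The second paragraph describes extending an ext-proc capable, cluster-local gateway and exposing an HttpRoute that accepts OpenAI-compatible requests. A rough sketch of that wiring, under the same caveats (hypothetical names, labels, and ports; alpha API fields that may change), pairs an InferencePool selecting the vLLM pods and referencing the endpoint-picker extension with an HTTPRoute whose backendRef points at that pool:

```yaml
# Illustrative sketch only (hypothetical names; alpha API fields may differ by release).
# An InferencePool selecting the vLLM pods and pointing at the ext-proc endpoint picker,
# plus an HTTPRoute on the cluster-local gateway that sends OpenAI-protocol traffic to it.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  selector:
    app: vllm-llama3-8b-instruct       # labels on the model server pods
  targetPortNumber: 8000               # port the vLLM servers listen on
  extensionRef:
    name: vllm-llama3-8b-instruct-epp  # ext-proc endpoint-picker service
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway            # Envoy Gateway, kGateway, or GKE Gateway instance
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
```

The hostname served by this route can then be registered as an OpenAI-compatible model provider in a higher-level AI gateway such as LiteLLM, as the paragraph above notes.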