For each HTTP request, the EPP MUST communicate to the proxy the picked model server endpoint by
adding the `target-pod` HTTP header to the request, or otherwise return an error.
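
For illustration, the header added to the proxied request might look like the line below; the value
format (a pod address and port) is an assumption for this sketch, not something prescribed by the
text above.

```
target-pod: 10.0.0.12:8000
```
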
## Model Server Protocol

This is the protocol between the EPP and the model servers.

### Inference API Protocol

The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs.
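
As a sketch, a compatible server accepts an OpenAI-style Chat Completions request such as the
following (the model name and message are illustrative):

```
POST ${server_endpoint}/v1/chat/completions
{
  "model": "my-model",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}
```
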
### Metrics Reporting

The inference extension scrapes metrics from the model servers to make optimal request scheduling
decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. The exact
metric names do not need to match the recommended names here; however, the metric types and
semantics MUST follow this doc.

Note that the requirements here are aligned with the
[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
effort.

The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated
into the reference endpoint picker implementation.

| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` |
| KVCacheUtilization | Gauge | The current KV cache utilization in percentage. | `vllm:gpu_cache_usage_perc` |
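
For example, a vLLM server reports these as Prometheus gauges similar to the lines below (label
sets and values are illustrative):

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="my-model"} 3.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="my-model"} 0.42
```
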
### LoRA Adapter Serving

Model servers that support dynamic LoRA serving can benefit from the LoRA affinity algorithm. Note
that the current algorithm in the reference EPP is highly biased towards vLLM's current dynamic
LoRA implementation.

The model servers MUST support serving a LoRA adapter specified in the `model` argument of the
request, provided the requested adapter is valid.
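
For instance, a request targeting a hypothetical adapter named `sql-lora` simply sets it in the
`model` field of a regular completion request:

```
POST ${server_endpoint}/v1/chat/completions
{
  "model": "sql-lora",
  "messages": [
    {"role": "user", "content": "Generate a SQL query."}
  ]
}
```
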
The model server MUST expose the following LoRA adapter information via a RESTful API with response
in JSON:

* `Config`
  * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled.
  * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to
    serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and cannot
    load the requested adapter.
* `State`
  * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and
    ready to serve requests.

This is an example API endpoint and response:
```
GET ${server_endpoint}/adapters/info
```
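
A response consistent with the fields above might look like the following; this is an illustrative
sketch, and the exact JSON layout and values are assumptions rather than part of the protocol text:

```
{
  "Config": {
    "LoRAEnabled": true,
    "MaxActiveAdapter": 4
  },
  "State": {
    "ActiveAdapters": ["sql-lora", "chat-lora"]
  }
}
```
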
NOTE: Currently in vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where
`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma-separated
string label `running_lora_adapters`. We will use [this issue](https://github.com/vllm-project/vllm/issues/10086)