
Commit c470005

address comments

1 parent cbc2639 commit c470005

1 file changed: 45 additions & 30 deletions
@@ -1,55 +1,68 @@
-# Model Server Protocol for Gateway API Inference Extension
+# Endpoint Picker Protocol
 
-## Inference API Protocol
+The Endpoint Picker, or EPP, is a core component of the inference extension. Ultimately it's
+responsible for picking an endpoint from the `InferencePool`. A reference implementation can be
+found [here](../../../pkg/ext-proc/).
 
-The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
-and [Chat](https://platform.openai.com/docs/api-reference/chat) API. In the future we are open to
-supporting more API protocols.
+## Proxy Protocol
+
+This is the protocol between the EPP and the proxy (e.g., Envoy).
+
+The EPP MUST implement the Envoy
+[external processing service](https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/ext_proc/v3/external_processor) protocol.
+
+For each HTTP request, the EPP MUST communicate the picked model server endpoint to the proxy by
+adding the `target-pod` HTTP header to the request, or otherwise return an error.
+
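As an illustration of the header mutation described above (outside this commit's diff), the sketch below shows how an EPP might build an ext_proc response that sets the `target-pod` header, using the Envoy go-control-plane bindings. The endpoint value and the choice of the request-headers phase are assumptions for the example, not requirements of this protocol:

```go
// Illustrative sketch only: builds an ext_proc ProcessingResponse asking the
// proxy to set the `target-pod` request header to the picked endpoint.
// The endpoint value and the request-headers phase are assumptions.
package example

import (
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	extprocv3 "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
)

// targetPodResponse wraps the picked model server endpoint in a header mutation.
func targetPodResponse(pickedEndpoint string) *extprocv3.ProcessingResponse {
	return &extprocv3.ProcessingResponse{
		Response: &extprocv3.ProcessingResponse_RequestHeaders{
			RequestHeaders: &extprocv3.HeadersResponse{
				Response: &extprocv3.CommonResponse{
					HeaderMutation: &extprocv3.HeaderMutation{
						SetHeaders: []*corev3.HeaderValueOption{{
							Header: &corev3.HeaderValue{
								Key:   "target-pod",
								Value: pickedEndpoint, // e.g. "10.0.0.5:8000" (placeholder)
							},
						}},
					},
				},
			},
		},
	}
}
```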
+## Model Server Protocol
 
-<details>
-<summary>Why?</summary>
-The extension makes intelligent request scheduling decisions based on certain information from the
-request body, such as the `model` field.
-</details>
+This is the protocol between the EPP and the model servers.
 
-## Metrics Reporting
+### Inference API Protocol
+
+The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
+and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs.
+
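For reference (not part of this diff), a request in the OpenAI Chat Completions format carries the model, or LoRA adapter, name in the `model` field of the JSON body. The sketch below assumes a placeholder server endpoint and model name:

```go
// Illustrative sketch: send an OpenAI-style chat completion request to a model
// server. The server endpoint and model name are placeholders.
package example

import (
	"bytes"
	"encoding/json"
	"net/http"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatCompletionRequest struct {
	Model    string        `json:"model"` // base model or LoRA adapter name
	Messages []chatMessage `json:"messages"`
}

func sendChatCompletion(serverEndpoint, model string) (*http.Response, error) {
	body, err := json.Marshal(chatCompletionRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: "Hello"}},
	})
	if err != nil {
		return nil, err
	}
	return http.Post(serverEndpoint+"/v1/chat/completions", "application/json", bytes.NewReader(body))
}
```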
+### Metrics Reporting
 
 The inference extension scrapes metrics from the model servers to make optimal request scheduling
-decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. While
-the metric names may differ slightly in different model servers, the metric types MUST be the same.
-We will align with the
+decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. The exact
+metric names do not need to match the recommended names here; however, the metric types and
+semantics MUST follow this doc.
+
+Note that the requirements here are aligned with the
 [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
 effort.
 
-We also show the metrics in vLLM, which is already integrated into the inference extension. We will
-add more model servers once they are integrated.
+The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated
+into the reference endpoint picker implementation.
 
 | Metric | Type | Description | vLLM metric |
 | ----- | ---- | ---- | ---- |
 | TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` |
 | KVCacheUtilization | Gauge | The current KV cache utilization in percentage. | `vllm:gpu_cache_usage_perc` |
 
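As an aside from the diff, here is a minimal sketch of how a picker could read the two required gauges from a model server's Prometheus endpoint. The metric names come from the vLLM column of the table above; the use of `prometheus/common/expfmt` and the scrape URL are assumptions for illustration, not what the reference implementation prescribes:

```go
// Illustrative sketch: scrape a model server's Prometheus endpoint and read a
// gauge value. The metric names used in the usage note are vLLM's, per the table.
package example

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

// scrapeGauge fetches metricsURL and returns the value of the first sample of
// the named gauge metric family.
func scrapeGauge(metricsURL, name string) (float64, error) {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		return 0, err
	}
	family, ok := families[name]
	if !ok || len(family.GetMetric()) == 0 {
		return 0, fmt.Errorf("metric %q not found", name)
	}
	return family.GetMetric()[0].GetGauge().GetValue(), nil
}

// Example usage (the endpoint is a placeholder):
//   queued, _ := scrapeGauge("http://10.0.0.5:8000/metrics", "vllm:num_requests_waiting")
//   kvUtil, _ := scrapeGauge("http://10.0.0.5:8000/metrics", "vllm:gpu_cache_usage_perc")
```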

-## [Experimental] LoRA Adapter Serving
 
-Model servers that support dynamic LoRA serving can gain additional benefit from the inference
-extension's LoRA affinity algorithm. As dynamic LoRA serving is quite new and evolving, this part is considered experimental and subject to changes in future releases.
+### LoRA Adapter Serving
 
-The inference extension expects the following behavior from compatible model servers.
+Model servers that support dynamic LoRA serving can benefit from the LoRA affinity algorithm. Note
+that the current algorithm in the reference EPP is highly biased towards vLLM's current dynamic
+LoRA implementation.
 
-* Support running multiple LoRA adapters in parallel in the same decode batch.
-* Dynamically load/unload adapters in GPU memory from/to a cache (e.g., in host memory) depending on
-  the requested adapters in the current batch.
+The model server MUST support serving a LoRA adapter specified in the `model` argument of the
+request, provided the requested adapter is valid.
 
-The model server MUST expose the following LoRA adapter information via a RESTful API with response in JSON :
+The model server MUST expose the following LoRA adapter information via a RESTful API with the
+response in JSON:
 
 * `Config`
   * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled.
-  * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to serve a batch.
-    Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the
-    requested adapter.
+  * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to
+    serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and
+    cannot load the requested adapter.
 * `State`
-  * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and ready to serve
-    requests.
+  * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and
+    ready to serve requests.
 
 This is an example API endpoint and response:
 ```
@@ -69,4 +82,6 @@ GET ${server_endpoint}/adapters/info
 ```
 
 NOTE: Currently in vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where
-`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma separated string label `running_lora_adapters`.
+`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma-separated
+string label `running_lora_adapters`. We will use [this issue](https://github.com/vllm-project/vllm/issues/10086)
+to track integration efforts with vLLM.
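The example response body referenced above is elided from this hunk. Purely as an illustration of the `Config` and `State` fields described earlier, a client could model the payload with Go types like the following; the exact JSON key names and casing are assumptions, not defined by this doc:

```go
// Illustrative sketch: Go types mirroring the adapter info fields described
// above. The JSON key names and casing are assumptions for the example.
package example

// AdapterInfo is a hypothetical shape for the LoRA adapter info response.
type AdapterInfo struct {
	Config struct {
		LoRAEnabled      bool `json:"LoRAEnabled"`      // whether dynamic LoRA serving is enabled
		MaxActiveAdapter int  `json:"MaxActiveAdapter"` // max adapters loadable to GPU memory for a batch
	} `json:"Config"`
	State struct {
		ActiveAdapters []string `json:"ActiveAdapters"` // adapters currently loaded in GPU memory
	} `json:"State"`
}
```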
