For each HTTP request, the EPP MUST communicate to the proxy the picked model server endpoint by
adding the `target-pod` HTTP header to the request, or otherwise return an error.
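
For illustration, the header added to the proxied request might look like the line below; the value
format (a pod address and port) is an assumption for this sketch, not something prescribed by the
text above.

```
target-pod: 10.0.0.12:8000
```
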
## Model Server Protocol

This is the protocol between the EPP and the model servers.

### Inference API Protocol

The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs.
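
As a sketch, a compatible server accepts an OpenAI-style Chat Completions request such as the
following (the model name and message are illustrative):

```
POST ${server_endpoint}/v1/chat/completions
{
  "model": "my-model",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}
```
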
### Metrics Reporting

The inference extension scrapes metrics from the model servers to make optimal request scheduling
decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. The exact
metric names do not need to match the recommended names here; however, the metric types and
semantics MUST follow this doc.

Note that the requirements here are aligned with the
[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
effort.

The corresponding metrics in vLLM are also shown in the table below, as vLLM is already integrated
into the reference endpoint picker implementation.

| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` |
| KVCacheUtilization | Gauge | The current KV cache utilization in percentage. | `vllm:gpu_cache_usage_perc` |
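
For example, a vLLM server reports these as Prometheus gauges similar to the lines below (label
sets and values are illustrative):

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="my-model"} 3.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="my-model"} 0.42
```
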
### LoRA Adapter Serving

Model servers that support dynamic LoRA serving can benefit from the LoRA affinity algorithm. Note
that the current algorithm in the reference EPP is highly biased towards vLLM's current dynamic
LoRA implementation.

The model servers MUST support serving a LoRA adapter specified in the `model` argument of the
request, provided the requested adapter is valid.
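
For instance, a request targeting a hypothetical adapter named `sql-lora` simply sets it in the
`model` field of a regular completion request:

```
POST ${server_endpoint}/v1/chat/completions
{
  "model": "sql-lora",
  "messages": [
    {"role": "user", "content": "Generate a SQL query."}
  ]
}
```
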
The model server MUST expose the following LoRA adapter information via a RESTful API with response
in JSON:

* `Config`
  * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled.
  * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to
    serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and cannot
    load the requested adapter.
* `State`
  * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and
    ready to serve requests.

This is an example API endpoint and response:
```
GET ${server_endpoint}/adapters/info
```
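
A response consistent with the fields above might look like the following; this is an illustrative
sketch, and the exact JSON layout and values are assumptions rather than part of the protocol text:

```
{
  "Config": {
    "LoRAEnabled": true,
    "MaxActiveAdapter": 4
  },
  "State": {
    "ActiveAdapters": ["sql-lora", "chat-lora"]
  }
}
```
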
NOTE: Currently in vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where
`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma-separated
string label `running_lora_adapters`. We will use [this issue](https://github.com/vllm-project/vllm/issues/10086)