docs/proposals/003-model-server-protocol/protocol.md (19 additions, 53 deletions)
## Metrics Reporting
The inference extension scrapes metrics from the model servers to make optimal request scheduling
decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. While
the metric names may differ slightly in different model servers, the metric types MUST be the same.
We will align with the
[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
effort.

We also show the metrics in vLLM, which is already integrated into the inference extension. We will
add more model servers once they are integrated.

| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue.|`vllm:num_requests_waiting`|
| KVCacheUtilization| Gauge | The current KV cache utilization in percentage.|`vllm:gpu_cache_usage_perc`|
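
For illustration only, the sketch below shows how a consumer could scrape and read these two metrics, assuming the Python `requests` and `prometheus_client` packages and a model server exposing Prometheus metrics at `http://localhost:8000/metrics` (both are assumptions for the example, not requirements of this protocol).

```
# Non-normative sketch: scrape TotalQueuedRequests and KVCacheUtilization from a
# vLLM /metrics endpoint. The URL and package choices are assumptions.
import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8000/metrics"  # assumed endpoint

def scrape_scheduling_metrics(url: str = METRICS_URL) -> dict:
    text = requests.get(url, timeout=5).text
    values = {}
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            if sample.name == "vllm:num_requests_waiting":
                values["TotalQueuedRequests"] = sample.value
            elif sample.name == "vllm:gpu_cache_usage_perc":
                values["KVCacheUtilization"] = sample.value
    return values

print(scrape_scheduling_metrics())
```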
## [Experimental] LoRA Adapter Serving
Model servers that support dynamic LoRA serving can gain additional benefit from the inference
extension's LoRA affinity algorithm. As dynamic LoRA serving is quite new and evolving, this part is considered experimental and subject to change in future releases.
The inference extension expects the following behavior from compatible model servers.
* Support running multiple LoRA adapters in parallel in the same decode batch.
* Dynamically load/unload adapters in GPU memory from/to a cache (e.g., in host memory) depending on
  the requested adapters in the current batch (see the sketch below).
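
A minimal, non-normative sketch of the load/unload bookkeeping described above, assuming a simple LRU eviction policy and hypothetical `load_fn`/`unload_fn` hooks (neither the policy nor these names are mandated by this protocol):

```
# Illustrative only: an LRU cache of adapters resident in GPU memory.
from collections import OrderedDict

class AdapterCache:
    def __init__(self, max_active_adapters: int):
        self.max_active = max_active_adapters
        self.active = OrderedDict()  # adapter id -> handle to loaded weights

    def ensure_loaded(self, adapter_id: str, load_fn, unload_fn):
        """Make sure `adapter_id` is resident in GPU memory for the next batch."""
        if adapter_id in self.active:
            self.active.move_to_end(adapter_id)  # mark as most recently used
            return self.active[adapter_id]
        if len(self.active) >= self.max_active:
            victim, handle = self.active.popitem(last=False)  # evict least recently used
            unload_fn(victim, handle)  # e.g. move weights back to the host-memory cache
        self.active[adapter_id] = load_fn(adapter_id)  # e.g. copy weights to GPU memory
        return self.active[adapter_id]
```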
The model server MUST expose the following LoRA adapter information via a RESTful API that returns JSON:
* `Config`
  * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled.
  * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to serve a batch.
    Requests will be queued if the model server has reached `MaxActiveAdapter` and cannot load the
    requested adapter.
* `State`
  * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and ready to serve
    requests.
This is an example API endpoint and response (the JSON field names below are illustrative):
```
GET ${server_endpoint}/adapters/info
```
```
{
    "config": {
        "LoRAEnabled": true,
        "MaxActiveAdapter": 4
    },
    "state": {
        "ActiveAdapters": ["adapter1", "adapter2"]
    }
}
```
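
For illustration only, a client-side sketch that consumes such an endpoint, assuming the example path, response layout, and field names shown above plus the Python `requests` package:

```
# Non-normative sketch: query the adapter info endpoint and check whether an
# adapter is already resident in GPU memory. Path and field names follow the
# example above and are assumptions, not a normative schema.
import requests

def get_adapter_info(server_endpoint: str) -> dict:
    resp = requests.get(f"{server_endpoint}/adapters/info", timeout=5)
    resp.raise_for_status()
    return resp.json()

def is_adapter_active(info: dict, adapter_id: str) -> bool:
    if not info["config"]["LoRAEnabled"]:
        return False
    return adapter_id in info["state"]["ActiveAdapters"]

info = get_adapter_info("http://localhost:8000")
print(is_adapter_active(info, "my-adapter"))
```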
NOTE: Currently in vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where
`MaxActiveAdapter` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma-separated string label `running_lora_adapters`.
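
Since vLLM currently exposes this information only through those metric labels, a consumer could recover the fields from a metrics scrape. A minimal sketch, assuming the Python `prometheus_client` package and the label names in the note above:

```
# Non-normative sketch: derive MaxActiveAdapter and ActiveAdapters from the
# vllm:lora_requests_info metric labels (vLLM v0.6.6 behavior per the note above).
from prometheus_client.parser import text_string_to_metric_families

def parse_lora_info(metrics_text: str) -> dict:
    for family in text_string_to_metric_families(metrics_text):
        for sample in family.samples:
            if sample.name == "vllm:lora_requests_info":
                running = sample.labels.get("running_lora_adapters", "")
                return {
                    "MaxActiveAdapter": int(sample.labels["max_lora"]),
                    "ActiveAdapters": [a for a in running.split(",") if a],
                }
    return {"MaxActiveAdapter": 0, "ActiveAdapters": []}
```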