
Commit cbc2639

Remove future work and focus on current release
1 parent a5e340c commit cbc2639


docs/proposals/003-model-server-protocol/protocol.md

Lines changed: 19 additions & 53 deletions
@@ -15,63 +15,47 @@ request body, such as the `model` field.
 ## Metrics Reporting
 
 The inference extension scrapes metrics from the model servers to make optimal request scheduling
-decisions. The model servers SHOULD provide the following metrics via a Prometheus endpoint. While
+decisions. The model servers MUST provide the following metrics via a Prometheus endpoint. While
 the metric names may differ slightly in different model servers, the metric types MUST be the same.
 We will align with the
 [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
 effort.
 
 We also show the metrics in vLLM, which is already integrated into the inference extension. We will
-add more model server once they are integrated.
+add more model servers once they are integrated.
 
 | Metric | Type | Description | vLLM metric |
 | ----- | ---- | ---- | ---- |
 | TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`|
 | KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`|
 
-
-### Future Metrics
-The following metrics MAY be needed in the future for further optimization.
-
-| Metric |Type | Description | vLLM metric |
-| ----- | ---- | ---- | ---- |
-| TotalTokensInCurrentBatch | Gauge | Number of tokens in the current batch.| `vllm:num_tokens_running`|
-| TotalQueuedTokens| Gauge | The current total number of tokens in the queued requests.| `vllm:num_tokens_waiting` (need to be added)|
-| MaxTokenCapacity| Gauge | The total size of the KV cache in number of tokens.| `vllm:max_token_capacity` <br> NOTE: This info is available indirectly in [`cache_config_info`](https://github.com/vllm-project/vllm/blob/15702038642192002cd8973cf8948751b750fd07/vllm/engine/metrics.py#L551) metric already , and also proposed in [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk). |
-| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds. W will be decided by simulation/benchmarking. In time series metric the latency is typically reported as Histogram and we can derive the average from the Histogram. | `vllm:time_to_first_token_seconds` |
-| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds. W will be decided by simulation/benchmarking. | `vllm:time_per_output_token_seconds` |
-
-## LoRA Adapter Serving
-
-### Dynamic LoRA Serving
+## [Experimental] LoRA Adapter Serving
 
 Model servers that support dynamic LoRA serving can gain additional benefit from the inference
-extension's LoRA affinity algorithm. While dynamic LoRA serving is quite new and evolving, and there
-is no common standard, the inference extension generally expects the following behavior.
+extension's LoRA affinity algorithm. As dynamic LoRA serving is quite new and evolving, this part is considered experimental and subject to change in future releases.
+
+The inference extension expects the following behavior from compatible model servers.
 
 * Support running multiple LoRA adapters in parallel in the same decode batch.
-* Dynamically load/unload adapters in GPU memory from/to a cahe (e.g., in host memory) depending on
+* Dynamically load/unload adapters in GPU memory from/to a cache (e.g., in host memory) depending on
   the requested adapters in the current batch.
 
-The model server SHOULD expose the following information via an API:
+The model server MUST expose the following LoRA adapter information via a RESTful API with the response in JSON:
 
-* AdapterConfig
-  * LoRAEnabled: Whether dynamic LoRA serving is enabled.
-  * MaxActiveAdapter: Maximum number of adapters that can be loaded to GPU memory to serve a batch.
+* `Config`
+  * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled.
+  * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to serve a batch.
    Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the
-   requested adapter. In vLLM, this is currently exposed as a string label `max_lora` in the
-   `vllm:lora_requests_info` metric.
-* AdapterState
-  * ActiveAdapters: A list of adapters that are currently loaded in GPU memory and ready to servce
-    requests. In vLLM, this is currently exposed as a comma separated string label `running_lora_adapters`
-    in the `vllm:lora_requests_info` metric.
-
-The API MAY look like this:
+   requested adapter.
+* `State`
+  * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and ready to serve
+    requests.
+
+This is an example API endpoint and response:
 ```
 GET ${server_endpoint}/adapters/info
 ```
 
-And the response MAY look like this:
 ```
 {
 "config": {
@@ -84,23 +68,5 @@ And the response MAY look like this:
 }
 ```
 
-#### Dynamically Register/Unregister Adapters
-
-Model servers SHOULD have APIs to dynamically register/unregister models (usually LoRA adapters).
-This enables platform teams to multiplex multiple LoRA adapters on shared model servers and
-dynamically rollout LoRA adapters.
-
-NOTE this is not a strict requirement from the inference extension, but a critical feature for CI/CD
-integration.
-
-While we don’t intend to dictate how model servers should implement this API, a reference REST API
-MAY look this:
-
-```
-POST ${server_endpoint}/adapters/{adapter-id}
-{
-        "path": "path/to/my/adapter"
-}
-
-DELETE ${server_endpoint}/adapters/{adapter-id}
-```
+NOTE: Currently in vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where
+`MaxActiveAdapter` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma-separated string label `running_lora_adapters`.
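
As a minimal, non-normative sketch of the metrics side of this protocol, the Go snippet below reads the two required gauges from a model server's Prometheus endpoint, and also pulls the LoRA labels that the NOTE above describes for vLLM. The endpoint address is a placeholder, and the use of the `prometheus/common/expfmt` text parser is an implementation choice, not something the protocol prescribes; the metric and label names are the vLLM ones referenced in the diff and may differ on other servers.

```go
// Sketch: scrape the protocol metrics from a model server's /metrics endpoint.
package main

import (
	"fmt"
	"net/http"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// firstGauge returns the value of the first sample in a gauge metric family.
func firstGauge(families map[string]*dto.MetricFamily, name string) (float64, bool) {
	mf, ok := families[name]
	if !ok || len(mf.GetMetric()) == 0 {
		return 0, false
	}
	return mf.GetMetric()[0].GetGauge().GetValue(), true
}

// label returns the value of a label on the first sample of a metric family.
func label(families map[string]*dto.MetricFamily, name, labelName string) (string, bool) {
	mf, ok := families[name]
	if !ok || len(mf.GetMetric()) == 0 {
		return "", false
	}
	for _, lp := range mf.GetMetric()[0].GetLabel() {
		if lp.GetName() == labelName {
			return lp.GetValue(), true
		}
	}
	return "", false
}

func main() {
	// Placeholder model server address; substitute the real endpoint.
	resp, err := http.Get("http://model-server:8000/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// The two required gauges, using the vLLM metric names from the table.
	if v, ok := firstGauge(families, "vllm:num_requests_waiting"); ok {
		fmt.Println("TotalQueuedRequests:", v)
	}
	if v, ok := firstGauge(families, "vllm:gpu_cache_usage_perc"); ok {
		fmt.Println("KVCacheUtilization:", v)
	}

	// LoRA info as currently exposed by vLLM via metric labels (see the NOTE above).
	if adapters, ok := label(families, "vllm:lora_requests_info", "running_lora_adapters"); ok {
		fmt.Println("ActiveAdapters (comma-separated):", adapters)
	}
}
```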

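Along the same lines, here is a hedged sketch of a client for the `GET ${server_endpoint}/adapters/info` endpoint shown in the diff. Only the `config` key appears verbatim in the (truncated) example response; the other JSON field names, and the server address, are illustrative assumptions rather than part of the protocol.

```go
// Sketch: read Config/State adapter info from a model server (assumed JSON field names).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type adapterConfig struct {
	LoRAEnabled      bool `json:"loraEnabled"`      // assumed field name
	MaxActiveAdapter int  `json:"maxActiveAdapter"` // assumed field name
}

type adapterState struct {
	ActiveAdapters []string `json:"activeAdapters"` // assumed field name
}

type adapterInfo struct {
	Config adapterConfig `json:"config"` // "config" appears in the example response
	State  adapterState  `json:"state"`  // assumed field name
}

func main() {
	// ${server_endpoint} is a placeholder; substitute the model server address.
	resp, err := http.Get("http://model-server:8000/adapters/info")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var info adapterInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		panic(err)
	}
	fmt.Printf("LoRA enabled: %v, max active adapters: %d, active now: %v\n",
		info.Config.LoRAEnabled, info.Config.MaxActiveAdapter, info.State.ActiveAdapters)
}
```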