
Commit 978ec6c

document current lora metrics
1 parent c470005 commit 978ec6c

1 file changed: +11 -33 lines changed

docs/proposals/003-model-server-protocol/protocol.md

Lines changed: 11 additions & 33 deletions
@@ -52,36 +52,14 @@ implementation.
 The model servers MUST support serving a LoRA adapter specified in the `model` argument of the
 request, provided the requested adapter is valid.
 
-The model server MUST expose the following LoRA adapter information via a RESTful API with response
-in JSON:
-
-* `Config`
-  * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled.
-  * `MaxActiveAdapters`: integer, the maximum number of adapters that can be loaded to GPU memory to
-    serve a batch. Requests will be queued if the model server has reached `MaxActiveAdapters` and
-    cannot load the requested adapter.
-* `State`
-  * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and
-    ready to serve requests.
-
-This is an example API endpoint and response:
-```
-GET ${server_endpoint}/adapters/info
-```
-
-```
-{
-  "config": {
-    "enabled": true,
-    "maxActiveAdapters": 4
-  },
-  "state": {
-    "activeAdapters": ["adapter1", "adapter2"]
-  }
-}
-```
-
-NOTE: Currently in vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where
-`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a
-comma-separated string label `running_lora_adapters`. We will use [this issue](https://github.com/vllm-project/vllm/issues/10086)
-to track integration efforts with vLLM.
+The model server MUST expose the following LoRA adapter metrics via the same Prometheus endpoint:
+
+* Suggested metric name: `lora_info`; metric name implemented in vLLM: `vllm:lora_requests_info`
+* Metric type: Gauge
+* Metric value: the last-updated timestamp (so the EPP can find the latest).
+* Metric labels:
+  * `max_lora`: the maximum number of adapters that can be loaded to GPU memory to serve a batch.
+    Requests will be queued if the model server has reached `max_lora` and cannot load the
+    requested adapter. Example: `"max_lora": "8"`.
+  * `running_lora_adapters`: a comma-separated list of adapters that are currently loaded in GPU
+    memory and ready to serve requests. Example: `"running_lora_adapters": "adapter1, adapter2"`.
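
The gauge-whose-value-is-a-timestamp pattern in the added section can be sketched as follows. This is a minimal, self-contained illustration, not part of the proposal: the exposition snippet and label values are hypothetical, and a real EPP would scrape the server's `/metrics` endpoint and use a proper Prometheus text-format parser rather than a regex.

```python
import re

# Hypothetical /metrics excerpt in Prometheus text format. Two samples of the
# same metric exist; the gauge value is the last-updated timestamp, so the
# consumer picks the sample with the largest value.
EXPOSITION = """\
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="8",running_lora_adapters="adapter1, adapter2"} 1700000100.0
vllm:lora_requests_info{max_lora="8",running_lora_adapters="adapter1"} 1700000000.0
"""

SAMPLE_RE = re.compile(r'^vllm:lora_requests_info\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def latest_lora_info(text: str) -> dict:
    """Return the labels of the vllm:lora_requests_info sample with the
    newest timestamp value, with max_lora parsed as an int and
    running_lora_adapters split into a list."""
    samples = []
    for line in text.splitlines():
        m = SAMPLE_RE.match(line)
        if m:
            labels = dict(LABEL_RE.findall(m.group("labels")))
            samples.append((float(m.group("value")), labels))
    if not samples:
        raise ValueError("no vllm:lora_requests_info samples found")
    # The gauge value is the last-updated timestamp: keep the newest sample.
    _, labels = max(samples, key=lambda s: s[0])
    return {
        "max_lora": int(labels["max_lora"]),
        "running_lora_adapters": [
            a.strip()
            for a in labels["running_lora_adapters"].split(",")
            if a.strip()
        ],
    }


info = latest_lora_info(EXPOSITION)
print(info)  # {'max_lora': 8, 'running_lora_adapters': ['adapter1', 'adapter2']}
```

Because both `max_lora` and `running_lora_adapters` are string labels rather than numeric metric values, the consumer has to do this kind of parsing itself; that is the trade-off this metric shape accepts in exchange for fitting into the existing Prometheus endpoint.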
