@@ -52,36 +52,14 @@ implementation.
The model servers MUST support serving a LoRA adapter specified in the `model` argument of the
request, provided the requested adapter is valid.

- The model server MUST expose the following LoRA adapter information via a RESTful API with response
- in JSON:
-
- * `Config`
-   * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled.
-   * `MaxActiveAdapter`: integer, the maximum number of adapters that can be loaded to GPU memory to
-     serve a batch. Requests will be queued if the model server has reached MaxActiveAdapter and cannot
-     load the requested adapter.
- * `State`
-   * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and
-     ready to serve requests.
-
- This is an example API endpoint and response:
- ```
- GET ${server_endpoint}/adapters/info
- ```
-
- ```
- {
-   "config": {
-     "enabled": true,
-     "maxActiveAdapters": 4
-   },
-   "state": {
-     "activeAdapters": ["adapter1", "adapter2"]
-   }
- }
- ```
-
- NOTE: Currently in vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where
- `MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a
- comma-separated string label `running_lora_adapters`. We will use [this issue](https://github.com/vllm-project/vllm/issues/10086)
- to track integration efforts with vLLM.
+ The model server MUST expose the following LoRA adapter metrics via the same Prometheus endpoint:
+
+ * Suggested metric name: `lora_info`; metric name implemented in vLLM: `vllm:lora_requests_info`.
+ * Metric type: Gauge.
+ * Metric value: the last updated timestamp (so the EPP can find the latest).
+ * Metric labels:
+   * `max_lora`: the maximum number of adapters that can be loaded to GPU memory to serve a batch.
+     Requests will be queued if the model server has reached `max_lora` and cannot load the
+     requested adapter. Example: `"max_lora": "8"`.
+   * `running_lora_adapters`: a comma-separated list of adapters that are currently loaded in GPU
+     memory and ready to serve requests. Example: `"running_lora_adapters": "adapter1, adapter2"`.
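
The metric contract above can be sketched from the consumer side. The snippet below is an illustrative Python example, not part of the protocol: the sample scrape line and the regex-based parsing are assumptions about what a Prometheus text-format exposition of `vllm:lora_requests_info` looks like, and a real EPP would use a proper exposition-format parser.

```python
import re

# Assumed sample line from a Prometheus text-format scrape; the label
# values follow the examples in the spec above, and the gauge value is
# the last-updated timestamp.
scrape = (
    'vllm:lora_requests_info'
    '{max_lora="8",running_lora_adapters="adapter1, adapter2"} '
    '1.708500e+09\n'
)

# Match a metric name (which may contain ':'), its label block, and its value.
pattern = re.compile(r'^(?P<name>[\w:]+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')

for line in scrape.splitlines():
    m = pattern.match(line)
    if not m or m.group("name") != "vllm:lora_requests_info":
        continue
    # Extract key="value" label pairs into a dict.
    labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group("labels")))
    max_lora = int(labels["max_lora"])
    # The adapter list is a comma-separated string; strip whitespace per entry.
    running = [a.strip() for a in labels["running_lora_adapters"].split(",") if a.strip()]
    updated_at = float(m.group("value"))  # last-updated timestamp per the spec

print(max_lora)   # 8
print(running)    # ['adapter1', 'adapter2']
```

Because the metric value is a timestamp, a consumer scraping multiple series can keep only the sample with the largest value to find the latest adapter state.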