
Commit a5e340c

Add model server protocol proposal
1 parent adad31c commit a5e340c

1 file changed: +106 -0 lines changed

# Model Server Protocol for Gateway API Inference Extension

## Inference API Protocol

The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs. In the future we are open to
supporting more API protocols.

<details>
<summary>Why?</summary>
The extension makes intelligent request scheduling decisions based on certain information from the
request body, such as the `model` field.
</details>
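
For illustration, a minimal Chat Completions request (a sketch only; the endpoint path follows the
OpenAI API convention and the model name is a placeholder) carries the `model` field that the
extension inspects:

```
POST ${server_endpoint}/v1/chat/completions
{
  "model": "my-model",
  "messages": [{"role": "user", "content": "Hello"}]
}
```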

## Metrics Reporting

The inference extension scrapes metrics from the model servers to make optimal request scheduling
decisions. The model servers SHOULD provide the following metrics via a Prometheus endpoint. While
the metric names may differ slightly across model servers, the metric types MUST be the same. We
will align with the
[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
effort.

We also show the corresponding metrics in vLLM, which is already integrated into the inference
extension. We will add more model servers once they are integrated.

| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue. | `vllm:num_requests_waiting` |
| KVCacheUtilization | Gauge | The current KV cache utilization in percentage. | `vllm:gpu_cache_usage_perc` |
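
For illustration only, a scrape of a vLLM server's Prometheus endpoint could include lines like the
following (label sets and values are made up for the example):

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="my-model"} 3.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="my-model"} 0.42
```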

### Future Metrics

The following metrics MAY be needed in the future for further optimization.

| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalTokensInCurrentBatch | Gauge | Number of tokens in the current batch. | `vllm:num_tokens_running` |
| TotalQueuedTokens | Gauge | The current total number of tokens in the queued requests. | `vllm:num_tokens_waiting` (needs to be added) |
| MaxTokenCapacity | Gauge | The total size of the KV cache in number of tokens. | `vllm:max_token_capacity` <br> NOTE: This info is already available indirectly in the [`cache_config_info`](https://github.com/vllm-project/vllm/blob/15702038642192002cd8973cf8948751b750fd07/vllm/engine/metrics.py#L551) metric, and is also proposed in [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk). |
| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds, where W will be decided by simulation/benchmarking. In time-series metrics, latency is typically reported as a histogram, from which the average can be derived (see the example below the table). | `vllm:time_to_first_token_seconds` |
| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds, where W will be decided by simulation/benchmarking. | `vllm:time_per_output_token_seconds` |
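
For example, assuming the vLLM histogram above, a recent average for TimePerDecodeToken could be
derived from its `_sum` and `_count` series with a PromQL expression like the following (the 30s
window is only a placeholder for W):

```
rate(vllm:time_per_output_token_seconds_sum[30s])
  / rate(vllm:time_per_output_token_seconds_count[30s])
```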

## LoRA Adapter Serving

### Dynamic LoRA Serving

Model servers that support dynamic LoRA serving can gain additional benefit from the inference
extension's LoRA affinity algorithm. While dynamic LoRA serving is quite new and evolving, and there
is no common standard, the inference extension generally expects the following behavior.

* Support running multiple LoRA adapters in parallel in the same decode batch.
* Dynamically load/unload adapters in GPU memory from/to a cache (e.g., in host memory) depending on
  the requested adapters in the current batch.

The model server SHOULD expose the following information via an API:

* AdapterConfig
  * LoRAEnabled: Whether dynamic LoRA serving is enabled.
  * MaxActiveAdapters: Maximum number of adapters that can be loaded to GPU memory to serve a batch.
    Requests will be queued if the model server has reached MaxActiveAdapters and cannot load the
    requested adapter. In vLLM, this is currently exposed as a string label `max_lora` in the
    `vllm:lora_requests_info` metric.
* AdapterState
  * ActiveAdapters: A list of adapters that are currently loaded in GPU memory and ready to serve
    requests. In vLLM, this is currently exposed as a comma-separated string label
    `running_lora_adapters` in the `vllm:lora_requests_info` metric (see the example metric line
    after this list).
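
For reference, in vLLM this information currently surfaces as labels on a single gauge; a scraped
metric line might look like the following (values are illustrative and additional labels are
omitted):

```
vllm:lora_requests_info{max_lora="4",running_lora_adapters="adapter1,adapter2"} 1.0
```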

The API MAY look like this:

```
GET ${server_endpoint}/adapters/info
```

And the response MAY look like this:

```
{
  "config": {
    "enabled": true,
    "maxActiveAdapters": 4
  },
  "state": {
    "activeAdapters": ["adapter1", "adapter2"]
  }
}
```

#### Dynamically Register/Unregister Adapters

Model servers SHOULD have APIs to dynamically register/unregister models (usually LoRA adapters).
This enables platform teams to multiplex multiple LoRA adapters on shared model servers and
dynamically roll out LoRA adapters.

NOTE this is not a strict requirement from the inference extension, but a critical feature for CI/CD
integration.

While we don’t intend to dictate how model servers should implement this API, a reference REST API
MAY look like this:

```
POST ${server_endpoint}/adapters/{adapter-id}
{
  "path": "path/to/my/adapter"
}

DELETE ${server_endpoint}/adapters/{adapter-id}
```
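
As a usage sketch of the reference API above (the adapter id and path are placeholders), registering
and then removing an adapter could look like this:

```
curl -X POST ${server_endpoint}/adapters/my-adapter \
  -d '{"path": "path/to/my/adapter"}'

curl -X DELETE ${server_endpoint}/adapters/my-adapter
```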
