# Model Server Protocol for Gateway API Inference Extension

## Inference API Protocol

The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs. In the future we are open to
supporting more API protocols.

<details>
<summary>Why?</summary>
The extension makes intelligent request scheduling decisions based on certain information from the
request body, such as the `model` field.
</details>

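For illustration, a Chat API request carries the model name in the `model` field of the request body, which the extension can inspect when making scheduling decisions (the endpoint variable and model name below are placeholders):

```
POST ${server_endpoint}/v1/chat/completions
{
  "model": "my-lora-adapter",
  "messages": [{"role": "user", "content": "Hello!"}]
}
```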

## Metrics Reporting

The inference extension scrapes metrics from the model servers to make optimal request scheduling
decisions. The model servers SHOULD provide the following metrics via a Prometheus endpoint. While
the metric names may differ slightly across model servers, the metric types MUST be the same.
We will align with the
[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
effort.

We also show the corresponding metrics in vLLM, which is already integrated into the inference
extension. We will add more model servers once they are integrated.

| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`|
| KVCacheUtilization| Gauge | The current KV cache utilization in percentage.| `vllm:gpu_cache_usage_perc`|

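As an illustration, a Prometheus scrape of a vLLM server could contain samples such as the following (label sets and values are examples only):

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-2-7b-hf"} 3.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-2-7b-hf"} 0.27
```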

### Future Metrics
The following metrics MAY be needed in the future for further optimization.

| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalTokensInCurrentBatch | Gauge | Number of tokens in the current batch.| `vllm:num_tokens_running`|
| TotalQueuedTokens| Gauge | The current total number of tokens in the queued requests.| `vllm:num_tokens_waiting` (needs to be added)|
| MaxTokenCapacity| Gauge | The total size of the KV cache in number of tokens.| `vllm:max_token_capacity` <br> NOTE: This info is already available indirectly in the [`cache_config_info`](https://github.com/vllm-project/vllm/blob/15702038642192002cd8973cf8948751b750fd07/vllm/engine/metrics.py#L551) metric, and is also proposed in the [model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk) effort. |
| TimePerPrefillToken | Histogram | The prefill latency per token in the last W seconds, where W will be decided by simulation/benchmarking. In time series metrics, latency is typically reported as a Histogram, and the average can be derived from it (see the example query below the table). | `vllm:time_to_first_token_seconds` |
| TimePerDecodeToken | Histogram | The decode latency per token in the last W seconds, where W will be decided by simulation/benchmarking. | `vllm:time_per_output_token_seconds` |

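For example, because Prometheus histograms expose `_sum` and `_count` series, the average decode latency per token over a window could be derived with a query along these lines (the `5m` window stands in for W):

```
rate(vllm:time_per_output_token_seconds_sum[5m])
  / rate(vllm:time_per_output_token_seconds_count[5m])
```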

## LoRA Adapter Serving

### Dynamic LoRA Serving

Model servers that support dynamic LoRA serving can gain additional benefit from the inference
extension's LoRA affinity algorithm. While dynamic LoRA serving is quite new and evolving, and there
is no common standard, the inference extension generally expects the following behavior:

* Support running multiple LoRA adapters in parallel in the same decode batch.
* Dynamically load/unload adapters in GPU memory from/to a cache (e.g., in host memory) depending on
  the requested adapters in the current batch.

The model server SHOULD expose the following information via an API:

* AdapterConfig
  * LoRAEnabled: Whether dynamic LoRA serving is enabled.
  * MaxActiveAdapters: Maximum number of adapters that can be loaded into GPU memory to serve a batch.
    Requests will be queued if the model server has reached MaxActiveAdapters and cannot load the
    requested adapter. In vLLM, this is currently exposed as a string label `max_lora` in the
    `vllm:lora_requests_info` metric.
* AdapterState
  * ActiveAdapters: A list of adapters that are currently loaded in GPU memory and ready to serve
    requests. In vLLM, this is currently exposed as a comma-separated string label `running_lora_adapters`
    in the `vllm:lora_requests_info` metric (see the sample scrape below).

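For illustration, a scrape of vLLM's `vllm:lora_requests_info` metric might look roughly like the following (the label values and sample value are examples only):

```
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="4",running_lora_adapters="adapter1,adapter2"} 1.0
```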

The API MAY look like this:
```
GET ${server_endpoint}/adapters/info
```

And the response MAY look like this:
```
{
  "config": {
    "enabled": true,
    "maxActiveAdapters": 4
  },
  "state": {
    "activeAdapters": ["adapter1", "adapter2"]
  }
}
```

#### Dynamically Register/Unregister Adapters

Model servers SHOULD have APIs to dynamically register/unregister models (usually LoRA adapters).
This enables platform teams to multiplex multiple LoRA adapters on shared model servers and
dynamically roll out LoRA adapters.

NOTE: this is not a strict requirement from the inference extension, but a critical feature for CI/CD
integration.

While we don’t intend to dictate how model servers should implement this API, a reference REST API
MAY look like this:

```
POST ${server_endpoint}/adapters/{adapter-id}
{
  "path": "path/to/my/adapter"
}

DELETE ${server_endpoint}/adapters/{adapter-id}
```