Add Endpoint Picker Protocol Proposal #164

Merged · 4 commits · Jan 29, 2025
72 changes: 72 additions & 0 deletions docs/proposals/003-model-server-protocol/protocol.md
@@ -0,0 +1,72 @@
# Model Server Protocol for Gateway API Inference Extension

## Inference API Protocol

The model server MUST implement OpenAI’s [Completions](https://platform.openai.com/docs/api-reference/completions)
and [Chat](https://platform.openai.com/docs/api-reference/chat) APIs. We are open to supporting
additional API protocols in the future.

<details>
<summary>Why?</summary>
The extension makes intelligent request-scheduling decisions based on information in the
request body, such as the `model` field.
</details>
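
For illustration, a [Chat](https://platform.openai.com/docs/api-reference/chat) request carries the
`model` field in its JSON body. The endpoint path and field values below are only an example and are
not mandated by this protocol:

```
POST ${server_endpoint}/v1/chat/completions
```

```
{
  "model": "my-lora-adapter",
  "messages": [
    {"role": "user", "content": "What is the capital of France?"}
  ]
}
```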

## Metrics Reporting

The inference extension scrapes metrics from the model servers to make optimal request-scheduling
decisions. The model servers MUST expose the following metrics via a Prometheus endpoint. While
metric names may differ across model servers, the metric types MUST be the same.
We will align with the
[model server metrics standardization](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
effort.

We also list the corresponding vLLM metrics, as vLLM is already integrated with the inference
extension. We will add metrics for other model servers as they are integrated.

| Metric | Type | Description | vLLM metric |
| ----- | ---- | ---- | ---- |
| TotalQueuedRequests | Gauge | The current total number of requests in the queue.| `vllm:num_requests_waiting`|
| KVCacheUtilization| Gauge | The current KV cache utilization as a percentage.| `vllm:gpu_cache_usage_perc`|
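
For illustration, the corresponding vLLM metrics might appear on the Prometheus endpoint roughly as
follows (the label set and values are illustrative, not mandated by this protocol):

```
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-2-7b-hf"} 3.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-2-7b-hf"} 0.27
```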

## [Experimental] LoRA Adapter Serving

Model servers that support dynamic LoRA serving can benefit from the inference extension's LoRA
affinity algorithm. Because dynamic LoRA serving is new and still evolving, this part of the protocol
is considered experimental and subject to change in future releases.

The inference extension expects the following behavior from compatible model servers.

* Support running multiple LoRA adapters in parallel in the same decode batch.
* Dynamically load adapters into GPU memory from a cache (e.g., in host memory) and unload them back,
  depending on which adapters are requested in the current batch.

The model server MUST expose the following LoRA adapter information via a RESTful API that returns JSON:

* `Config`
  * `LoRAEnabled`: boolean, whether dynamic LoRA serving is enabled.
  * `MaxActiveAdapters`: integer, the maximum number of adapters that can be loaded into GPU memory to
    serve a batch. Requests will be queued if the model server has reached `MaxActiveAdapters` and cannot
    load the requested adapter.
* `State`
  * `ActiveAdapters`: List[string], a list of adapters that are currently loaded in GPU memory and ready
    to serve requests.

This is an example API endpoint and response:
```
GET ${server_endpoint}/adapters/info
```

```
{
  "config": {
    "enabled": true,
    "maxActiveAdapters": 4
  },
  "state": {
    "activeAdapters": ["adapter1", "adapter2"]
  }
}
```

NOTE: As of vLLM v0.6.6, LoRA info is exposed in the `vllm:lora_requests_info` metric, where
`MaxActiveAdapters` is exposed as a string label `max_lora`, and `ActiveAdapters` as a comma-separated
string label `running_lora_adapters`.
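
For reference, such a metric line might look roughly like the following (the label values and sample
value are illustrative; consult the vLLM documentation for the exact exposition format):

```
vllm:lora_requests_info{max_lora="4",running_lora_adapters="adapter1,adapter2"} 1.0
```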