# LLM Instance Gateway

<!-- toc -->

- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [Gateway](#gateway)
  - [CRDs](#crds)
  - [Envoy Solution](#envoy-solution)
  - [Model Server Protocol](#model-server-protocol)
- [PoC Design Details](#poc-design-details)
  - [Overview](#overview)
  - [Request Flow](#request-flow)
  - [Pod selection algorithm in PoC](#pod-selection-algorithm-in-poc)
  - [Artifacts](#artifacts)

<!-- /toc -->

## Summary

As presented in the [demo](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458) and building further upon the [joint proposal](https://docs.google.com/document/d/1BkwDlgFxSKKPHhM9kS28CdDIyJ3Xkdue3Iw1INaUkGw/edit?tab=t.0#heading=h.ajlsibmfh8wr), we propose a gateway focused on multiplexing use cases upon shared hardware. Such a gateway has distinct advantages in enabling efficient and fair use of multiple use cases over a shared pool of compute.

## Motivation

Novel advancements in fine-tuning like [LoRA](https://arxiv.org/abs/2106.09685) and [Multi-LoRA](https://arxiv.org/abs/2310.18547) have enabled multiple distinct use cases to share accelerators. As this new technology is adopted, the Day 1/Day 2 operational concerns quickly become necessary to address.

Kubernetes has long been a standard for easing and automating the operational
tasks of workloads. A mechanism (gateway) within the K8s ecosystem is a
reasonable, and expected, way for a user to support multiple LLM use cases on
shared accelerators.

### Goals

#### Proposal Goals

- Create an Inference Gateway project group for wg-serving collaboration,
  including: chat channel & dedicated repo (sponsored by sig-network)

#### Gateway Goals

- Fast reconfiguration - New use cases (including LoRA adapters or client
  configuration) can be rolled out / back to clients in seconds, without
  waiting for a new model server to start.
- Efficient accelerator sharing - Use cases can use less than an accelerator,
  or temporarily burst, without needing to start a new model server, leading to
  fewer wasted accelerators and better pooling of shared capacity.
- Operational resilience - Use cases share available accelerators fairly and
  can have distinct priorities, latency objectives, and failure policies.
- Standardized LoRA - Simple recommended patterns for deploying and loading
  LoRA adapters into model servers on a wide range of Kubernetes environments.
- Composability - The approach should be composable with:
  - the K8s Gateway API
  - other gateway features and projects, including high-level LLM gateways
  - existing deployment tools like kserve or kaito
  - different model servers

### Non-Goals

#### Proposal Non-Goals

- Creation of a fully realized KEP

#### Gateway Non-Goals

- Replacing the features of pre-existing Gateways
- Defining how serving workloads must be deployed

## Proposal

### Gateway

#### CRD(s)

To adequately achieve the above goals, we propose the addition of one or more
CRDs to express:

- The boundaries of a compute pool that shares a base model
  - Including the deployment of a routing solution (PoC details below)
- A specific use case upon one or more backend pools
  - The objectives that this use case needs to achieve

The example API we showed in our demo looked like:

```yaml
kind: LLMRoute
apiVersion: inference.x-k8s.io/v1alpha1
metadata:
  name: assistant
spec:
  parentRefs:
  - name: ai-gw
  backendRefs:
  - name: assistant
  adapter:
    name: sentiment
  priority: 100
  objectives:
  - type: OutputTokenLatency
    latency:
      value: 2s
      quantile:
        numerator: 99
  metrics:
    format: Prometheus
```

#### Envoy Solution

Any gateway solution *must* be compatible with Envoy Proxy, and must have a plan
for how to integrate these features into the Envoy ecosystem over the long term.

#### Model Server Protocol

In the PoC investigation we discovered the need for certain control and data to
be exposed by the model server. For a model server to work properly with this
LLM Instance Gateway, it would need to implement this protocol.

Key requirements would roughly look like:

- A method, or set of methods, to dynamically update the available LoRA catalog
  on a model server
- Metrics shared via a networking-friendly mechanism (a response header, or
  some other lightweight channel, just not in the body), for data like:
  - Adapter state
  - Available catalog
  - Queue data (per adapter)

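As a concrete illustration of the second requirement, a gateway could parse a lightweight `key=value` metrics header from each model server response. The header name and format below are assumptions for illustration only; the actual wire format (e.g. ORCA-style headers) is still to be decided by this protocol.

```python
# Hypothetical sketch: parse per-adapter queue metrics from a response header
# such as "active_adapters=2,queue.sentiment=5". The format is an assumption,
# not part of any agreed protocol.
def parse_metrics_header(value: str) -> dict:
    """Parse comma-separated key=value pairs into a dict of floats."""
    metrics = {}
    for pair in value.split(","):
        key, _, raw = pair.strip().partition("=")
        if key:
            # Missing values are recorded as None rather than guessed.
            metrics[key] = float(raw) if raw else None
    return metrics
```

Keeping the payload in a header (rather than the body) lets the routing layer read it without buffering or mutating the model response.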
## PoC Design Details

From the proof of concept, we believe the following architecture is a good
starting point for this proposal:

- Envoy Proxy
  - An OSS starting point that is generally accepted and used
- Ext proc
  - A necessary tool to extend the capabilities of Envoy to allow routing
    based on the OpenAI `model` field (within the request body)
  - An agile tool for developing novel LLM Instance Gateway features
- CRD/K8s API interface
- Model server modifications
  - Necessary to extend existing tooling to provide the proper routing data to
    Envoy
  - Potentially extended further to support
    [ORCA](https://github.com/envoyproxy/envoy/issues/6614) headers as a
    method of metrics transfer

### Overview

Our very high-level diagram of how this looked:

To briefly describe how the components work together:

- When an `LLMRoute` is defined, our gateway recognizes this new service and
  allows traffic for the specified adapter to be admitted to the backend pool.
  - We support and expect the OpenAI API spec as the default when reading the
    adapter.
- Incoming traffic for a validated service is then routed to ext proc, where
  routing and fairness decisions are made.
- We attempt to route to a model server that already has the adapter loaded,
  so long as there is batch capacity.

### Request Flow

Below is an example of the life of a request using this described design:

> Notes:
>
> 1. Ext Proc: External processing calls an external gRPC service to
>    process HTTP requests and responses.
>
> 2. Original Dst: The original destination cluster can be used when incoming
>    connections are redirected to Envoy either via an iptables REDIRECT or
>    TPROXY target, or with Proxy Protocol. In these cases, requests routed to
>    an original destination cluster are forwarded to upstream hosts as
>    addressed by the redirection metadata, without any explicit host
>    configuration or upstream host discovery. We implemented this using the
>    bootstrap feature of Envoy Gateway.

### Pod selection algorithm in PoC

Metrics stored in the ext proc cache:

- Active adapters in each pod
- Number of pending requests for each adapter in each pod

Given a request, read the relevant metrics from the cache and find which pods
have the requested LoRA adapter loaded:

1. Out of the set of pods that have the LoRA adapter loaded and whose pending
   request count for that adapter is below a threshold, pick the one with the
   most pending requests (we pick the most to prevent flip-flopping between
   pods).
2. If no pods satisfy step 1, pick a pod with (in the following priority):
   1. Least number of active adapters
   2. Least total pending requests

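The two-step heuristic above can be sketched as follows. This is a simplified illustration of the PoC behavior, not the PoC code itself; the names (`PodMetrics`, `select_pod`) and the threshold value are our own assumptions.

```python
from dataclasses import dataclass, field

QUEUE_THRESHOLD = 10  # hypothetical per-adapter pending-request limit

@dataclass
class PodMetrics:
    name: str
    # adapter name -> pending requests for that adapter on this pod
    pending: dict = field(default_factory=dict)

    @property
    def total_pending(self) -> int:
        return sum(self.pending.values())

def select_pod(pods: list, adapter: str) -> PodMetrics:
    # Step 1: among pods with the adapter loaded and under the threshold,
    # pick the MOST loaded one to avoid flip-flopping between pods.
    candidates = [
        p for p in pods
        if adapter in p.pending and p.pending[adapter] < QUEUE_THRESHOLD
    ]
    if candidates:
        return max(candidates, key=lambda p: p.pending[adapter])
    # Step 2: fall back to fewest active adapters, then least total pending.
    return min(pods, key=lambda p: (len(p.pending), p.total_pending))
```

Packing traffic onto the busiest eligible pod keeps adapters warm on fewer pods, freeing the rest of the pool for other adapters.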
### Artifacts

- [Ext-proc/Envoy/Benchmarking repo](https://github.com/tomatillo-and-multiverse/lora-inference-gateway)
  - The repo we used to develop the ext proc image used in the PoC
  - Also contains the manifests required to deploy the gateway
- [vLLM fork](https://github.com/kaushikmitr/vllm)
- Presentation:
  - [Slides](https://docs.google.com/presentation/d/1I1XDf6fQQEtHxJtZxFdIaUcUA3lLBC7neW823diWS78/edit?usp=sharing)
  - [Recording](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458)
  - [PoC Design & Experimentation data](https://docs.google.com/document/d/17wB0BgeV8JrGtccxZqkOqFyNC4gPBNqdKg8Oe9xMkio/edit#heading=h.eeeqp85g68qy)