diff --git a/docs/proposals/002-api-proposal/glossary.md b/docs/proposals/002-api-proposal/glossary.md
deleted file mode 100644
index 33e0e6175..000000000
--- a/docs/proposals/002-api-proposal/glossary.md
+++ /dev/null
@@ -1,94 +0,0 @@
-# Glossary
-
-This is a glossary that attempts to more thoroughly explain terms used within the api proposal, in an effort to give context to API decisions.
-
-
-- [API Terms](#api)
-  - [LLMServerPool](#llmserverpool)
-  - [LLMService](#llmservice)
-- [Capacity Constrained Routing](#capacity-constrained-routing)
-  - [Priority](#priority)
-  - [Fairness](#fairness)
-- [General Routing](#general-routing)
-  - [Latency Based Routing](#latency-based-routing)
-  - [Lora Affinity](#lora-affinity)
-
-
-
-
-## API
-This is a very brief description of terms used to describe API objects, included for completeness.
-
-### LLMServerPool
-A grouping of model servers that serve the same set of fine-tunes (LoRA as a primary example).
-
-Shortened to: `LSP`
-
-### LLMService
-An LLM workload that is defined and runs on a LLMServerPool with other use cases.
-
-# Capacity Constrained Routing
-
-## Priority
-
-### Summary
-Priority specifies the importance of a LLMService relative to other services within a LLMServerPool.
-
-### Description
-
-For our purposes, priority can be thought of in two classes:
-- Critical
-- Non-Critical
-
-The primary difference is that non-critical LLMService requests will be rejected in favor of Critical LLMServices the face of resource scarcity.
-
-Example:
-
-Your current request load is using 80 Arbitrary Compute Units(ACU) of your pools total of 100ACU capacity. 40ACU are critical workload requests, 40 are non-critical. If you were to lose 30 ACU due to an unforseen outage. Priority would dictate that of the 10 surplus ACU to be rejected, the entirety of them would be from the _non-critical_ requests.
-
-## Fairness
-
-### Summary
-Fairness specifies how resources are shared among different LLMServices, in a way that is most acceptable to the user.
-
-### Description
-
-Fairness, like priority, is only used in resource scarcity events.
-
-Fairness is utilized when requests of the same priority class need to be rejected, or queued. There are many dimensions that could be considered when considering shared resources. To name a few:
-- KV-cache utilization
-- Total request count
-- SLO adherence
-
-For the v1 MVP, the only objective a User can specify is the SLO objective they would like to meet. So, in following that pattern, fairness in MVP will simply be considered for SLO adherence. SLO Adherence is only being considered over a rolling time window of data.
-
-The TTL we are currently assuming is: `5 min`
-
-### Example
-
-**Assumption:** Services have equally weighted fairness for this example.
-
-- Service A has been meeting its SLO 98% of the requests made in the time window, and Service B has met the SLO 94% of the time.
-
-- A request for both Service A and Service B come in at the same time, and there is only capacity to start a single new request in the LSP, this capacity would meet the SLO for both services. The other request would be queued (potentially causing that request to not meet SLO).
-
-- To fairly share these resources. Service B *must* be selected to begin the request immediately as Service A has had its SLO met a larger percentage of the time.
-
-# General Routing
-Different from the previous definitons, these terms are used to describe methods of routing that are constant, and seek to better utilize compute resources to avoid capacity constraints as much as possible.
-
-## Latency Based Routing
-
-### Summary
-Latency Based Routing uses data to ensure LLMServices meet their specified SLO.
-
-### Description
-Data collected from the model servers and data collected from the request is used to predict the time a request will take on a *specific* model server, and route in a way that will best satisfy the SLO of the incoming requests.
-
-## Lora Affinity
-
-### Summary
-LoRA Affinity describes the routing strategy displayed in the [demo](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458), to better utilize Model Servers within the LSP.
-
-### Description
-Model Servers that support multi-LoRA handle requests in a FCFS basis. By utilizing the data provided by the model server (the state of loaded LoRA adapters), a routing system can route requests for a given LoRA adapter, to a model server that already has that adapter loaded, to create larger batches than a naive route, which better utilizes the model server hardware.
\ No newline at end of file
diff --git a/docs/proposals/002-api-proposal/proposal.md b/docs/proposals/002-api-proposal/proposal.md
index 988ee3102..8f406d867 100644
--- a/docs/proposals/002-api-proposal/proposal.md
+++ b/docs/proposals/002-api-proposal/proposal.md
@@ -1,5 +1,5 @@
-# LLM Instance Gateway
+# Gateway API Inference Extension
 
 ## Proposal Status
 ***Draft***
 
@@ -14,10 +14,10 @@
   - [Proposal](#proposal)
     - [Personas](#personas)
      - [Inference Platform Admin](#inference-platform-admin)
-     - [LLM Service Owner](#llm-service-owner)
+     - [Inference Workload Owner](#inference-workload-owner)
    - [Axioms](#axioms)
-   - [LLMServerPool](#llmserverpool)
-   - [LLMService](#llmservice)
+   - [InferencePool](#inferencepool)
+   - [InferenceModel](#inferencemodel)
    - [Spec](#spec)
    - [Diagrams](#diagrams)
    - [Alternatives](#alternatives)
@@ -28,13 +28,12 @@
 
 ## Summary
 
-This proposal presents 2 new CRD objects to express the needs of the LLM Instance Gateway. **LLMServerPool** and **LLMService** (names up for debate). The LLMServerPool is the logical grouping of compute, owned by the Inference Platform Admin persona. While the LLMService defines the serving objectives of a specific model or LoRA adapter, and is owned by the LLM Service Owner.
+This proposal presents 2 new CRD objects to express the needs of the Gateway API Inference Extension: **InferencePool** and **InferenceModel**. The InferencePool is the logical grouping of compute, owned by the Inference Platform Admin persona, while the InferenceModel defines the serving objectives of a specific model or LoRA adapter, and is owned by the Inference Workload Owner.
 
-**NOTE: Some routing terms are defined in the [glossary](./glossary.md) file, to more deeply describe how we will handle behaviors like priority and fairness**
 
 ## Goals
 
-- Drive concensus on direction of LLM Instance Gateway Solution
+- Drive consensus on the direction of the Gateway API Inference Extension solution
 - Documentation of API decisions for posterity
 
 ## Non-Goals
@@ -58,10 +57,10 @@ The Inference Platform Admin creates and manages the infrastructure necessary to
 - Gateway configuration
 - etc
 
-#### LLM Service Owner
+#### Inference Workload Owner
 
-An LLM Service Owner persona owns and manages 1 or many Generative AI Workloads (LLM focused *currently*). This includes:
-- Defining SLO
+An Inference Workload Owner persona owns and manages 1 or many Generative AI Workloads (LLM focused *currently*). This includes:
+- Defining criticality
 - Managing fine-tunes
   - LoRA Adapters
   - System Prompts
@@ -80,101 +79,122 @@ The API design is based on these axioms:
 - The MVP will heavily assume requests are done using the OpenAI spec, but open to extension in the future
 - The Gateway should route in a way that does not generate a queue of requests at the model server level
 
-The [PoC](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458) was focused on lower-level scheduling. And the API follows that similar logic, which lead to the proposal of the **LLMServerPool**.
+The [PoC](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458) was focused on lower-level scheduling, and the API follows that same logic, which led to the proposal of the **InferencePool**.
 
-### LLMServerPool
+### InferencePool
 
-The LLMServerPool at its core is a logical grouping of compute, expressed in the form of Pods (typically model servers), akin to a K8s Service. The LLMServerPool would deploy its own routing, and offer administrative configuration to the Platform Admin.
+The InferencePool at its core is a logical grouping of compute, expressed in the form of Pods (typically model servers), akin to a K8s Service. The InferencePool would deploy its own routing, and offer administrative configuration to the Platform Admin.
 
- It is expected for the LLMServerPool to:
-  - Enforce fair consumption of resources across competing services
+ It is expected for the InferencePool to:
+  - Enforce fair consumption of resources across competing workloads
   - Efficiently route requests across shared compute (as displayed by the PoC)
 
-It is _not_ expected for the LLMServerPool to:
+It is _not_ expected for the InferencePool to:
 - Enforce any common set of adapters or base models are available on the Pods
 - Manage Deployments of Pods within the Pool
 - Manage Pod lifecycle of pods within the pool
 
-Additionally, any Pod that seeks to join a LLMServerPool would need to support a protocol, defined by LLM Instance Gateway, to ensure the Pool has adequate information to intelligently route requests.
+Additionally, any Pod that seeks to join an InferencePool would need to support a protocol, defined by this project, to ensure the Pool has adequate information to intelligently route requests.
 
-### LLMService
+### InferenceModel
 
-A LLMService allows the LLM Service Owner to define:
-- Which LoRA adapter(s) to consume
-  - LLMService allows for traffic splitting between adapters _in the same LLMServerPool_ to allow for new LoRA adapter versions to be easily rolled out
-- SLO objectives for the LLMService
-- The Pools this LLMService is relevant to
+An InferenceModel allows the Inference Workload Owner to define:
+- Which Model/LoRA adapter(s) to consume.
+  - Mapping from a client facing model name to the target model name in the InferencePool.
+  - InferenceModel allows for traffic splitting between adapters _in the same InferencePool_ to allow for new LoRA adapter versions to be easily rolled out.
+- Criticality of the requests to the InferenceModel.
+- The InferencePool this InferenceModel is relevant to.
 
 ### Spec
 
-**LLMService**
+**InferencePool**
 ```golang
-// LLMService represents a set of LLM services that are multiplexed onto one
-// or more backend pools. This resource is managed by the "LLM Service Owner"
-// persona. The Service Owner persona is: a team that trains, verifies, and
+// The InferencePool is a construct for pooling compute (often model servers) to
+// serve large models, that have the ability to share capacity across multiple
+// services (such as through prompt engineering, LoRA adapters, etc).
+// InferencePools have a dependency on a Gateway that is compatible with ext-proc
+// (External Processing). When a new InferencePool object is created, a new ext proc
+// deployment is created. InferencePools require at minimum a single InferenceModel to
+// be subscribed to them to accept traffic; any traffic with a model not
+// defined within an InferenceModel will be rejected.
+type InferencePool struct {
+	metav1.ObjectMeta
+	metav1.TypeMeta
+
+	Spec InferencePoolSpec
+}
+
+type InferencePoolSpec struct {
+	// ModelServerSelector uses label selection to watch model server pods
+	// that should be included in the InferencePool. ModelServers should not
+	// be shared with any other Service or InferencePool; that behavior is not
+	// supported and will result in sub-optimal utilization.
+	ModelServerSelector map[string]string `json:"modelServerSelector,omitempty"`
+}
+```
+
+**InferenceModel**
+```golang
+// InferenceModel represents a set of Models/Adapters that are multiplexed onto one
+// or more InferencePools. This resource is managed by the "Inference Workload Owner"
+// persona. The Inference Workload Owner persona is: a team that trains, verifies, and
 // leverages a large language model from a model frontend, drives the lifecycle
 // and rollout of new versions of those models, and defines the specific
-// performance and latency goals for the model. These services are
-// expected to operate within a LLMServerPool sharing compute capacity with other
-// LLMServices, defined by the Inference Platform Admin. We allow a user who
-// has multiple LLMServices across multiple pools (with the same config) to
+// performance and latency goals for the model. These workloads are
+// expected to operate within an InferencePool sharing compute capacity with other
+// InferenceModels, defined by the Inference Platform Admin. We allow a user who
+// has multiple InferenceModels across multiple pools (with the same config) to
 // specify the configuration exactly once, and deploy to many pools
 // simultaneously. Enabling a simpler config and single source of truth
-// for a given user. LLMService names are unique for a given LLMServerPool,
-// if the name is reused, an error will be shown on the status of a
-// LLMService that attempted to reuse. The oldest LLMService, based on
-// creation timestamp, will be selected to remain valid. In the event of a race
-// condition, one will be selected at random.
-type LLMService struct {
+// for a given user. InferenceModel ModelNames are unique for a given InferencePool.
+type InferenceModel struct {
 	metav1.ObjectMeta
 	metav1.TypeMeta
 
-	Spec LLMServiceSpec
-}
-
-type LLMServiceSpec struct {
-	// Defines the distinct services.
-	// Model can be in 2 priority classes, Critical and Noncritical.
-	// Priority class is implicitly set to Critical by specifying an Objective.
-	// Otherwise the Model is considered Noncritical.
-	Models []Model
-	// Reference to the backend pools that the services registers to.
-	PoolRef []corev1.ObjectReference
+	Spec InferenceModelSpec
 }
 
-// Model defines the policies for routing the traffic of a use case, this includes performance objectives
-// and traffic splitting between different versions of the model.
-type Model struct {
+type InferenceModelSpec struct {
 	// The name of the model as the users set in the "model" parameter in the requests.
-	// The name should be unique among the services that reference the same backend pool.
+	// The name should be unique among the workloads that reference the same backend pool.
 	// This is the parameter that will be used to match the request with. In the future, we may
 	// allow to match on other request parameters. The other approach to support matching on
 	// on other request parameters is to use a different ModelName per HTTPFilter.
 	// Names can be reserved without implementing an actual model in the pool.
 	// This can be done by specifying a target model and setting the weight to zero,
 	// an error will be returned specifying that no valid target model is found.
-	Name string
+	ModelName string
 	// Optional
-	// LLM Services with an objective have higher priority than services without.
-	// IMPORTANT: By specifying an objective, this places the LLMService in a higher priority class than LLMServices without a defined priority class.
-	// In the face of resource-scarcity. Higher priority requests will be preserved, and lower priority class requests will be rejected.
-	Objective *Objective
+	// Defines how important it is to serve the model compared to other models referencing the same pool.
+	Criticality *Criticality
 	// Optional.
-	// Allow multiple versions of a model for traffic splitting.
-	// If not specified, the target model name is defaulted to the modelName parameter.
-	// modelName is often in reference to a LoRA adapter.
+	// Allow multiple versions of a model for traffic splitting.
+	// If not specified, the target model name is defaulted to the ModelName parameter.
+	// ModelName is often in reference to a LoRA adapter.
 	TargetModels []TargetModel
+	// Reference to the InferencePool that the model registers to. It must exist in the same namespace.
+	PoolReference *LocalObjectReference
 }
-
+// Defines how important it is to serve the model compared to other models.
+type Criticality string
+const (
+	// Most important. Requests to this band will be shed last.
+	Critical Criticality = "Critical"
+	// More important than Sheddable, less important than Critical.
+	// Requests in this band will be shed before critical traffic.
+	Default Criticality = "Default"
+	// Least important. Requests to this band will be shed before all other bands.
+	Sheddable Criticality = "Sheddable"
+)
 // TargetModel represents a deployed model or a LoRA adapter. The
 // Name field is expected to match the name of the LoRA adapter
-// (or base model) as it is registered within the model server. Inference
-// Gateway assumes that the model exists on the model server and is the
+// (or base model) as it is registered within the model server. This
+// assumes that the model exists on the model server and it is the
 // responsibility of the user to validate a correct match. Should a model fail
-// to exist at request time, the error is processed by the Instance Gateway,
-// and then emitted on the appropriate LLMService object.
+// to exist at request time, the error is processed by the extension,
+// and then emitted on the appropriate InferenceModel object status.
 type TargetModel struct {
 	// The name of the adapter as expected by the ModelServer.
 	Name string
@@ -183,161 +203,91 @@ type TargetModel struct {
 	Weight int
 }
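+
+// Illustrative sketch only, not part of the proposed API: one possible way a
+// routing extension could resolve a request to a single TargetModel using the
+// Weight field above. It assumes at least one TargetModel is present, weights
+// are non-negative, and the summed weight is positive; pick(n) is expected to
+// return a uniform random integer in [0, n), e.g. rand.Intn.
+func pickTargetModel(models []TargetModel, pick func(n int) int) TargetModel {
+	total := 0
+	for _, m := range models {
+		total += m.Weight
+	}
+	n := pick(total)
+	for _, m := range models {
+		if n < m.Weight {
+			return m
+		}
+		n -= m.Weight
+	}
+	return models[len(models)-1]
+}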
 
-// Objective captures the latency SLO of a LLM service.
-// In MVP, meeting the SLO is on a best effort basis.
-// Future: Extend the API for different behaviors of meeting the SLO.
-// The gateway will perform best-effort load balancing, and work with other components (e.g., autoscaler) to meet the
-// objectives.
-type Objective struct {
-	// The AverageLatencyPerOutputToken is calculated as the e2e request latency divided by output token
-	// length. Note that this is different from what is known as TPOT (time per output token) which only
-	// takes decode time into account.
-	// The P95 is calculated over a fixed time window defined at the operator level.
-	DesiredAveragePerOutputTokenLatencyAtP95OverMultipleRequests
-	*time.Duration
-}
-```
+// LocalObjectReference identifies an API object within the namespace of the
+// referrer.
+type LocalObjectReference struct {
+	// Group is the group of the referent.
+	Group Group
 
-**LLMServerPool**
-```golang
-// The LLMServerPool is a construct for pooling compute (often model servers) to
-// serve large models, that have the ability to share capacity across multiple
-// services (such as through prompt engineering, LoRA adapters, etc).
-// LLMServerPools have a dependency on a Gateway that is compatible with ext-proc
-// (External Processing). When a new LSP object is created, a new ext proc
-// deployment is created. LLMServerPools require at minimum a single LLMService to
-// be subscribed to them to accept traffic, any traffic with a model not
-// definied within a LLMService will be rejected.
-type LLMServerPool struct {
-	metav1.ObjectMeta
-	metav1.TypeMeta
+	// Kind is kind of the referent. For example "InferencePool".
+	Kind Kind
 
-	Spec LLMServerPoolSpec
+	// Name is the name of the referent.
+	Name ObjectName
 }
 
-type LLMServerPoolSpec struct {
-	// ModelServerSelector uses label selection to watch model server pods
-	// that should be included in the LLMServerPool. ModelServers should not
-	// be with any other Service or LLMServerPool, that behavior is not supported
-	// and will result in sub-optimal utilization.
-	ModelServerSelector map[string]string `json:"modelServerSelector,omitempty"`
-}
 ```
 
 ### Yaml Examples
 
-#### LLMServerPool(s)
-Here we create 2 LSPs that subscribe to services to collect the appropriate pods
+#### InferencePool
+Here we create a pool that selects the appropriate pods:
 ```yaml
 apiVersion: inference.x-k8s.io/v1alpha1
-kind: LLMServerPool
+kind: InferencePool
 metadata:
-  name: llama-2-pool
-  services:
-  - llama-2-vllm
----
-apiVersion: inference.x-k8s.io/v1alpha1
-kind: LLMServerPool
-metadata:
-  name: gemini-pool
-  services:
-  - gemini-jetstream-tpu-v5e
-  - gemini-vllm-a100
+  name: base-model-pool
+spec:
+  modelServerSelector:
+    app: llm-server
 ```
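+
+For illustration, the `modelServerSelector` above would match model server Pods labeled accordingly. A minimal, hypothetical Pod manifest (the Pod name and image are placeholders, not part of this proposal):
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: llm-server-0
+  labels:
+    app: llm-server
+spec:
+  containers:
+  - name: model-server
+    image: example.com/model-server:latest # placeholder image
+```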
 
-#### LLMService
+#### InferenceModel
 
-Here we consume both pools with a single LLMService, while also specifying 2 LLMServices. Where `sql-code-assist` is both the name of the ModelLLMService, and the name of the LoRA adapter on the model server. And `npc-bot` has a layer of indirection for those names, as well as a specified objective. Both `sql-code-assist` and `npc-bot` have available LoRA adapters on both LLMServerPools and routing to each LLMServerPool happens earlier(at the K8s Gateway). So traffic splitting between separate pools happens at the K8s Gateway.
+Here we consume the pool with two InferenceModels, where `sql-code-assist` is both the name of the model and the name of the LoRA adapter on the model server, and `npc-bot` has a layer of indirection for those names, as well as a specified criticality. Both `sql-code-assist` and `npc-bot` have available LoRA adapters on the InferencePool, and routing to each InferencePool happens earlier (at the K8s Gateway).
 
 ```yaml
 apiVersion: inference.x-k8s.io/v1alpha1
-kind: LLMService
+kind: InferenceModel
+metadata:
+  name: sql-code-assist
+spec:
+  modelName: sql-code-assist
+  poolRef: base-model-pool
+---
+apiVersion: inference.x-k8s.io/v1alpha1
+kind: InferenceModel
 metadata:
-  name: my-llm-service
+  name: npc-bot
 spec:
-  LLMServices:
-  - modelName: sql-code-assist
-  - modelName: npc-bot
-    objective:
-      desiredAveragePerOutputTokenLatencyAtP95OverMultipleRequests: 50ms
-    targetModels:
-      targetModelName: npc-bot-v1
-        weight: 50
-      targetModelName: npc-bot-v2
-        weight: 50
-  poolRef:
-  - name: llama-2-pool
-  - name: gemini-pool
+  modelName: npc-bot
+  criticality: Critical
+  targetModels:
+  - name: npc-bot-v1
+    weight: 50
+  - name: npc-bot-v2
+    weight: 50
+  poolRef: base-model-pool
 ```
 
-### Diagrams
-
-Much of this is better explained visually:
-
-Below is a detailed view of the LLMServerPool
-
-![LLMServerPool](./images/lsp.svg)
-
-This diagram lightly follows the example request for a model `name-generator`.
-The flow can be described as:
-- The request comes in to our routing solution(Ext-Proc)
-- ExtProc looks up the LLMServices affiliated with this pool `examplePool`
-- `name-generator` is currently undergoing a change of LoRA adapters from `name-generator-v3` (20% traffic split) to `name-generator-v2` (80% traffic split)
-- `name-generator-v2` is selected as the LoRA adapter, and replaces `name-generator` in the body of the request (mutated by ext-proc)
-- the request is then efficiently scheduled onto one of the valid Pods
-- Prometheus metrics are sent back to the LSP, aggregated and re-emitted via sidecar (following the metric standardization)
-
-How Multiple LLMServerPools might integrate together:
-
-![K8s Gateway with LLMServerPools](./images/gw_w_lsp.svg)
-
-Here we see that we can have:
-- Multiple Routes pointing to the same pool
-- Routes splitting traffic across multiple pools
-
-The functionality of the Kubernetes Gateway is unchanged with this proposal, allowing seamless integration with the LLMServerPool.
-
 ### Alternatives
 
 #### Key Decisions
 
 Our alternatives hinge on some key decisions:
-- Allowing HTTPRoute to treat the LLMServerPool as the backendRef
-  - Whereas the alternatives might have the LLMService as the backend ref
+- Allowing HTTPRoute to treat the InferencePool as the backendRef (sketched below)
+  - Whereas the alternatives might have the InferenceModel as the backend ref
 - Creating a separate layer of abstraction, instead of extending HTTPRoute
   - Explained in more detail in the LLMRoute section
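+
+As a rough sketch of the first key decision, an HTTPRoute could reference an InferencePool directly as a backendRef. The route, Gateway, and pool names below are placeholders, and the exact group/kind wiring is an assumption for illustration rather than part of this proposal's spec:
+
+```yaml
+apiVersion: gateway.networking.k8s.io/v1
+kind: HTTPRoute
+metadata:
+  name: llm-route
+spec:
+  parentRefs:
+  - name: inference-gateway
+  rules:
+  - backendRefs:
+    - group: inference.x-k8s.io
+      kind: InferencePool
+      name: base-model-pool
+```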
 
-#### LLMService as a backend ref
+#### InferenceModel as a backend ref
 
-We toyed with the idea of allowing an LLMService be the target of an HTTPRouteRules backend ref. However, doing so would require the Kubernetes Gateway to be able to interpret body level parameters (assuming OpenAI protocol continues to require the model param in the body), and require that the HTTPRoute also specify the backend the LLMService is intended to run on. Since we our primary proposal already specifies the backend, packing this functionality would require substantial work on the Kubernetes Gateway, while not providing much flexibility.
+We toyed with the idea of allowing an InferenceModel to be the target of an HTTPRouteRules backend ref. However, doing so would require the Kubernetes Gateway to be able to interpret body level parameters (assuming OpenAI protocol continues to require the model param in the body), and require that the HTTPRoute also specify the backend the InferenceModel is intended to run on. Since our primary proposal already specifies the backend, packing this functionality would require substantial work on the Kubernetes Gateway, while not providing much flexibility.
 
 #### LLMRoute
 
-Our original idea was to define all LLMService config at the Kubernetes Gateway layer, and have no LLMServerPool. This is inherently challenging, as LLMRoute would become a superset of HTTPRoute, or the Gateway would become bespoke, and work only for the LLMRoute use case.
+Our original idea was to define all InferenceModel config at the Kubernetes Gateway layer, and have no InferencePool. This is inherently challenging, as LLMRoute would become a superset of HTTPRoute, or the Gateway would become bespoke, and work only for the LLMRoute use case.
 
 ## FAQ
-- **Why 2 layers of weighting?** (HttpRoute & LLMService)
-  - Feasibly done - No extension of HttpRoute. Just works, as LLMServerPool operates like a service.
+- **Why 2 layers of weighting?** (HttpRoute & InferenceModel)
+  - Feasibly done - No extension of HttpRoute. Just works, as InferencePool operates like a service.
   - Complexity is only expressed during transition states (model version upgrade)
  - Keeps Pools self contained - multiple K8s gateways can direct traffic to the same pool without needing to re-express Pool-level behavior
-- **What is a LSP attempting to define?**
-  - LLMServerPool groups resources that should be shared over the LLMServices that are affiliated with the pool
+- **What is an InferencePool attempting to define?**
+  - InferencePool groups resources that should be shared over the InferenceModels that are affiliated with the pool
   - Best practice would also suggest keeping the same base model for all ModelServers in the pool, but that is not enforced
-- **Can a LLMService reference multiple LSPs?**
 - **How is this deployed?**
   - We will follow [common patterns](https://gateway.envoyproxy.io/docs/tasks/quickstart/#installation) to install the CRDs & Controllers
-- **Are all controllers necessary for this solution going to be provided by Instance Gateway(this repo)?**
+- **Are all controllers necessary for this solution going to be provided by this project?**
   - Yes
-
-
-## Open Questions
-
-- Reasonable defaults (how do we behave in the absence of user-specified values in optional fields)
-  - Should services be required? Or can a customer simply create a pool, and direct requests to the pool, and expect even fairness/priority across the different LoRA adapters that are requested?
-    - If so? How should we handle the mix between explicit and implicit services? Are implicit LLMServices just default everything? (and inherently lower prio).
-    - NOTE: Current thinking is this is yes we should allow non-use case defined requests, but is a security risk if on by default. So pools should opt-in
-- Configuration control
-  - How many routing decisions should we make on behalf of the user vs allow for configuration?
-  - Do we decide that SLO adherence is stricter than Fairness adherence? Do we allow for configuration of such tooling? (would be expressed in the LLMServerPool API)