From ca9a11072d61aaf481601765f15db35e567d324d Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Wed, 9 Apr 2025 11:50:21 -0700 Subject: [PATCH 01/22] Initial guide for inference pool --- site-src/api-types/inferencepool.md | 103 +++++++++++++++++++++++++--- 1 file changed, 93 insertions(+), 10 deletions(-) diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md index baa604b61..b994cdef4 100644 --- a/site-src/api-types/inferencepool.md +++ b/site-src/api-types/inferencepool.md @@ -7,28 +7,111 @@ ## Background -The InferencePool resource is a logical grouping of compute resources, e.g. Pods, that run model servers. The InferencePool would deploy its own routing, and offer administrative configuration to the Platform Admin. +The **InferencePool** Kubernetes custom resource defines a group of Pods (containers) that share the same compute configuration, accelerator type, base language model, and model server. This logically groups and manages your AI model serving resources, which offers administrative configuration to the Platform Admin. It is expected for the InferencePool to: - Enforce fair consumption of resources across competing workloads - - Efficiently route requests across shared compute (as displayed by the PoC) + - Efficiently route requests across shared compute It is _not_ expected for the InferencePool to: - - Enforce any common set of adapters or base models are available on the Pods - - Manage Deployments of Pods within the Pool - - Manage Pod lifecycle of pods within the pool + - Enforce any common set of adapters are available on the Pods + - Manage Deployments of Pods within the pool + - Manage pod lifecycle of Pods within the pool -Additionally, any Pod that seeks to join an InferencePool would need to support a protocol, defined by this project, to ensure the Pool has adequate information to intelligently route requests. +Additionally, any Pod that seeks to join an InferencePool would need to support the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol), defined by this project, to ensure the Pool has adequate information to intelligently route requests. -`InferencePool` has some small overlap with `Service`, displayed here: +## How to Configure an InferencePool + +The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool). + +In summary, the InferencePoolSpec consists of 3 major parts: + +- The `selector` field specifies which Pods belong to this pool. The labels in this selector must exactly match the labels applied to your model server Pods. +- The `targetPortNumber` field defines the port number that the model servers within the pool expect to receive traffic from. +- The `extensionRef` field references the [endpoint picker extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) (EPP) service that monitors key metrics from model servers within the InferencePool and provides intelligent routing decisions. 
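For reference, the Pods that the `selector` matches are typically managed by a Deployment whose Pod template carries the same labels. The sketch below illustrates that relationship for the example pool in the next section; it is illustrative only, and the Deployment name, image, and replica count are assumptions rather than part of this project:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct   # must exactly match the InferencePool selector
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct
    spec:
      containers:
      - name: vllm
        # Assumed model server image; substitute your own model server.
        image: vllm/vllm-openai:latest
        ports:
        - containerPort: 8000   # matches the InferencePool targetPortNumber
```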
+ +### Example Configuration + +Here is an example InferencePool configuration: + +``` +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferencePool +metadata: + labels: + name: vllm-llama3-8b-instruct +spec: + targetPortNumber: 8000 + selector: + app: vllm-llama3-8b-instruct + extensionRef: + name: vllm-llama3-8b-instruct-epp + port: 9002 + failureMode: FailClose +``` + +In this example: +- An InferencePool named `vllm-llama3-8b-instruct` is created in the `default` namespace. +- It will select Pods that have the label `app: vllm-llama3-8b-instruct`. +- Traffic routed to this InferencePool will call out to the EPP service `vllm-llama3-8b-instruct-epp` on port 9002 for making routing decisions. If EPP fails to pick an endpoint, or is not responsive, the request will be dropped. +- Traffic routed to this InferencePool will be forwarded to the port 8000 on the selected Pods. + +## Overlap with Service + +**InferencePool** has some small overlap with **Service**, displayed here: Comparing InferencePool with Service -The InferencePool is _not_ intended to be a mask of the Service object, simply exposing the absolute bare minimum required to allow the Platform Admin to focus less on networking, and more on Pool management. +The InferencePool is not intended to be a mask of the Service object. It provides a specialized abstraction tailored for managing and routing traffic to groups of LLM model servers, allowing Platform Admins to focus on pool-level management rather than low-level networking details. + +## Replacing an InferencePool + +This section outlines how to perform gradual rollouts for updating base models by leveraging new InferencePools and traffic splitting using **HTTPRoute** resources. This approach minimizes service disruption and allows for safe rollbacks. + +To rollout a new base model: + +1. **Deploy new infrastructure**: Create new nodes and a new InferencePool configured with the new base model that you chose. +1. **Configure traffic distribution**: Use an HTTPRoute to split traffic between the existing InferencePool (which uses the old base model) and the new InferencePool (using the new base model). The `backendRefs.weight` field controls the traffic percentage allocated to each pool. +1. **Maintain InferenceModel integrity**: Keep your InferenceModel configuration unchanged. This ensures that the system applies the same LoRA adapters consistently across both base model versions. +1. **Preserve rollback capability**: Retain the original nodes and InferencePool during the roll out to facilitate a rollback if necessary. + +### Example + +You start with an existing lnferencePool named `llm-pool`. To replace the base model, you create a new InferencePool named `llm-pool-version-2`. This pool deploys a new version of the base model on a new set of nodes. By configuring an **HTTPRoute**, as shown below, you can incrementally split traffic between the original llm-pool and llm-pool-version-2. This lets you control base model updates in your cluster. + +1. Save the following sample manifest as `httproute.yaml`: + + ``` + apiVersion: gateway.networking.k8s.io/v1 + kind: HTTPRoute + metadata: + name: llm-route + spec: + parentRefs: + - group: gateway.networking.k8s.io + kind: Gateway + name: inference-gateway + rules: + backendRefs: + - group: inference.networking.x-k8s.io + kind: InferencePool + name: llm-pool + weight: 90 + - group: inference.networking.x-k8s.io + kind: InferencePool + name: llm-pool-version-2 + weight: 10 + ``` + +1. 
Apply the sample manifest to your cluster: + + ``` + kubectl apply -f httproute.yaml + ``` -## Spec + The original `llm-pool` InferencePool receives most of the traffic, while the `llm-pool-version-2` InferencePool receives the rest. -The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool). \ No newline at end of file +1. Increase the traffic weight gradually for the `llm-pool-version-2` InferencePool to complete the base model update roll out. From 66c086070d45766261173f242f739b24d36d43ed Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Wed, 9 Apr 2025 14:13:44 -0700 Subject: [PATCH 02/22] Add extensionReference to the InferencePool spec --- site-src/reference/spec.md | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/site-src/reference/spec.md b/site-src/reference/spec.md index e16c113c1..e2d3d3164 100644 --- a/site-src/reference/spec.md +++ b/site-src/reference/spec.md @@ -135,8 +135,29 @@ _Appears in:_ | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `selector` _object (keys:[LabelKey](#labelkey), values:[LabelValue](#labelvalue))_ | Selector uses a map of label to watch model server pods
that should be included in the InferencePool. ModelServers should not
be with any other Service or InferencePool, that behavior is not supported
and will result in sub-optimal utilization.
In some cases, implementations may translate this to a Service selector, so this matches the simple
map used for Service selectors instead of the full Kubernetes LabelSelector type. | | Required: \{\}
| -| `targetPortNumber` _integer_ | TargetPortNumber is the port number that the model servers within the pool expect
to receive traffic from.
This maps to the TargetPort in: https://pkg.go.dev/k8s.io/api/core/v1#ServicePort | | Maximum: 65535
Minimum: 0
Required: \{\}
| +| `selector` _object (keys:[LabelKey](#labelkey), values:[LabelValue](#labelvalue))_ | Selector uses a map of label to watch model server pod that should be included in the InferencePool. ModelServers should not be with any other Service or InferencePool, that behavior is not supported and will result in sub-optimal utilization.
In some cases, implementations may translate this to a Service selector, so this matches the simple map used for Service selectors instead of the full Kubernetes LabelSelector type. | | Required: \{\}
| +| `targetPortNumber` _integer_ | TargetPortNumber is the port number that the model servers within the pool expect to receive traffic from.
This maps to the TargetPort in: https://pkg.go.dev/k8s.io/api/core/v1#ServicePort | | Maximum: 65535
Minimum: 0
Required: \{\}
| +| `extensionRef` _[Extension](#extension)_ | ExtensionRef configures the endpoint picker service extension that monitors key metrics from model servers within the InferencePool and provides intelligent routing decisions. | | Required: \{\}
| + + +#### Extension + + + +Extension specifies how to configure an extension that runs the endpoint picker. + + + +_Appears in:_ +- [InferencePoolSpec](#inferencepoolspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `group` _string_ | Group is the group of the extension reference. | "" | | +| `kind` _string_ | Kind is the kind of the extension reference. | Service | | +| `name` _string_ | Name is the name of the extension reference. | | Required: \{\}
| +| `portNumber` _integer_ | PortNumber is the port number on the service running the extension. | 9002 | | +| `failureMode` _string_ | FailureMode configures how the gateway handles the case when the extension is not responsive. | FailClose | | #### InferencePoolStatus From c4abfe5d5823636438fd445d8ec7fdb59e29db16 Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Wed, 9 Apr 2025 14:24:54 -0700 Subject: [PATCH 03/22] Fix list formatting --- site-src/api-types/inferencepool.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md index b994cdef4..03c5b6a8b 100644 --- a/site-src/api-types/inferencepool.md +++ b/site-src/api-types/inferencepool.md @@ -53,10 +53,11 @@ spec: ``` In this example: + - An InferencePool named `vllm-llama3-8b-instruct` is created in the `default` namespace. - It will select Pods that have the label `app: vllm-llama3-8b-instruct`. -- Traffic routed to this InferencePool will call out to the EPP service `vllm-llama3-8b-instruct-epp` on port 9002 for making routing decisions. If EPP fails to pick an endpoint, or is not responsive, the request will be dropped. -- Traffic routed to this InferencePool will be forwarded to the port 8000 on the selected Pods. +- Traffic routed to this InferencePool will call out to the EPP service `vllm-llama3-8b-instruct-epp` on port `9002` for making routing decisions. If EPP fails to pick an endpoint, or is not responsive, the request will be dropped. +- Traffic routed to this InferencePool will be forwarded to the port `8000` on the selected Pods. ## Overlap with Service From 63a60a2024002d0250e386de5a2e3648be06746c Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Wed, 9 Apr 2025 15:39:05 -0700 Subject: [PATCH 04/22] Remove unused labels --- site-src/api-types/inferencepool.md | 1 - 1 file changed, 1 deletion(-) diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md index 03c5b6a8b..00372d6fb 100644 --- a/site-src/api-types/inferencepool.md +++ b/site-src/api-types/inferencepool.md @@ -40,7 +40,6 @@ Here is an example InferencePool configuration: apiVersion: inference.networking.x-k8s.io/v1alpha2 kind: InferencePool metadata: - labels: name: vllm-llama3-8b-instruct spec: targetPortNumber: 8000 From 3ebb03f6a836d6115d94046ef1cb208c710f1e0b Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Thu, 10 Apr 2025 16:40:12 +0000 Subject: [PATCH 05/22] Autogenerate the spec --- site-src/reference/spec.md | 295 ++++++++++++++++++++++++++++++------- 1 file changed, 241 insertions(+), 54 deletions(-) diff --git a/site-src/reference/spec.md b/site-src/reference/spec.md index e2d3d3164..d8e0c95bf 100644 --- a/site-src/reference/spec.md +++ b/site-src/reference/spec.md @@ -1,12 +1,14 @@ # API Reference ## Packages -- [inference.networking.x-k8s.io/v1alpha1](#inferencenetworkingx-k8siov1alpha1) +- [inference.networking.x-k8s.io/v1alpha2](#inferencenetworkingx-k8siov1alpha2) -## inference.networking.x-k8s.io/v1alpha1 +## inference.networking.x-k8s.io/v1alpha2 + +Package v1alpha2 contains API Schema definitions for the +inference.networking.x-k8s.io API group. -Package v1alpha1 contains API Schema definitions for the gateway v1alpha1 API group ### Resource Types - [InferenceModel](#inferencemodel) @@ -18,26 +20,152 @@ Package v1alpha1 contains API Schema definitions for the gateway v1alpha1 API gr _Underlying type:_ _string_ -Defines how important it is to serve the model compared to other models. 
+Criticality defines how important it is to serve the model compared to other models. +Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional(use a pointer), and set no default. +This allows us to union this with a oneOf field in the future should we wish to adjust/extend this behavior. _Validation:_ -- Enum: [Critical Default Sheddable] +- Enum: [Critical Standard Sheddable] _Appears in:_ - [InferenceModelSpec](#inferencemodelspec) | Field | Description | | --- | --- | -| `Critical` | Most important. Requests to this band will be shed last.
| -| `Default` | More important than Sheddable, less important than Critical.
Requests in this band will be shed before critical traffic.
+kubebuilder:default=Default
| -| `Sheddable` | Least important. Requests to this band will be shed before all other bands.
| +| `Critical` | Critical defines the highest level of criticality. Requests to this band will be shed last.
| +| `Standard` | Standard defines the base criticality level and is more important than Sheddable but less
important than Critical. Requests in this band will be shed before critical traffic.
Most models are expected to fall within this band.
| +| `Sheddable` | Sheddable defines the lowest level of criticality. Requests to this band will be shed before
all other bands.
| + + +#### EndpointPickerConfig + + + +EndpointPickerConfig specifies the configuration needed by the proxy to discover and connect to the endpoint picker extension. +This type is intended to be a union of mutually exclusive configuration options that we may add in the future. + + + +_Appears in:_ +- [InferencePoolSpec](#inferencepoolspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `extensionRef` _[Extension](#extension)_ | Extension configures an endpoint picker as an extension service. | | Required: \{\}
| + + +#### Extension + + + +Extension specifies how to configure an extension that runs the endpoint picker. + + + +_Appears in:_ +- [EndpointPickerConfig](#endpointpickerconfig) +- [InferencePoolSpec](#inferencepoolspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `group` _[Group](#group)_ | Group is the group of the referent.
The default value is "", representing the Core API group. | | MaxLength: 253
Pattern: `^$\|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`
| +| `kind` _[Kind](#kind)_ | Kind is the Kubernetes resource kind of the referent. For example
"Service".
Defaults to "Service" when not specified.
ExternalName services can refer to CNAME DNS records that may live
outside of the cluster and as such are difficult to reason about in
terms of conformance. They also may not be safe to forward to (see
CVE-2021-25740 for more information). Implementations MUST NOT
support ExternalName Services. | Service | MaxLength: 63
MinLength: 1
Pattern: `^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$`
| +| `name` _[ObjectName](#objectname)_ | Name is the name of the referent. | | MaxLength: 253
MinLength: 1
Required: \{\}
| +| `portNumber` _[PortNumber](#portnumber)_ | The port number on the service running the extension. When unspecified,
implementations SHOULD infer a default value of 9002 when the Kind is
Service. | | Maximum: 65535
Minimum: 1
| +| `failureMode` _[ExtensionFailureMode](#extensionfailuremode)_ | Configures how the gateway handles the case when the extension is not responsive.
Defaults to failClose. | FailClose | Enum: [FailOpen FailClose]
| + + +#### ExtensionConnection + + + +ExtensionConnection encapsulates options that configures the connection to the extension. + + + +_Appears in:_ +- [Extension](#extension) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `failureMode` _[ExtensionFailureMode](#extensionfailuremode)_ | Configures how the gateway handles the case when the extension is not responsive.
Defaults to failClose. | FailClose | Enum: [FailOpen FailClose]
| + + +#### ExtensionFailureMode + +_Underlying type:_ _string_ + +ExtensionFailureMode defines the options for how the gateway handles the case when the extension is not +responsive. + +_Validation:_ +- Enum: [FailOpen FailClose] + +_Appears in:_ +- [Extension](#extension) +- [ExtensionConnection](#extensionconnection) + +| Field | Description | +| --- | --- | +| `FailOpen` | FailOpen specifies that the proxy should not drop the request and forward the request to and endpoint of its picking.
| +| `FailClose` | FailClose specifies that the proxy should drop the request.
| + + +#### ExtensionReference + + + +ExtensionReference is a reference to the extension deployment. + + + +_Appears in:_ +- [Extension](#extension) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `group` _[Group](#group)_ | Group is the group of the referent.
The default value is "", representing the Core API group. | | MaxLength: 253
Pattern: `^$\|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`
| +| `kind` _[Kind](#kind)_ | Kind is the Kubernetes resource kind of the referent. For example
"Service".
Defaults to "Service" when not specified.
ExternalName services can refer to CNAME DNS records that may live
outside of the cluster and as such are difficult to reason about in
terms of conformance. They also may not be safe to forward to (see
CVE-2021-25740 for more information). Implementations MUST NOT
support ExternalName Services. | Service | MaxLength: 63
MinLength: 1
Pattern: `^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$`
| +| `name` _[ObjectName](#objectname)_ | Name is the name of the referent. | | MaxLength: 253
MinLength: 1
Required: \{\}
| +| `portNumber` _[PortNumber](#portnumber)_ | The port number on the service running the extension. When unspecified,
implementations SHOULD infer a default value of 9002 when the Kind is
Service. | | Maximum: 65535
Minimum: 1
| + + +#### Group + +_Underlying type:_ _string_ + +Group refers to a Kubernetes Group. It must either be an empty string or a +RFC 1123 subdomain. + +This validation is based off of the corresponding Kubernetes validation: +https://github.com/kubernetes/apimachinery/blob/02cfb53916346d085a6c6c7c66f882e3c6b0eca6/pkg/util/validation/validation.go#L208 + +Valid values include: + +* "" - empty string implies core Kubernetes API group +* "gateway.networking.k8s.io" +* "foo.example.com" + +Invalid values include: + +* "example.com/bar" - "/" is an invalid character + +_Validation:_ +- MaxLength: 253 +- Pattern: `^$|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$` + +_Appears in:_ +- [Extension](#extension) +- [ExtensionReference](#extensionreference) +- [PoolObjectReference](#poolobjectreference) + #### InferenceModel -InferenceModel is the Schema for the InferenceModels API +InferenceModel is the Schema for the InferenceModels API. @@ -45,29 +173,31 @@ InferenceModel is the Schema for the InferenceModels API | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `apiVersion` _string_ | `inference.networking.x-k8s.io/v1alpha1` | | | +| `apiVersion` _string_ | `inference.networking.x-k8s.io/v1alpha2` | | | | `kind` _string_ | `InferenceModel` | | | | `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | | | `spec` _[InferenceModelSpec](#inferencemodelspec)_ | | | | | `status` _[InferenceModelStatus](#inferencemodelstatus)_ | | | | + + + + #### InferenceModelSpec -InferenceModelSpec represents a specific model use case. This resource is +InferenceModelSpec represents the desired state of a specific model use case. This resource is managed by the "Inference Workload Owner" persona. - -The Inference Workload Owner persona is: a team that trains, verifies, and +The Inference Workload Owner persona is someone that trains, verifies, and leverages a large language model from a model frontend, drives the lifecycle and rollout of new versions of those models, and defines the specific performance and latency goals for the model. These workloads are expected to operate within an InferencePool sharing compute capacity with other InferenceModels, defined by the Inference Platform Admin. - InferenceModel's modelName (not the ObjectMeta name) is unique for a given InferencePool, if the name is reused, an error will be shown on the status of a InferenceModel that attempted to reuse. The oldest InferenceModel, based on @@ -81,10 +211,10 @@ _Appears in:_ | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `modelName` _string_ | The name of the model as the users set in the "model" parameter in the requests.
The name should be unique among the workloads that reference the same backend pool.
This is the parameter that will be used to match the request with. In the future, we may
allow to match on other request parameters. The other approach to support matching on
on other request parameters is to use a different ModelName per HTTPFilter.
Names can be reserved without implementing an actual model in the pool.
This can be done by specifying a target model and setting the weight to zero,
an error will be returned specifying that no valid target model is found. | | MaxLength: 253
| -| `criticality` _[Criticality](#criticality)_ | Defines how important it is to serve the model compared to other models referencing the same pool. | Default | Enum: [Critical Default Sheddable]
| -| `targetModels` _[TargetModel](#targetmodel) array_ | Allow multiple versions of a model for traffic splitting.
If not specified, the target model name is defaulted to the modelName parameter.
modelName is often in reference to a LoRA adapter. | | MaxItems: 10
| -| `poolRef` _[PoolObjectReference](#poolobjectreference)_ | Reference to the inference pool, the pool must exist in the same namespace. | | Required: \{\}
| +| `modelName` _string_ | ModelName is the name of the model as it will be set in the "model" parameter for an incoming request.
ModelNames must be unique for a referencing InferencePool
(names can be reused for a different pool in the same cluster).
The modelName with the oldest creation timestamp is retained, and the incoming
InferenceModel is sets the Ready status to false with a corresponding reason.
In the rare case of a race condition, one Model will be selected randomly to be considered valid, and the other rejected.
Names can be reserved without an underlying model configured in the pool.
This can be done by specifying a target model and setting the weight to zero,
an error will be returned specifying that no valid target model is found. | | MaxLength: 256
Required: \{\}
| +| `criticality` _[Criticality](#criticality)_ | Criticality defines how important it is to serve the model compared to other models referencing the same pool.
Criticality impacts how traffic is handled in resource constrained situations. It handles this by
queuing or rejecting requests of lower criticality. InferenceModels of an equivalent Criticality will
fairly share resources over throughput of tokens. In the future, the metric used to calculate fairness,
and the proportionality of fairness will be configurable.
Default values for this field will not be set, to allow for future additions of new fields that may 'one of' with this field.
Any implementations that may consume this field may treat an unset value as the 'Standard' range. | | Enum: [Critical Standard Sheddable]
| +| `targetModels` _[TargetModel](#targetmodel) array_ | TargetModels allow multiple versions of a model for traffic splitting.
If not specified, the target model name is defaulted to the modelName parameter.
modelName is often in reference to a LoRA adapter. | | MaxItems: 10
| +| `poolRef` _[PoolObjectReference](#poolobjectreference)_ | PoolRef is a reference to the inference pool, the pool must exist in the same namespace. | | Required: \{\}
| #### InferenceModelStatus @@ -100,14 +230,14 @@ _Appears in:_ | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#condition-v1-meta) array_ | Conditions track the state of the InferencePool. | | | +| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#condition-v1-meta) array_ | Conditions track the state of the InferenceModel.
Known condition types are:
* "Accepted" | [map[lastTransitionTime:1970-01-01T00:00:00Z message:Waiting for controller reason:Pending status:Unknown type:Ready]] | MaxItems: 8
| #### InferencePool -InferencePool is the Schema for the Inferencepools API +InferencePool is the Schema for the InferencePools API. @@ -115,13 +245,17 @@ InferencePool is the Schema for the Inferencepools API | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `apiVersion` _string_ | `inference.networking.x-k8s.io/v1alpha1` | | | +| `apiVersion` _string_ | `inference.networking.x-k8s.io/v1alpha2` | | | | `kind` _string_ | `InferencePool` | | | | `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | | | `spec` _[InferencePoolSpec](#inferencepoolspec)_ | | | | | `status` _[InferencePoolStatus](#inferencepoolstatus)_ | | | | + + + + #### InferencePoolSpec @@ -135,71 +269,74 @@ _Appears in:_ | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `selector` _object (keys:[LabelKey](#labelkey), values:[LabelValue](#labelvalue))_ | Selector uses a map of label to watch model server pod that should be included in the InferencePool. ModelServers should not be with any other Service or InferencePool, that behavior is not supported and will result in sub-optimal utilization.
In some cases, implementations may translate this to a Service selector, so this matches the simple map used for Service selectors instead of the full Kubernetes LabelSelector type. | | Required: \{\}
| -| `targetPortNumber` _integer_ | TargetPortNumber is the port number that the model servers within the pool expect to receive traffic from.
This maps to the TargetPort in: https://pkg.go.dev/k8s.io/api/core/v1#ServicePort | | Maximum: 65535
Minimum: 0
Required: \{\}
| -| `extensionRef` _[Extension](#extension)_ | ExtensionRef configures the endpoint picker service extension that monitors key metrics from model servers within the InferencePool and provides intelligent routing decisions. | | Required: \{\}
| +| `selector` _object (keys:[LabelKey](#labelkey), values:[LabelValue](#labelvalue))_ | Selector defines a map of labels to watch model server pods
that should be included in the InferencePool.
In some cases, implementations may translate this field to a Service selector, so this matches the simple
map used for Service selectors instead of the full Kubernetes LabelSelector type.
If specified, it will be applied to match the model server pods in the same namespace as the InferencePool.
Cross-namespace selectors are not supported. | | Required: \{\}
| +| `targetPortNumber` _integer_ | TargetPortNumber defines the port number to access the selected model servers.
The number must be in the range 1 to 65535. | | Maximum: 65535
Minimum: 1
Required: \{\}
| +| `extensionRef` _[Extension](#extension)_ | Extension configures an endpoint picker as an extension service. | | Required: \{\}
| -#### Extension +#### InferencePoolStatus -Extension specifies how to configure an extension that runs the endpoint picker. +InferencePoolStatus defines the observed state of InferencePool _Appears in:_ -- [InferencePoolSpec](#inferencepoolspec) +- [InferencePool](#inferencepool) | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `group` _string_ | Group is the group of the extension reference. | "" | | -| `kind` _string_ | Kind is the kind of the extension reference. | Service | | -| `name` _string_ | Name is the name of the extension reference. | | Required: \{\}
| -| `portNumber` _integer_ | PortNumber is the port number on the service running the extension. | 9002 | | -| `failureMode` _string_ | FailureMode configures how the gateway handles the case when the extension is not responsive. | FailClose | | +| `parent` _[PoolStatus](#poolstatus) array_ | Parents is a list of parent resources (usually Gateways) that are
associated with the InferencePool, and the status of the InferencePool with respect to
each parent.
A maximum of 32 Gateways will be represented in this list. An empty list
means the InferencePool has not been attached to any Gateway. | | MaxItems: 32
| -#### InferencePoolStatus +#### Kind + +_Underlying type:_ _string_ +Kind refers to a Kubernetes Kind. +Valid values include: -InferencePoolStatus defines the observed state of InferencePool +* "Service" +* "HTTPRoute" +Invalid values include: +* "invalid/kind" - "/" is an invalid character + +_Validation:_ +- MaxLength: 63 +- MinLength: 1 +- Pattern: `^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$` _Appears in:_ -- [InferencePool](#inferencepool) +- [Extension](#extension) +- [ExtensionReference](#extensionreference) +- [PoolObjectReference](#poolobjectreference) -| Field | Description | Default | Validation | -| --- | --- | --- | --- | -| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#condition-v1-meta) array_ | Conditions track the state of the InferencePool. | | | #### LabelKey _Underlying type:_ _string_ -Originally copied from: https://github.com/kubernetes-sigs/gateway-api/blob/99a3934c6bc1ce0874f3a4c5f20cafd8977ffcb4/apis/v1/shared_types.go#L694-L731 +LabelKey was originally copied from: https://github.com/kubernetes-sigs/gateway-api/blob/99a3934c6bc1ce0874f3a4c5f20cafd8977ffcb4/apis/v1/shared_types.go#L694-L731 Duplicated as to not take an unexpected dependency on gw's API. - LabelKey is the key of a label. This is used for validation of maps. This matches the Kubernetes "qualified name" validation that is used for labels. - +Labels are case sensitive, so: my-label and My-Label are considered distinct. Valid values include: - * example * example.com * example.com/path * example.com/path.html - Invalid values include: - * example~ - "~" is an invalid character * example.com. - can not start or end with "." @@ -223,10 +360,8 @@ of maps. This matches the Kubernetes label validation rules: * unless empty, must begin and end with an alphanumeric character ([a-z0-9A-Z]), * could contain dashes (-), underscores (_), dots (.), and alphanumerics between. - Valid values include: - * MyValue * my.name * 123-my-value @@ -241,6 +376,25 @@ _Appears in:_ +#### ObjectName + +_Underlying type:_ _string_ + +ObjectName refers to the name of a Kubernetes object. +Object names can have a variety of forms, including RFC 1123 subdomains, +RFC 1123 labels, or RFC 1035 labels. + +_Validation:_ +- MaxLength: 253 +- MinLength: 1 + +_Appears in:_ +- [Extension](#extension) +- [ExtensionReference](#extensionreference) +- [PoolObjectReference](#poolobjectreference) + + + #### PoolObjectReference @@ -255,9 +409,42 @@ _Appears in:_ | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `group` _string_ | Group is the group of the referent. | inference.networking.x-k8s.io | MaxLength: 253
Pattern: `^$\|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`
| -| `kind` _string_ | Kind is kind of the referent. For example "InferencePool". | InferencePool | MaxLength: 63
MinLength: 1
Pattern: `^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$`
| -| `name` _string_ | Name is the name of the referent. | | MaxLength: 253
MinLength: 1
Required: \{\}
| +| `group` _[Group](#group)_ | Group is the group of the referent. | inference.networking.x-k8s.io | MaxLength: 253
Pattern: `^$\|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`
| +| `kind` _[Kind](#kind)_ | Kind is kind of the referent. For example "InferencePool". | InferencePool | MaxLength: 63
MinLength: 1
Pattern: `^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$`
| +| `name` _[ObjectName](#objectname)_ | Name is the name of the referent. | | MaxLength: 253
MinLength: 1
Required: \{\}
| + + +#### PoolStatus + + + +PoolStatus defines the observed state of InferencePool from a Gateway. + + + +_Appears in:_ +- [InferencePoolStatus](#inferencepoolstatus) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `parentRef` _[ObjectReference](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#objectreference-v1-core)_ | GatewayRef indicates the gateway that observed state of InferencePool. | | | +| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#condition-v1-meta) array_ | Conditions track the state of the InferencePool.
Known condition types are:
* "Accepted"
* "ResolvedRefs" | [map[lastTransitionTime:1970-01-01T00:00:00Z message:Waiting for controller reason:Pending status:Unknown type:Accepted]] | MaxItems: 8
| + + +#### PortNumber + +_Underlying type:_ _integer_ + +PortNumber defines a network port. + +_Validation:_ +- Maximum: 65535 +- Minimum: 1 + +_Appears in:_ +- [Extension](#extension) +- [ExtensionReference](#extensionreference) + #### TargetModel @@ -267,10 +454,10 @@ _Appears in:_ TargetModel represents a deployed model or a LoRA adapter. The Name field is expected to match the name of the LoRA adapter (or base model) as it is registered within the model server. Inference -Gateway assumes that the model exists on the model server and is the +Gateway assumes that the model exists on the model server and it's the responsibility of the user to validate a correct match. Should a model fail -to exist at request time, the error is processed by the Instance Gateway, -and then emitted on the appropriate InferenceModel object. +to exist at request time, the error is processed by the Inference Gateway +and emitted on the appropriate InferenceModel object. @@ -279,7 +466,7 @@ _Appears in:_ | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `name` _string_ | The name of the adapter as expected by the ModelServer. | | MaxLength: 253
| -| `weight` _integer_ | Weight is used to determine the proportion of traffic that should be
sent to this target model when multiple versions of the model are specified. | 1 | Maximum: 1e+06
Minimum: 0
| +| `name` _string_ | Name is the name of the adapter or base model, as expected by the ModelServer. | | MaxLength: 253
Required: \{\}
| +| `weight` _integer_ | Weight is used to determine the proportion of traffic that should be
sent to this model when multiple target models are specified.
Weight defines the proportion of requests forwarded to the specified
model. This is computed as weight/(sum of all weights in this
TargetModels list). For non-zero values, there may be some epsilon from
the exact proportion defined here depending on the precision an
implementation supports. Weight is not a percentage and the sum of
weights does not need to equal 100.
If a weight is set for any targetModel, it must be set for all targetModels.
Conversely weights are optional, so long as ALL targetModels do not specify a weight. | | Maximum: 1e+06
Minimum: 1
| From fac37a9c5727a85f67d6a31119f912ef499f1483 Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Fri, 11 Apr 2025 09:28:27 -0700 Subject: [PATCH 06/22] Update site-src/api-types/inferencepool.md Co-authored-by: Rob Scott --- site-src/api-types/inferencepool.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md index 00372d6fb..1ae33cf4e 100644 --- a/site-src/api-types/inferencepool.md +++ b/site-src/api-types/inferencepool.md @@ -7,7 +7,7 @@ ## Background -The **InferencePool** Kubernetes custom resource defines a group of Pods (containers) that share the same compute configuration, accelerator type, base language model, and model server. This logically groups and manages your AI model serving resources, which offers administrative configuration to the Platform Admin. +The **InferencePool** API defines a group of Pods (containers) that share the same compute configuration, accelerator type, base language model, and model server. This logically groups and manages your AI model serving resources, which offers administrative configuration to the Platform Admin. It is expected for the InferencePool to: From 5953c37f1f828549f49160b1276fa2a1c5f6bb9c Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Fri, 11 Apr 2025 09:28:44 -0700 Subject: [PATCH 07/22] Update site-src/api-types/inferencepool.md Co-authored-by: Rob Scott --- site-src/api-types/inferencepool.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md index 1ae33cf4e..9a6cfa6e5 100644 --- a/site-src/api-types/inferencepool.md +++ b/site-src/api-types/inferencepool.md @@ -20,7 +20,7 @@ It is _not_ expected for the InferencePool to: - Manage Deployments of Pods within the pool - Manage pod lifecycle of Pods within the pool -Additionally, any Pod that seeks to join an InferencePool would need to support the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol), defined by this project, to ensure the Pool has adequate information to intelligently route requests. +Additionally, any Pod that seeks to join an InferencePool would need to support the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol), defined by this project, to ensure the Endpoint Picker has adequate information to intelligently route requests. ## How to Configure an InferencePool From 41fa7082806882754e359a0e4960256cb8c61d88 Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Fri, 11 Apr 2025 09:28:59 -0700 Subject: [PATCH 08/22] Update site-src/api-types/inferencepool.md Co-authored-by: Rob Scott --- site-src/api-types/inferencepool.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md index 9a6cfa6e5..28f194e3d 100644 --- a/site-src/api-types/inferencepool.md +++ b/site-src/api-types/inferencepool.md @@ -29,7 +29,7 @@ The full spec of the InferencePool is defined [here](/reference/spec/#inferencep In summary, the InferencePoolSpec consists of 3 major parts: - The `selector` field specifies which Pods belong to this pool. The labels in this selector must exactly match the labels applied to your model server Pods. 
-- The `targetPortNumber` field defines the port number that the model servers within the pool expect to receive traffic from. +- The `targetPortNumber` field defines the port number that the Inference Gateway should route to on model server Pods that belong to this pool. - The `extensionRef` field references the [endpoint picker extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) (EPP) service that monitors key metrics from model servers within the InferencePool and provides intelligent routing decisions. ### Example Configuration From c22d962889c478fcd3c7071409cf0630831a2580 Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Fri, 11 Apr 2025 09:29:47 -0700 Subject: [PATCH 09/22] Update site-src/api-types/inferencepool.md Co-authored-by: Rob Scott --- site-src/api-types/inferencepool.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md index 28f194e3d..142afc610 100644 --- a/site-src/api-types/inferencepool.md +++ b/site-src/api-types/inferencepool.md @@ -73,7 +73,7 @@ This section outlines how to perform gradual rollouts for updating base models b To rollout a new base model: -1. **Deploy new infrastructure**: Create new nodes and a new InferencePool configured with the new base model that you chose. +1. **Deploy new infrastructure**: Create a new InferencePool configured with the new base model that you chose. 1. **Configure traffic distribution**: Use an HTTPRoute to split traffic between the existing InferencePool (which uses the old base model) and the new InferencePool (using the new base model). The `backendRefs.weight` field controls the traffic percentage allocated to each pool. 1. **Maintain InferenceModel integrity**: Keep your InferenceModel configuration unchanged. This ensures that the system applies the same LoRA adapters consistently across both base model versions. 1. **Preserve rollback capability**: Retain the original nodes and InferencePool during the roll out to facilitate a rollback if necessary. From bbda2b5b1c57274baddbd33f7ea6877f17d69415 Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Fri, 11 Apr 2025 09:29:57 -0700 Subject: [PATCH 10/22] Update site-src/api-types/inferencepool.md Co-authored-by: Rob Scott --- site-src/api-types/inferencepool.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md index 142afc610..dc26c4a59 100644 --- a/site-src/api-types/inferencepool.md +++ b/site-src/api-types/inferencepool.md @@ -74,7 +74,7 @@ This section outlines how to perform gradual rollouts for updating base models b To rollout a new base model: 1. **Deploy new infrastructure**: Create a new InferencePool configured with the new base model that you chose. -1. **Configure traffic distribution**: Use an HTTPRoute to split traffic between the existing InferencePool (which uses the old base model) and the new InferencePool (using the new base model). The `backendRefs.weight` field controls the traffic percentage allocated to each pool. +1. **Configure traffic splitting**: Use an HTTPRoute to split traffic between the existing InferencePool (which uses the old base model) and the new InferencePool (using the new base model). The `backendRefs.weight` field controls the traffic percentage allocated to each pool. 1. **Maintain InferenceModel integrity**: Keep your InferenceModel configuration unchanged. 
This ensures that the system applies the same LoRA adapters consistently across both base model versions. 1. **Preserve rollback capability**: Retain the original nodes and InferencePool during the roll out to facilitate a rollback if necessary. From 6848995fe55b4d3e9fc0ca595afb675a78071a0b Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Fri, 11 Apr 2025 09:31:22 -0700 Subject: [PATCH 11/22] Update site-src/api-types/inferencepool.md Co-authored-by: Rob Scott --- site-src/api-types/inferencepool.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md index dc26c4a59..96c9a78c2 100644 --- a/site-src/api-types/inferencepool.md +++ b/site-src/api-types/inferencepool.md @@ -80,7 +80,7 @@ To rollout a new base model: ### Example -You start with an existing lnferencePool named `llm-pool`. To replace the base model, you create a new InferencePool named `llm-pool-version-2`. This pool deploys a new version of the base model on a new set of nodes. By configuring an **HTTPRoute**, as shown below, you can incrementally split traffic between the original llm-pool and llm-pool-version-2. This lets you control base model updates in your cluster. +You start with an existing lnferencePool named `llm-pool-v1`. To replace the base model, you create a new InferencePool named `llm-pool-v2`. This pool deploys a new version of the base model on a new set of Pods. By configuring an **HTTPRoute**, as shown below, you can incrementally split traffic between the original llm-pool and llm-pool-version-2. This lets you control base model updates in your cluster. 1. Save the following sample manifest as `httproute.yaml`: From 1c14218d56e5da93eef00c14a96436f00b9f28e7 Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Fri, 11 Apr 2025 09:33:47 -0700 Subject: [PATCH 12/22] Rename llm-pool names in rollout example --- site-src/api-types/inferencepool.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md index 96c9a78c2..d6835545d 100644 --- a/site-src/api-types/inferencepool.md +++ b/site-src/api-types/inferencepool.md @@ -80,7 +80,7 @@ To rollout a new base model: ### Example -You start with an existing lnferencePool named `llm-pool-v1`. To replace the base model, you create a new InferencePool named `llm-pool-v2`. This pool deploys a new version of the base model on a new set of Pods. By configuring an **HTTPRoute**, as shown below, you can incrementally split traffic between the original llm-pool and llm-pool-version-2. This lets you control base model updates in your cluster. +You start with an existing lnferencePool named `llm-pool-v1`. To replace the base model, you create a new InferencePool named `llm-pool-v2`. This pool deploys a new version of the base model on a new set of Pods. By configuring an **HTTPRoute**, as shown below, you can incrementally split traffic between the original `llm-pool-v1` and `llm-pool-v2`. This lets you control base model updates in your cluster. 1. Save the following sample manifest as `httproute.yaml`: @@ -98,11 +98,11 @@ You start with an existing lnferencePool named `llm-pool-v1`. 
To replace the bas backendRefs: - group: inference.networking.x-k8s.io kind: InferencePool - name: llm-pool + name: llm-pool-v1 weight: 90 - group: inference.networking.x-k8s.io kind: InferencePool - name: llm-pool-version-2 + name: llm-pool-v2 weight: 10 ``` @@ -112,6 +112,6 @@ You start with an existing lnferencePool named `llm-pool-v1`. To replace the bas kubectl apply -f httproute.yaml ``` - The original `llm-pool` InferencePool receives most of the traffic, while the `llm-pool-version-2` InferencePool receives the rest. + The original `llm-pool-v1` InferencePool receives most of the traffic, while the `llm-pool-v2` InferencePool receives the rest. -1. Increase the traffic weight gradually for the `llm-pool-version-2` InferencePool to complete the base model update roll out. +1. Increase the traffic weight gradually for the `llm-pool-v2` InferencePool to complete the base model update roll out. From 68a49a21bff4834f2724892db85bd5e4364fef33 Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Fri, 11 Apr 2025 09:45:25 -0700 Subject: [PATCH 13/22] Add use cases for replacing an inference pool --- site-src/api-types/inferencepool.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md index d6835545d..3fa5b6c73 100644 --- a/site-src/api-types/inferencepool.md +++ b/site-src/api-types/inferencepool.md @@ -69,7 +69,13 @@ The InferencePool is not intended to be a mask of the Service object. It provide ## Replacing an InferencePool -This section outlines how to perform gradual rollouts for updating base models by leveraging new InferencePools and traffic splitting using **HTTPRoute** resources. This approach minimizes service disruption and allows for safe rollbacks. +Replacing an InferencePool is a powerful technique for performing various infrastructure and model updates with minimal disruption and built-in rollback capabilities. This method allows you to introduce changes incrementally, monitor their impact, and revert to the previous state if necessary. + +Use Cases for Replacing an InferencePool: + +- Upgrading or replacing your model server framework +- Upgrading or replacing your base model +- Transitioning to new hardware To rollout a new base model: From 3e4285a81d24e588e9a984c4e38bc16460e375ca Mon Sep 17 00:00:00 2001 From: Nicole Xin Date: Fri, 11 Apr 2025 10:09:58 -0700 Subject: [PATCH 14/22] Rewording the background section --- site-src/api-types/inferencepool.md | 13 ++----------- 1 file changed, 2 insertions(+), 11 deletions(-) diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md index 3fa5b6c73..d574169ac 100644 --- a/site-src/api-types/inferencepool.md +++ b/site-src/api-types/inferencepool.md @@ -7,18 +7,9 @@ ## Background -The **InferencePool** API defines a group of Pods (containers) that share the same compute configuration, accelerator type, base language model, and model server. This logically groups and manages your AI model serving resources, which offers administrative configuration to the Platform Admin. +The **InferencePool** API defines a group of Pods (containers) dedicated to serving AI models. Pods within an InferencePool share the same compute configuration, accelerator type, base language model, and model server. This abstraction simplifies the management of AI model serving resources, providing a centralized point of administrative configuration for Platform Admins. 
-It is expected for the InferencePool to:
-
- - Enforce fair consumption of resources across competing workloads
- - Efficiently route requests across shared compute
-
-It is _not_ expected for the InferencePool to:
-
- - Enforce any common set of adapters are available on the Pods
- - Manage Deployments of Pods within the pool
- - Manage pod lifecycle of Pods within the pool
+An InferencePool is expected to be bundled with an [Endpoint Picker](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) extension. This extension is responsible for tracking key metrics on each model server (i.e. the KV-cache utilization, queue length of pending requests, active LoRA adapters, etc.) and routing incoming inference requests to the optimal model server replica based on these metrics.
 
 Additionally, any Pod that seeks to join an InferencePool would need to support the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol), defined by this project, to ensure the Endpoint Picker has adequate information to intelligently route requests.

From a2945e9a99a0da7d6fd5a5fae4305cc9d25d1222 Mon Sep 17 00:00:00 2001
From: Nicole Xin
Date: Fri, 11 Apr 2025 10:18:43 -0700
Subject: [PATCH 15/22] Create replacing-inference-pool.md

---
 site-src/guides/replacing-inference-pool.md | 54 +++++++++++++++++++++
 1 file changed, 54 insertions(+)
 create mode 100644 site-src/guides/replacing-inference-pool.md

diff --git a/site-src/guides/replacing-inference-pool.md b/site-src/guides/replacing-inference-pool.md
new file mode 100644
index 000000000..621c112d2
--- /dev/null
+++ b/site-src/guides/replacing-inference-pool.md
@@ -0,0 +1,54 @@
+# Replacing an InferencePool
+
+Replacing an InferencePool is a powerful technique for performing various infrastructure and model updates with minimal disruption and built-in rollback capabilities. This method allows you to introduce changes incrementally, monitor their impact, and revert to the previous state if necessary.
+
+Use Cases for Replacing an InferencePool:
+
+- Upgrading or replacing your model server framework
+- Upgrading or replacing your base model
+- Transitioning to new hardware
+
+To replace an InferencePool:
+
+1. **Deploy new infrastructure**: Create a new InferencePool configured with the new hardware / model server / base model that you chose.
+1. **Configure traffic splitting**: Use an HTTPRoute to split traffic between the existing InferencePool and the new InferencePool. The `backendRefs.weight` field controls the traffic percentage allocated to each pool.
+1. **Maintain InferenceModel integrity**: Keep your InferenceModel configuration unchanged. This ensures that the system applies the same LoRA adapters consistently across both base model versions.
+1. **Preserve rollback capability**: Retain the original nodes and InferencePool during the rollout to facilitate a rollback if necessary.
+
+## Example
+
+You start with an existing InferencePool named `llm-pool-v1`. To replace the original InferencePool, you create a new InferencePool named `llm-pool-v2`. By configuring an **HTTPRoute**, as shown below, you can incrementally split traffic between the original `llm-pool-v1` and new `llm-pool-v2`.
+
+1. Save the following sample manifest as `httproute.yaml`:
+
+    ```
+    apiVersion: gateway.networking.k8s.io/v1
+    kind: HTTPRoute
+    metadata:
+      name: llm-route
+    spec:
+      parentRefs:
+      - group: gateway.networking.k8s.io
+        kind: Gateway
+        name: inference-gateway
+      rules:
+      - backendRefs:
+        - group: inference.networking.x-k8s.io
+          kind: InferencePool
+          name: llm-pool-v1
+          weight: 90
+        - group: inference.networking.x-k8s.io
+          kind: InferencePool
+          name: llm-pool-v2
+          weight: 10
+    ```
+
+1. Apply the sample manifest to your cluster:
+
+    ```
+    kubectl apply -f httproute.yaml
+    ```
+
+    The original `llm-pool-v1` InferencePool receives most of the traffic, while the `llm-pool-v2` InferencePool receives the rest.
+
+1. Increase the traffic weight gradually for the `llm-pool-v2` InferencePool to complete the new InferencePool rollout.

From 7520f6fae34f77a59fa342d260a6f1ddd50adcce Mon Sep 17 00:00:00 2001
From: Nicole Xin
Date: Fri, 11 Apr 2025 10:33:28 -0700
Subject: [PATCH 16/22] Replace instructions with a link for how to replace an
 inference pool

---
 site-src/api-types/inferencepool.md | 54 +----------------------------
 1 file changed, 1 insertion(+), 53 deletions(-)

diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md
index d574169ac..6bba95e66 100644
--- a/site-src/api-types/inferencepool.md
+++ b/site-src/api-types/inferencepool.md
@@ -59,56 +59,4 @@ In this example:
 The InferencePool is not intended to be a mask of the Service object. It provides a specialized abstraction tailored for managing and routing traffic to groups of LLM model servers, allowing Platform Admins to focus on pool-level management rather than low-level networking details.
 
 ## Replacing an InferencePool
-
-Replacing an InferencePool is a powerful technique for performing various infrastructure and model updates with minimal disruption and built-in rollback capabilities. This method allows you to introduce changes incrementally, monitor their impact, and revert to the previous state if necessary.
-
-Use Cases for Replacing an InferencePool:
-
-- Upgrading or replacing your model server framework
-- Upgrading or replacing your base model
-- Transitioning to new hardware
-
-To rollout a new base model:
-
-1. **Deploy new infrastructure**: Create a new InferencePool configured with the new base model that you chose.
-1. **Configure traffic splitting**: Use an HTTPRoute to split traffic between the existing InferencePool (which uses the old base model) and the new InferencePool (using the new base model). The `backendRefs.weight` field controls the traffic percentage allocated to each pool.
-1. **Maintain InferenceModel integrity**: Keep your InferenceModel configuration unchanged. This ensures that the system applies the same LoRA adapters consistently across both base model versions.
-1. **Preserve rollback capability**: Retain the original nodes and InferencePool during the roll out to facilitate a rollback if necessary.
-
-### Example
-
-You start with an existing lnferencePool named `llm-pool-v1`. To replace the base model, you create a new InferencePool named `llm-pool-v2`. This pool deploys a new version of the base model on a new set of Pods. By configuring an **HTTPRoute**, as shown below, you can incrementally split traffic between the original `llm-pool-v1` and `llm-pool-v2`. This lets you control base model updates in your cluster.
-
-1. Save the following sample manifest as `httproute.yaml`:
-
-    ```
-    apiVersion: gateway.networking.k8s.io/v1
-    kind: HTTPRoute
-    metadata:
-      name: llm-route
-    spec:
-      parentRefs:
-      - group: gateway.networking.k8s.io
-        kind: Gateway
-        name: inference-gateway
-      rules:
-      backendRefs:
-      - group: inference.networking.x-k8s.io
-        kind: InferencePool
-        name: llm-pool-v1
-        weight: 90
-      - group: inference.networking.x-k8s.io
-        kind: InferencePool
-        name: llm-pool-v2
-        weight: 10
-    ```
-
-1. Apply the sample manifest to your cluster:
-
-    ```
-    kubectl apply -f httproute.yaml
-    ```
-
-    The original `llm-pool-v1` InferencePool receives most of the traffic, while the `llm-pool-v2` InferencePool receives the rest.
-
-1. Increase the traffic weight gradually for the `llm-pool-v2` InferencePool to complete the base model update roll out.
+Please refer to the [Replacing an InferencePool](/guides/replacing-inference-pool) guide for details on use cases and how to replace an InferencePool.

From 860e121825c3cf48d1755fdeed9099880903ba51 Mon Sep 17 00:00:00 2001
From: Nicole Xin
Date: Fri, 11 Apr 2025 10:36:51 -0700
Subject: [PATCH 17/22] Update replacing-inference-pool.md

---
 site-src/guides/replacing-inference-pool.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/site-src/guides/replacing-inference-pool.md b/site-src/guides/replacing-inference-pool.md
index 621c112d2..c17c0fdee 100644
--- a/site-src/guides/replacing-inference-pool.md
+++ b/site-src/guides/replacing-inference-pool.md
@@ -1,7 +1,10 @@
 # Replacing an InferencePool
 
+## Background
+
 Replacing an InferencePool is a powerful technique for performing various infrastructure and model updates with minimal disruption and built-in rollback capabilities. This method allows you to introduce changes incrementally, monitor their impact, and revert to the previous state if necessary.
 
+## Use Cases
 Use Cases for Replacing an InferencePool:
 
 - Upgrading or replacing your model server framework

From 53e16b8ef7936406e7daa04a332ba04bb477cbd6 Mon Sep 17 00:00:00 2001
From: Nicole Xin
Date: Fri, 11 Apr 2025 10:39:22 -0700
Subject: [PATCH 18/22] Update mkdocs.yml

---
 mkdocs.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mkdocs.yml b/mkdocs.yml
index b67cf8b4b..6faa621a1 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -61,6 +61,7 @@ nav:
     - Getting started: guides/index.md
     - Adapter Rollout: guides/adapter-rollout.md
    - Metrics: guides/metrics.md
+    - Replacing an Inference Pool: guides/replacing-inference-pool.md
     - Implementer's Guide: guides/implementers.md
   - Performance:
     - Benchmark: performance/benchmark/index.md

From ca7e02eb8156c2d060319b9c66896c53b5237b0c Mon Sep 17 00:00:00 2001
From: Nicole Xin
Date: Fri, 11 Apr 2025 10:48:40 -0700
Subject: [PATCH 19/22] Update replacing-inference-pool.md

---
 site-src/guides/replacing-inference-pool.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/site-src/guides/replacing-inference-pool.md b/site-src/guides/replacing-inference-pool.md
index c17c0fdee..0abba044a 100644
--- a/site-src/guides/replacing-inference-pool.md
+++ b/site-src/guides/replacing-inference-pool.md
@@ -11,6 +11,8 @@ Use Cases for Replacing an InferencePool:
 - Upgrading or replacing your base model
 - Transitioning to new hardware
 
+## How to replace an InferencePool
+
 To replace an InferencePool:
 
 1. **Deploy new infrastructure**: Create a new InferencePool configured with the new hardware / model server / base model that you chose.
 1. **Configure traffic splitting**: Use an HTTPRoute to split traffic between the existing InferencePool and the new InferencePool. The `backendRefs.weight` field controls the traffic percentage allocated to each pool.
 1. **Maintain InferenceModel integrity**: Keep your InferenceModel configuration unchanged. This ensures that the system applies the same LoRA adapters consistently across both base model versions.
 1. **Preserve rollback capability**: Retain the original nodes and InferencePool during the rollout to facilitate a rollback if necessary.
 
 ### Example
 
 You start with an existing InferencePool named `llm-pool-v1`. To replace the original InferencePool, you create a new InferencePool named `llm-pool-v2`. By configuring an **HTTPRoute**, as shown below, you can incrementally split traffic between the original `llm-pool-v1` and new `llm-pool-v2`.

From 85c9311676e2ce90cf82ef0666f368e1abfbe90c Mon Sep 17 00:00:00 2001
From: Nicole Xin
Date: Wed, 16 Apr 2025 09:19:22 -0700
Subject: [PATCH 20/22] Update inferencemodel_types.go

---
 api/v1alpha2/inferencemodel_types.go | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/api/v1alpha2/inferencemodel_types.go b/api/v1alpha2/inferencemodel_types.go
index 052683d88..7cd98a740 100644
--- a/api/v1alpha2/inferencemodel_types.go
+++ b/api/v1alpha2/inferencemodel_types.go
@@ -126,7 +126,7 @@ type PoolObjectReference struct {
 }
 
 // Criticality defines how important it is to serve the model compared to other models.
-// Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional(use a pointer), and set no default.
+// Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional (use a pointer), and set no default.
 // This allows us to union this with a oneOf field in the future should we wish to adjust/extend this behavior.
 // +kubebuilder:validation:Enum=Critical;Standard;Sheddable
 type Criticality string

From ee580dadb94c9f2a9108d1cc8627a5aabc6c48ab Mon Sep 17 00:00:00 2001
From: Nicole Xin
Date: Fri, 18 Apr 2025 14:56:53 -0700
Subject: [PATCH 21/22] Update inferencepool.md

---
 site-src/api-types/inferencepool.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md
index 6bba95e66..1494d314e 100644
--- a/site-src/api-types/inferencepool.md
+++ b/site-src/api-types/inferencepool.md
@@ -9,7 +9,7 @@
 The **InferencePool** API defines a group of Pods (containers) dedicated to serving AI models. Pods within an InferencePool share the same compute configuration, accelerator type, base language model, and model server. This abstraction simplifies the management of AI model serving resources, providing a centralized point of administrative configuration for Platform Admins.
 
-An InferencePool is expected to be bundled with an [Endpoint Picker](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) extension. This extension is responsible for tracking key metrics on each model server (i.e. the KV-cache utilization, queue length of pending requests, active LoRA adapters, etc.) and routing incoming inference requests to the optimal model server replica based on these metrics.
+An InferencePool is expected to be bundled with an [Endpoint Picker](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) extension. This extension is responsible for tracking key metrics on each model server (i.e. the KV-cache utilization, queue length of pending requests, active LoRA adapters, etc.) and routing incoming inference requests to the optimal model server replica based on these metrics. An EPP can only be associated with a single InferencePool. The associated InferencePool is specified by the [poolName](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/config/manifests/inferencepool-resources.yaml#L54) and [poolNamespace](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/config/manifests/inferencepool-resources.yaml#L56) flags. An HTTPRoute can have multiple backendRefs that reference the same InferencePool and therefore route to the same EPP. An HTTPRoute can have multiple backendRefs that reference different InferencePools and therefore route to different EPPs.
 
 Additionally, any Pod that seeks to join an InferencePool would need to support the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol), defined by this project, to ensure the Endpoint Picker has adequate information to intelligently route requests.

From 4fdfadfb244714a41da09827880f6624cd4eae26 Mon Sep 17 00:00:00 2001
From: Nicole Xin
Date: Fri, 18 Apr 2025 16:20:03 -0700
Subject: [PATCH 22/22] Update site-src/guides/replacing-inference-pool.md

Co-authored-by: Rob Scott
---
 site-src/guides/replacing-inference-pool.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/site-src/guides/replacing-inference-pool.md b/site-src/guides/replacing-inference-pool.md
index 0abba044a..212945706 100644
--- a/site-src/guides/replacing-inference-pool.md
+++ b/site-src/guides/replacing-inference-pool.md
@@ -26,7 +26,7 @@ You start with an existing InferencePool named `llm-pool-v1`. To replace the ori
 1. Save the following sample manifest as `httproute.yaml`:
 
-    ```
+    ```yaml
     apiVersion: gateway.networking.k8s.io/v1
     kind: HTTPRoute
     metadata: