site-src/api-types/inferencemodel.md (+7 -2)
## Background
An InferenceModel allows the Inference Workload Owner to define:

- Which Model/LoRA adapter(s) to consume.
- Mapping from a client-facing model name to the target model name in the InferencePool.
- Traffic splitting between adapters _in the same InferencePool_, allowing new LoRA adapter versions to be rolled out easily.
- Criticality of the requests to the InferenceModel.
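For illustration, a minimal Go sketch of how these fields could be grouped into a spec is shown below. The type and field names (`ModelName`, `PoolRef`, `Criticality`, `TargetModels`) are assumptions made for this sketch, not the authoritative API; the authoritative definition is in the Spec section below.

```go
// Illustrative sketch only: an approximation of an InferenceModel spec in Go.
// All names here are assumptions, not the authoritative API types.
package sketch

// Criticality expresses how important requests for a model are.
type Criticality string

const (
	Critical  Criticality = "Critical"
	Standard  Criticality = "Standard"
	Sheddable Criticality = "Sheddable"
)

// TargetModel maps a share of traffic for the client-facing model name to a
// concrete model or LoRA adapter served by the InferencePool.
type TargetModel struct {
	Name   string // backend model / LoRA adapter name
	Weight int32  // relative share of traffic, enabling adapter rollouts
}

// InferenceModelSpec captures what the Inference Workload Owner defines.
type InferenceModelSpec struct {
	ModelName    string        // client-facing model name
	PoolRef      string        // name of the InferencePool that serves it
	Criticality  Criticality   // how requests are treated under contention
	TargetModels []TargetModel // traffic split across adapters in the same pool
}
```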
## Spec
The full spec of the InferenceModel is defined [here](/reference/spec/#inferencemodel).
site-src/api-types/inferencepool.md (+18 -2)
## Background
The InferencePool resource is a logical grouping of compute resources (e.g., Pods) that run model servers. The InferencePool deploys its own routing and offers administrative configuration to the Platform Admin.

The InferencePool is expected to:

- Enforce fair consumption of resources across competing workloads.
- Efficiently route requests across shared compute (as demonstrated by the PoC).

The InferencePool is _not_ expected to:

- Enforce that any common set of adapters or base models is available on the Pods.
- Manage Deployments of Pods within the Pool.
- Manage the Pod lifecycle of Pods within the Pool.

Additionally, any Pod that seeks to join an InferencePool must support a protocol, defined by this project, to ensure the Pool has adequate information to intelligently route requests.
`InferencePool` has some small overlap with `Service`, displayed here:

<img src="/images/inferencepool-vs-service.png" alt="Comparing InferencePool with Service" class="center" width="550" />
The InferencePool is _not_ intended to be a mask of the Service object; it simply exposes the bare minimum required so that the Platform Admin can focus less on networking and more on Pool management.
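As a rough sketch of how small that administrative surface is, the Go types below approximate an InferencePool spec. The names (`Selector`, `TargetPortNumber`, `ExtensionRef`) are assumptions for illustration only; the authoritative fields are in the Spec section below.

```go
// Illustrative sketch only: an approximation of an InferencePool spec in Go.
// All names here are assumptions, not the authoritative API types.
package sketch

// ExtensionRef points at the routing extension (endpoint picker) that
// implements the protocol Pods in the Pool must support.
type ExtensionRef struct {
	Name string
}

// InferencePoolSpec groups the model server Pods and the extension that routes
// to them; note that it does not manage Deployments or Pod lifecycle.
type InferencePoolSpec struct {
	Selector         map[string]string // label selector for model server Pods
	TargetPortNumber int32             // port the model servers listen on
	ExtensionRef     ExtensionRef      // routing extension for this Pool
}
```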
## Spec
The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool).
The Gateway API Inference Extension project is an extension of the Kubernetes Gateway API for serving Generative AI models on Kubernetes. Gateway API Inference Extension facilitates standardization of APIs for Kubernetes cluster operators and developers running generative AI inference, while allowing flexibility for underlying gateway implementations (such as Envoy Proxy) to iterate on mechanisms for optimized serving of models.
<img src="/images/inference-overview.svg" alt="Overview of API integration" class="center" width="1000" />
## API Resources
### InferencePool
InferencePool represents a set of Inference-focused Pods and an extension that will be used to route to them. Within the broader Gateway API resource model, this resource is considered a "backend". In practice, that means that you'd replace a Kubernetes Service with an InferencePool. This resource has some similarities to Service (a way to select Pods and specify a port), but has some unique capabilities. With InferenceModel, you can configure a routing extension as well as inference-specific routing optimizations. For more information on this resource, refer to our [InferencePool documentation](/api-types/inferencepool.md) or go directly to the [InferencePool spec](/reference/spec/#inferencepool).
### InferenceModel
An InferenceModel represents a model or adapter, and configuration associated with that model. This resource enables you to configure the relative criticality of a model, and allows you to seamlessly translate the requested model name to one or more backend model names. Multiple InferenceModels can be attached to an InferencePool. For more information on this resource, refer to our [InferenceModel documentation](/api-types/inferencemodel.md) or go directly to the [InferenceModel spec](/reference/spec/#inferencemodel).
Before diving into the details of the API, descriptions of the personas these APIs were designed for will help convey the thought process behind the API design.
## Inference Platform Admin
The Inference Platform Admin creates and manages the infrastructure necessary to run LLM workloads, including handling Ops for:

- Hardware
- Model Server
- Base Model
- Resource Allocation for Workloads
- Gateway configuration
- etc.
## Inference Workload Owner
An Inference Workload Owner persona owns and manages 1 or many Generative AI Workloads (LLM focused *currently*). This includes:
<img src="/images/resource-model.png" alt="Gateway API Inference Extension Resource Model" class="center" width="550" />
## Key Features
Gateway API Inference Extension, along with a reference implementation in Envoy Proxy, provides the following key features:
- **Model-aware routing**: Instead of simply routing based on the path of the request, Gateway API Inference Extension allows you to route to models based on model names. This is enabled by support for GenAI Inference API specifications (such as the OpenAI API) in gateway implementations such as Envoy Proxy. This model-aware routing also extends to Low-Rank Adaptation (LoRA) fine-tuned models.
- **Serving priority**: Gateway API Inference Extension allows you to specify the serving priority of your models. For example, you can specify that your models for online inference of chat tasks (which are more latency sensitive) have a higher [*Criticality*](/reference/spec/#criticality) than a model for latency-tolerant tasks such as summarization.
- **Model rollouts**: Gateway API Inference Extension allows you to incrementally roll out new model versions by defining traffic splits based on model names (see the sketch after this list).
- **Extensibility for Inference Services**: Gateway API Inference Extension defines an extensibility pattern for additional Inference services to create bespoke routing capabilities, should out-of-the-box solutions not fit your needs.
- **Customizable Load Balancing for Inference**: Gateway API Inference Extension defines a pattern for customizable load balancing and request routing that is optimized for Inference. It provides a reference implementation of model endpoint picking that leverages metrics emitted from the model servers; this endpoint-picking mechanism can be used in lieu of traditional load balancing mechanisms. Model Server-aware load balancing ("smart" load balancing, as it's sometimes referred to in this repo) has been proven to reduce serving latency and improve accelerator utilization in your clusters.
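To make the model rollout feature concrete, the hypothetical Go example below shows a 90/10 traffic split between two LoRA adapter versions behind one client-facing model name. The types mirror the illustrative InferenceModel sketch earlier in this change, and the model and pool names are made up for the example; none of this is the authoritative API.

```go
// Hypothetical rollout example: split traffic between two adapter versions
// behind a single client-facing model name. Types and names are illustrative.
package main

import "fmt"

type TargetModel struct {
	Name   string // backend model / LoRA adapter name
	Weight int32  // relative share of traffic
}

type InferenceModelSpec struct {
	ModelName    string        // client-facing model name
	PoolRef      string        // InferencePool serving this model
	TargetModels []TargetModel // traffic split across adapter versions
}

func main() {
	// 90% of requests stay on the current adapter while 10% canary the new
	// version; shifting the weights completes the rollout without renaming
	// the client-facing model.
	rollout := InferenceModelSpec{
		ModelName: "chat-assistant",
		PoolRef:   "chat-pool",
		TargetModels: []TargetModel{
			{Name: "chat-assistant-lora-v1", Weight: 90},
			{Name: "chat-assistant-lora-v2", Weight: 10},
		},
	}
	fmt.Printf("%+v\n", rollout)
}
```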