Complete the InferencePool documentation #673


Merged: 24 commits (Apr 23, 2025)
Changes from 19 commits
Commits (24)
ca9a110
Initial guide for inference pool
nicolexin Apr 9, 2025
66c0860
Add extensionReference to the InferencePool spec
nicolexin Apr 9, 2025
c4abfe5
Fix list formatting
nicolexin Apr 9, 2025
63a60a2
Remove unused labels
nicolexin Apr 9, 2025
3ebb03f
Autogenerate the spec
nicolexin Apr 10, 2025
fac37a9
Update site-src/api-types/inferencepool.md
nicolexin Apr 11, 2025
5953c37
Update site-src/api-types/inferencepool.md
nicolexin Apr 11, 2025
41fa708
Update site-src/api-types/inferencepool.md
nicolexin Apr 11, 2025
c22d962
Update site-src/api-types/inferencepool.md
nicolexin Apr 11, 2025
bbda2b5
Update site-src/api-types/inferencepool.md
nicolexin Apr 11, 2025
6848995
Update site-src/api-types/inferencepool.md
nicolexin Apr 11, 2025
1c14218
Rename llm-pool names in rollout example
nicolexin Apr 11, 2025
68a49a2
Add use cases for replacing an inference pool
nicolexin Apr 11, 2025
3e4285a
Rewording the background section
nicolexin Apr 11, 2025
a2945e9
Create replacing-inference-pool.md
nicolexin Apr 11, 2025
7520f6f
Replace instructions with a link for how to replace an inference pool
nicolexin Apr 11, 2025
860e121
Update replacing-inference-pool.md
nicolexin Apr 11, 2025
53e16b8
Update mkdocs.yml
nicolexin Apr 11, 2025
ca7e02e
Update replacing-inference-pool.md
nicolexin Apr 11, 2025
4127b08
Merge branch 'kubernetes-sigs:main' into inferencepool-ref
nicolexin Apr 16, 2025
85c9311
Update inferencemodel_types.go
nicolexin Apr 16, 2025
cce8c0b
Merge branch 'kubernetes-sigs:main' into inferencepool-ref
nicolexin Apr 18, 2025
ee580da
Update inferencepool.md
nicolexin Apr 18, 2025
4fdfadf
Update site-src/guides/replacing-inference-pool.md
nicolexin Apr 18, 2025
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -61,6 +61,7 @@ nav:
- Getting started: guides/index.md
- Adapter Rollout: guides/adapter-rollout.md
- Metrics: guides/metrics.md
- Replacing an Inference Pool: guides/replacing-inference-pool.md
- Implementer's Guide: guides/implementers.md
- Performance:
- Benchmark: performance/benchmark/index.md
58 changes: 43 additions & 15 deletions site-src/api-types/inferencepool.md
@@ -7,28 +7,56 @@

## Background

The InferencePool resource is a logical grouping of compute resources, e.g. Pods, that run model servers. The InferencePool would deploy its own routing, and offer administrative configuration to the Platform Admin.
The **InferencePool** API defines a group of Pods (containers) dedicated to serving AI models. Pods within an InferencePool share the same compute configuration, accelerator type, base language model, and model server. This abstraction simplifies the management of AI model serving resources, providing a centralized point of administrative configuration for Platform Admins.

It is expected for the InferencePool to:
An InferencePool is expected to be bundled with an [Endpoint Picker](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) extension. This extension is responsible for tracking key metrics on each model server (i.e. the KV-cache utilization, queue length of pending requests, active LoRA adapters, etc.) and routing incoming inference requests to the optimal model server replica based on these metrics.
Contributor:

Can an EPP be associated with multiple pools?
Can a Pod belong to more than one pool?

Contributor Author:

Yes, the EPP is associated with the InferencePool, so it is associated with the multiple Pods covered by this InferencePool. You can have one Pod belong to multiple pools.

Contributor (@elevran, Apr 17, 2025):

thanks @nicolexin.
I wanted to clarify whether a single EPP can be associated with multiple InferencePools, not Pods.
Is the mental model for Pool and EPP supporting that, and what changes would need to be made in the EPP to support dispatching to the right Pool (e.g., based on the model name present in the request body)?

Contributor Author:

Ah sorry about misreading your first question. Yes a single EPP can be associated with multiple InferencePools.

The mental model depends on how you define your gateway, for example:

  1. You can configure two separate gateways with different frontend IPs, each backed by a different InferencePool, with both pools sharing the same EPP. In this case, if you hit one of the gateway IPs, the callout extension should only send the available endpoints belonging to that gateway to the EPP, and the EPP will choose an optimal endpoint from that set.
  2. Similarly, you could have one gateway but separate HTTPRoutes/rules, so that on one path (e.g. /v1) requests go to the v1 InferencePool, and on the other path (e.g. /v2) requests go to the v2 InferencePool (see the sketch after this list). Again, in this case the callout extension should only send the available endpoints belonging to that particular routing rule to the EPP, based on where the request was sent.
  3. If there are multiple InferencePools sharing the same serving path, then the EPP will receive a list of endpoints belonging to all of these InferencePools. The EPP needs to probe each of the model server endpoints for available model names and choose one of the endpoints based on the model name present in the request body. (Note that this is a common use case even if you want to support just one InferencePool with a common base model + LoRA adapters.)
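
For example, a rough sketch of option 2 above (illustrative only; it reuses the `inference-gateway` and `llm-pool-v1`/`llm-pool-v2` names from elsewhere in these docs, and the paths are hypothetical):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route-by-path
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  # Requests under /v1 go to the v1 pool; requests under /v2 go to the v2 pool.
  # Each rule only exposes its own pool's endpoints to the (shared) EPP.
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llm-pool-v1
  - matches:
    - path:
        type: PathPrefix
        value: /v2
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llm-pool-v2
```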

Contributor:

> Yes a single EPP can be associated with multiple InferencePools.

An EPP can only be associated with a single InferencePool. The associated InferencePool is specified by the poolName and poolNamespace flags. An HTTPRoute can have multiple backendRefs that reference the same InferencePool and therefore routes to the same EPP. An HTTPRoute can have multiple backendRefs that reference different InferencePools and therefore routes to different EPPs. The implementation guide states:

> InferencePool represents a set of Inference-focused Pods and an extension that will be used to route to them.

The "Inference-focused Pods" define the pool of model servers that share common attributes, and "an extension" defines the EPP responsible for picking which model server in the pool the request should be routed to.

@nicolexin this question has surfaced multiple times recently, so you may want to include documentation in this PR that resolves this confusion.

xref: #145

Collaborator (@kfswain, Apr 21, 2025):

Multi-tenancy comes up a lot. I think at minimum we should be very cautious about it, and potentially even advise against it.

Multi-tenancy would create a single point of failure for multiple sets of accelerators (pools), so an outage would be incredibly painful (loss of revenue + cost of essentially all their expensive accelerators).

Multi-tenancy also puts pressure on any scale issues we may come across, which we currently do not have a good read on.

Contributor:

> but just by looking at our InferencePool API, I don't think other implementations are restricted by one EPP associated with a single inference pool.

PTAL at #145 referenced above. InferencePool can only reference a single EPP config, e.g. Service. On the EPP side, it can only take a single poolName and poolNamespace flag. This is a singular bi-directional binding.
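
To make that binding concrete, a minimal sketch of how an EPP Deployment is pointed at exactly one InferencePool via those flags might look like the following (illustrative only; the image reference and exact flag spelling are assumptions, so consult the reference EPP manifests for the authoritative values):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct-epp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct-epp
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct-epp
    spec:
      containers:
      - name: epp
        # Assumed image reference; substitute the published EPP image.
        image: registry.k8s.io/gateway-api-inference-extension/epp:main
        args:
        # One poolName/poolNamespace pair: the EPP serves exactly one InferencePool.
        - -poolName
        - vllm-llama3-8b-instruct
        - -poolNamespace
        - default
```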

Member:

> On the EPP side, it can only take a single poolName and poolNamespace flag

I'd argue that this is an implementation detail of EPP, and not necessarily a long term or intentional limitation.

Contributor:

> I'd argue that this is an implementation detail of EPP, and not necessarily a long term or intentional limitation.

I agree, and that's why I created #252 to provide additional flexibility for the EPP<>InferencePool(s) binding. My comments are specific to what is possible today.

Member:

I think the revised wording is great here, thanks @nicolexin! It's reflective of what we have today, and we can use #252 to follow up on more flexible mapping. I think everything else has been resolved, and this is a huge improvement to our docs, so I think we should go ahead and merge.


- Enforce fair consumption of resources across competing workloads
- Efficiently route requests across shared compute (as displayed by the PoC)

It is _not_ expected for the InferencePool to:
Additionally, any Pod that seeks to join an InferencePool would need to support the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol), defined by this project, to ensure the Endpoint Picker has adequate information to intelligently route requests.

- Enforce any common set of adapters or base models are available on the Pods
- Manage Deployments of Pods within the Pool
- Manage Pod lifecycle of pods within the pool
## How to Configure an InferencePool

Additionally, any Pod that seeks to join an InferencePool would need to support a protocol, defined by this project, to ensure the Pool has adequate information to intelligently route requests.
The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool).

`InferencePool` has some small overlap with `Service`, displayed here:
In summary, the InferencePoolSpec consists of 3 major parts:

- The `selector` field specifies which Pods belong to this pool. The labels in this selector must exactly match the labels applied to your model server Pods.
- The `targetPortNumber` field defines the port number that the Inference Gateway should route to on model server Pods that belong to this pool.
- The `extensionRef` field references the [endpoint picker extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) (EPP) service that monitors key metrics from model servers within the InferencePool and provides intelligent routing decisions.

### Example Configuration

Here is an example InferencePool configuration:

Member (commenting on the opening code fence):

I think this will lead to better code highlighting:

Suggested change: annotate the opening fence as `yaml` instead of leaving it bare.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama3-8b-instruct
  extensionRef:
    name: vllm-llama3-8b-instruct-epp
    port: 9002
    failureMode: FailClose
```

In this example:

- An InferencePool named `vllm-llama3-8b-instruct` is created in the `default` namespace.
- It will select Pods that have the label `app: vllm-llama3-8b-instruct` (see the verification sketch below).
- Traffic routed to this InferencePool will call out to the EPP service `vllm-llama3-8b-instruct-epp` on port `9002` for making routing decisions. If EPP fails to pick an endpoint, or is not responsive, the request will be dropped.
Member:

Suggested change
- Traffic routed to this InferencePool will call out to the EPP service `vllm-llama3-8b-instruct-epp` on port `9002` for making routing decisions. If EPP fails to pick an endpoint, or is not responsive, the request will be dropped.
- Traffic routed to this InferencePool will call out to the EPP service `vllm-llama3-8b-instruct-epp` on port `9002` for making routing decisions. If EPP fails to pick an endpoint, or is not responsive, the request will be dropped due to "FailClose" being configured as the `failureMode`.

Member:

I think a few of these comments got hidden by GitHub UI, hopefully this helps keep them from disappearing.

- Traffic routed to this InferencePool will be forwarded to the port `8000` on the selected Pods.
Member:

Suggested change
- Traffic routed to this InferencePool will be forwarded to the port `8000` on the selected Pods.
- Traffic routed to this InferencePool will be forwarded to port `8000` on the selected Pods.

Member:

I think a few of these comments got hidden by GitHub UI, hopefully this helps keep them from disappearing.
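
As a quick, illustrative sanity check for the example above (assuming the labels shown and that the InferencePool CRDs are installed), you can verify that the pool's selector actually matches your model server Pods:

```bash
# Pods the example pool's selector would match.
kubectl get pods -l app=vllm-llama3-8b-instruct -o wide

# The InferencePool object itself.
kubectl get inferencepools vllm-llama3-8b-instruct -o yaml
```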


## Overlap with Service

**InferencePool** has some small overlap with **Service**, displayed here:

<!-- Source: https://docs.google.com/presentation/d/11HEYCgFi-aya7FS91JvAfllHiIlvfgcp7qpi_Azjk4E/edit#slide=id.g292839eca6d_1_0 -->
<img src="/images/inferencepool-vs-service.png" alt="Comparing InferencePool with Service" class="center" width="550" />

The InferencePool is _not_ intended to be a mask of the Service object, simply exposing the absolute bare minimum required to allow the Platform Admin to focus less on networking, and more on Pool management.

## Spec
The InferencePool is not intended to be a mask of the Service object. It provides a specialized abstraction tailored for managing and routing traffic to groups of LLM model servers, allowing Platform Admins to focus on pool-level management rather than low-level networking details.

The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool).
## Replacing an InferencePool
Please refer to the [Replacing an InferencePool](/guides/replacing-inference-pool) guide for details on use cases and how to replace an InferencePool.
59 changes: 59 additions & 0 deletions site-src/guides/replacing-inference-pool.md
@@ -0,0 +1,59 @@
# Replacing an InferencePool

## Background

Replacing an InferencePool is a powerful technique for performing various infrastructure and model updates with minimal disruption and built-in rollback capabilities. This method allows you to introduce changes incrementally, monitor their impact, and revert to the previous state if necessary.

## Use Cases
Use cases for replacing an InferencePool include:

- Upgrading or replacing your model server framework
- Upgrading or replacing your base model
- Transitioning to new hardware

## How to replace an InferencePool

To replace an InferencePool:

1. **Deploy new infrastructure**: Create a new InferencePool configured with the new hardware / model server / base model that you chose.
1. **Configure traffic splitting**: Use an HTTPRoute to split traffic between the existing InferencePool and the new InferencePool. The `backendRefs.weight` field controls the traffic percentage allocated to each pool.
1. **Maintain InferenceModel integrity**: Keep your InferenceModel configuration unchanged. This ensures that the system applies the same LoRA adapters consistently across both base model versions.
1. **Preserve rollback capability**: Retain the original nodes and InferencePool during the rollout to facilitate a rollback if necessary.

### Example

You start with an existing InferencePool named `llm-pool-v1`. To replace the original InferencePool, you create a new InferencePool named `llm-pool-v2`. By configuring an **HTTPRoute**, as shown below, you can incrementally split traffic between the original `llm-pool-v1` and new `llm-pool-v2`.

1. Save the following sample manifest as `httproute.yaml`:

    ```yaml
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: llm-route
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llm-pool-v1
          weight: 90
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llm-pool-v2
          weight: 10
    ```

1. Apply the sample manifest to your cluster:

    ```bash
    kubectl apply -f httproute.yaml
    ```

    The original `llm-pool-v1` InferencePool receives most of the traffic, while the `llm-pool-v2` InferencePool receives the rest.

1. Increase the traffic weight gradually for the `llm-pool-v2` InferencePool to complete the new InferencePool rollout; for example, you might end with all traffic on the new pool, as sketched below.
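
A purely illustrative sketch of what the final cutover might look like (an assumption, not part of the guide above): the HTTPRoute keeps the same structure as the earlier example, with the weight shifted entirely to `llm-pool-v2`. Keep `llm-pool-v1` around until you no longer need the rollback path.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    # All traffic now targets the new pool; llm-pool-v1 can be removed from
    # the route (and eventually deleted) once a rollback is no longer needed.
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llm-pool-v2
      weight: 100
```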