Scheduler subsystem high level design proposal #603

smarterclayton · 2025-03-28T17:44:29Z

This PR sets down the basic design principles of the current gateway scheduler. It also identifies the users we are targeting, explains why we prioritize the current approach, and selects standard scheduling terminology that the implementation should adopt.

This is a high level design and thus sets general scope, without expecting to fully address all problems.

k8s-ci-robot · 2025-03-28T17:44:35Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: smarterclayton
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify · 2025-03-28T17:44:46Z

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`5826cec`
🔍 Latest deploy log	https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67e6e62626758b000876dc66
😎 Deploy Preview	https://deploy-preview-603--gateway-api-inference-extension.netlify.app

kfswain · 2025-03-28T19:40:10Z

docs/proposals/006-scheduler/README.md

+
+#### Replacement Scheduler
+
+The replacement scheduler will be a low-latency mechanism for out-of-process execution of the core endpoint selection option.  The replacement scheduler will accept one or more requests to schedule, a list of endpoints, and optionally the associated informer state for those endpoints. The replacement scheduler will return one or zero endpoints per request.


You mention above:

> the scheduler will consult a list of configured scorers to score the matches into a prioritized list of endpoints.

Should that just be the expected output of any scheduler?

It could allow the Personas to do cool parity comparisons, e.g. run these 3 schedulers in parallel, but treat the reference scheduler as the source of truth and only block on its response. That could be a useful and safe way to roll out algorithms on production traffic.

Yeah, the layering is that filtering, scoring, and prioritization are three phases.

Being able to run a shadow schedule is indeed valuable and aligns with the rollout use cases.
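
The three phases and the shadow-scheduling idea discussed in this thread can be sketched as follows. This is a minimal illustration and not the EPP's actual code: `Endpoint`, its metric fields, `schedule`, and `schedule_with_shadows` are hypothetical names, and the thread pool merely stands in for whatever mechanism would run shadow schedulers off the request path.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Endpoint:
    name: str
    queue_depth: int       # hypothetical metric: requests waiting on this server
    kv_cache_free: float   # hypothetical metric: fraction of KV cache free (0..1)


Filter = Callable[[Endpoint], bool]
Scorer = Callable[[Endpoint], float]
Scheduler = Callable[[List[Endpoint]], List[Endpoint]]


def schedule(endpoints: List[Endpoint],
             filters: List[Filter],
             scorers: List[Scorer]) -> List[Endpoint]:
    """Return the endpoints as a prioritized list, best candidate first."""
    # Phase 1: filtering drops endpoints that must not receive the request.
    candidates = [e for e in endpoints if all(f(e) for f in filters)]
    # Phase 2: scoring sums every configured scorer for each candidate.
    scores = {e.name: sum(s(e) for s in scorers) for e in candidates}
    # Phase 3: prioritization orders candidates by descending score.
    return sorted(candidates, key=lambda e: scores[e.name], reverse=True)


def schedule_with_shadows(endpoints: List[Endpoint],
                          reference: Scheduler,
                          shadows: Dict[str, Scheduler],
                          pool: ThreadPoolExecutor,
                          record: Callable[[str, List[Endpoint]], None]) -> List[Endpoint]:
    """Block only on the reference scheduler; shadows report asynchronously."""
    def report(name: str, future: Future) -> None:
        # A failing shadow must never affect the real decision.
        if future.exception() is None:
            record(name, future.result())

    for name, shadow in shadows.items():
        future = pool.submit(shadow, endpoints)
        future.add_done_callback(lambda f, n=name: report(n, f))
    # The reference scheduler is the source of truth for the live request.
    return reference(endpoints)
```

Because every scheduler returns the same prioritized-list shape, the recorded shadow results can be compared position by position against the reference decision.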

kfswain · 2025-03-28T19:41:23Z

docs/proposals/006-scheduler/README.md

+
+ A benchmarking harness will be provided to capture and reproduce a production trace, primarily to aid algorithmic contributors. A small but diverse set of production traces will be used initially to anchor expectations, and scaling both the number of supported traces and efficient regression testing at scale will be critical.
+
+ We anticipate that accelerator availability will limit the scale of e2e testing and contribution. We will develop a **model server stub** that can emulate the behavior of the core expected algorithm for model servers and does not require accelerators. We will support both time-accurate and configurable ratio emulation to allow fast execution.
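
To make the two emulation modes concrete, here is a rough sketch of a stub that needs no accelerator. It only models latency as a linear function of token counts, which is an assumption made for illustration; the class name, the per-token constants, and the `time_ratio` parameter are all hypothetical and not taken from the proposal.

```python
import time
from dataclasses import dataclass


@dataclass
class StubModelServer:
    # Illustrative placeholder constants, not measured values.
    prefill_s_per_token: float = 0.0002
    decode_s_per_token: float = 0.02
    # 1.0 emulates real time; 100.0 runs the same trace 100x faster.
    time_ratio: float = 1.0

    def __post_init__(self) -> None:
        if self.time_ratio <= 0:
            raise ValueError("time_ratio must be positive")

    def simulated_latency(self, prompt_tokens: int, output_tokens: int) -> float:
        """Seconds the emulated server would take at real speed."""
        return (prompt_tokens * self.prefill_s_per_token
                + output_tokens * self.decode_s_per_token)

    def handle(self, prompt_tokens: int, output_tokens: int) -> float:
        """Sleep for the scaled latency, then report the simulated latency."""
        simulated = self.simulated_latency(prompt_tokens, output_tokens)
        time.sleep(simulated / self.time_ratio)
        return simulated
```

Reporting the simulated latency rather than the wall-clock sleep keeps benchmark results comparable between time-accurate and accelerated runs.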


kfswain · 2025-03-28T19:47:50Z

docs/proposals/006-scheduler/README.md

+## Non-Goals
+
+- Dynamic reconfiguration of the reference scheduler algorithms at runtime
+- Being a general scheduler framework for load balancing


I agree with this. I'm trying to think through how we enforce/where we draw the line. What do we consider in scope? There will be some overlap likely, but any addition/improvement should probably have an inference-specific justification.

kfswain · 2025-03-28T19:52:38Z

docs/proposals/006-scheduler/README.md

+- The scheduler should be educatable - extending the [model server protocol](../003-model-server-protocol/) with new metrics or adding a new source of data should be minimally invasive
+- The scheduler should be replaceable - the reference endpoint picker implementation should support delegating scheduling decisions per pool to an alternative **replacement scheduler**
+
+## Non-Goals


I think we should specify that a non-goal would be the concept of:

How to graduate ideas from: fork -> included in reference scheduler -> exposed as EPP config -> config expressed in our API

Something we should figure out, but not in this doc.

> Something we should figure out, but not in this doc.

Something like Gateway API conformance levels?

ahg-g · 2025-03-28T19:53:34Z

docs/proposals/006-scheduler/README.md

+
+We desire the following outcomes from the reference scheduler:
+
+1. Keep model servers optimally utilized without saturating


This is mostly dictated by the QPS, though; the algorithm can't ensure that the model servers don't saturate.

Yeah, should be “allow model servers to more predictably approach saturation” instead

ahg-g · 2025-03-28T19:54:10Z

docs/proposals/006-scheduler/README.md

+
+1. Keep model servers optimally utilized without saturating
+2. Make user-visible request latency more predictable
+3. Provide isolation between multiple workloads on the same model servers before saturation


I recommend defining saturation in the proposal.

Good point, might as well open that doc up

ahg-g · 2025-03-28T20:04:13Z

docs/proposals/006-scheduler/README.md

+
+#### Replacement Scheduler
+
+The replacement scheduler will be a low-latency mechanism for out-of-process execution of the core endpoint selection option.  The replacement scheduler will accept one or more requests to schedule, a list of endpoints, and optionally the associated informer state for those endpoints. The replacement scheduler will return one or zero endpoints per request.


Why does this look like a batch scheduler?

1k QPS will require either streaming or batching. Fair point, though: we don't want to constrain that design yet.

ahg-g · 2025-03-28T20:06:11Z

docs/proposals/006-scheduler/README.md

+
+Given that we anticipate a significant amount of future work to integrate heterogeneous hardware (different generations / topologies) and heterogeneous server roles (prefill-heavy, prefill/decode split, latency objectives), we expect that there will be an **assignment** informer that partitions the candidate endpoints over multiple dimensions for the scheduler.  This will decouple the scheduling algorithm from the process of determining the capacity and suitability of different model servers to different dimensions of request cost.
+
+#### Replacement Scheduler


I am not following what this section is proposing. What do we mean by replacement? Replacing the reference scheduler? Why is this proposal trying to define that?

It's kind of defined above, but implementing a whole EPP is a lot, and I don't want to duplicate everything in another language.

I guess my question is: are we proposing to build a second reference scheduler in this repo?

Just the tools to do so: I'm gonna clean up & finish implementing: https://github.com/kfswain/go-py-interface/tree/main (it's hideous and incomplete, which is why I've been so cagey about it)

So the Replacement Scheduler could be a Python based scheduler, that is called via an EPP fork. If we create a simple ingress interface:

- endpoint map w/metrics
- config params

And a simple egress interface:

- scored endpoints
- maybe specify how many endpoints to duplicate to? (was mentioned as a potential need by prodstack)

implementing a new algo in Python should be straightforward
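
As a rough sketch of what such an interface could look like on the Python side: none of these names come from the linked repository or the proposal. `ScheduleRequest`, `ScheduleResponse`, the `queue_depth` metric, and the `queue_weight` / `duplicate_to` config keys are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScheduleRequest:
    # Ingress: endpoint name -> metric name -> value, plus config params.
    endpoints: Dict[str, Dict[str, float]]
    config: Dict[str, float] = field(default_factory=dict)


@dataclass
class ScoredEndpoint:
    name: str
    score: float


@dataclass
class ScheduleResponse:
    # Egress: endpoints ordered best-first, plus an optional fan-out count.
    scored: List[ScoredEndpoint]
    duplicate_to: int = 1


def least_queue_scheduler(req: ScheduleRequest) -> ScheduleResponse:
    """Toy algorithm: prefer the endpoint with the shortest queue."""
    weight = req.config.get("queue_weight", 1.0)
    scored = [
        ScoredEndpoint(name, -weight * metrics.get("queue_depth", 0.0))
        for name, metrics in req.endpoints.items()
    ]
    # Highest score first, so callers can take the head of the list.
    scored.sort(key=lambda s: s.score, reverse=True)
    return ScheduleResponse(
        scored=scored,
        duplicate_to=int(req.config.get("duplicate_to", 1)),
    )
```

With the request and response kept as plain data, a new algorithm is one function, and the transport between the EPP fork and the Python process can be decided separately.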

shaneutt · 2025-04-02T14:01:57Z

docs/proposals/006-scheduler/README.md

+## Summary
+
+This proposal defines the inference gateway scheduler subsystem and constrains its scope. The scheduler is responsible for determining which endpoints the load balancer should route requests to. Future proposals may extend its scope.
+


We should add a motivation section here. We say what we want to deliver. We seem to need a little more clarity on why we want to deliver that. To my knowledge one of the primary motivating factors here is to avoid sub-optimal routing decisions based on rules (e.g. InferenceModel).

Hrm, I don't know that I would describe it like that. I probably would have said:

The inference gateway leverages insight into the anticipated cost of a request and a dynamic capacity model of the backend to achieve higher utilization and more predictable response latency than random balancing can achieve. It should accomplish this over multiple optimization dimensions including but not limited to: prompt length, anticipated output length, current traffic distribution, available backend kv-cache, workload latency objective, anticipated savings from prefix cache aware routing, heterogenous accelerator performance, and backend topology (such as prefill disaggregation or different model server tuning). This unified model can better serve diverse workloads on shared models with fewer accelerators as it is reactive to current traffic rather than defined up front.

@smarterclayton ^ is awesome and should be documented. Do you think it's appropriate to add ^ to https://gateway-api-inference-extension.sigs.k8s.io/#introduction?

danehans · 2025-04-08T22:09:27Z

docs/proposals/006-scheduler/README.md

+
+## Goals
+
+- The scheduler should be reasonably fast - decide request mapping to endpoints within O(10ms) on average


Do any scheduler benchmarks exist that can be referenced for the O(10ms) target?

A very short LLM request returning 4 tokens could in theory complete in 100ms, so less than 10% overhead relative to serving.

It might be worth adding ^ or something similar to provide context for the O(10ms) target.

danehans · 2025-04-08T22:11:04Z

docs/proposals/006-scheduler/README.md

+- The scheduler should be reasonably fast - decide request mapping to endpoints within O(10ms) on average
+- The scheduler should be effective - requiring little configuration out of the box to get great performance
+- The scheduler should be maintainable - new in-tree features should compose cleanly
+- The scheduler should be forkable - downstream consumers should expect some stability of interface


s/interface/the interface/

danehans · 2025-04-08T22:12:06Z

docs/proposals/006-scheduler/README.md

+- The scheduler should be effective - requiring little configuration out of the box to get great performance
+- The scheduler should be maintainable - new in-tree features should compose cleanly
+- The scheduler should be forkable - downstream consumers should expect some stability of interface
+- The scheduler should be educatable - extending the [model server protocol](../003-model-server-protocol/) with new metrics or adding a new source of data should be minimally invasive


Consider s/educatable/extensible/

danehans · 2025-04-08T22:16:45Z

docs/proposals/006-scheduler/README.md

+
+- Fix a scheduler bug OR Extend the reference scheduler with changes specific to their environment
+- Quickly deploy a custom EPP with their changes to their environment, and sustain that fix until upstream merges
+- Add new test cases to validate their issue is resolved and does not regress


s/regress/introduce a regression/

danehans · 2025-04-08T22:21:54Z

docs/proposals/006-scheduler/README.md

+We desire the following outcomes from the act of using a replacement scheduler:
+
+1. Fast iteration with the ML ecosystem, namely other languages
+2. Benefit from existing informers without having multiple implementations


Can you clarify what you mean by "existing informers"?

danehans · 2025-04-08T22:23:38Z

docs/proposals/006-scheduler/README.md

+
+We expect the following challenges to be addressed by the reference scheduler design:
+
+1. Understand the cost of an incoming request and its impact on the target model server before placing it


Can you clarify what "cost" means, e.g. is this orca?

Cost in this context would be a reasonably accurate estimate of the amount of resources that a given request will consume from the pool, and for how long.

Ex: This 200-token prompt with no prefix cache hit is projected to produce 456 output tokens, so it will take up X amount of GPU memory and should take ~Y time to complete.
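
One way to picture such an estimate is the sketch below. It is a deliberately simplified linear model under stated assumptions, not the scheduler's actual cost function: `CostModel`, `estimate_cost`, and all per-token constants are hypothetical, and it assumes a cached prefix adds no new KV-cache memory and no prefill time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    # Hypothetical per-token constants for one model server.
    kv_bytes_per_token: int
    prefill_s_per_token: float
    decode_s_per_token: float


@dataclass(frozen=True)
class RequestCost:
    kv_cache_bytes: int   # the "X amount of GPU memory"
    duration_s: float     # the "~Y time to complete"


def estimate_cost(model: CostModel,
                  prompt_tokens: int,
                  cached_prefix_tokens: int,
                  predicted_output_tokens: int) -> RequestCost:
    """Estimate the resources one request would consume, and for how long."""
    # Only the part of the prompt that misses the prefix cache is new work.
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    # New KV-cache entries are written for uncached prompt and output tokens.
    kv_bytes = (uncached + predicted_output_tokens) * model.kv_bytes_per_token
    duration = (uncached * model.prefill_s_per_token
                + predicted_output_tokens * model.decode_s_per_token)
    return RequestCost(kv_cache_bytes=kv_bytes, duration_s=duration)
```

For the example above, the call would be `estimate_cost(model, 200, 0, 456)`, where the 456 output tokens are themselves a prediction that a real scheduler would have to obtain from somewhere.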

@kfswain thanks for the feedback. It would be helpful to include the context and example you provided in the doc.

hzxuzhonghu · 2025-04-23T03:50:51Z

docs/proposals/006-scheduler/README.md

+## Non-Goals
+
+- Dynamic reconfiguration of the reference scheduler algorithms at runtime
+- Being a general scheduler framework for load balancing


I can understand that this is targeted to be a generic inference scheduling framework rather than one for common web services.

danehans

@smarterclayton, what are the next steps to move this PR forward? Does the doc change at all now that #677 has merged?

danehans · 2025-04-24T22:00:21Z

docs/proposals/006-scheduler/README.md

+## Summary
+
+This proposal defines the inference gateway scheduler subsystem and constrains its scope. The scheduler is responsible for determining which endpoints the load balancer should route requests to. Future proposals may extend its scope.
+


@smarterclayton ^ is awesome and should be documented. Do you think it's appropriate to add ^ to https://gateway-api-inference-extension.sigs.k8s.io/#introduction?

danehans · 2025-04-24T22:02:11Z

docs/proposals/006-scheduler/README.md

+
+## Goals
+
+- The scheduler should be reasonably fast - decide request mapping to endpoints within O(10ms) on average


It might be worth adding ^ or something similar to provide context for the O(10ms) target.

danehans · 2025-04-24T22:05:45Z

docs/proposals/006-scheduler/README.md

+- The scheduler should be educatable - extending the [model server protocol](../003-model-server-protocol/) with new metrics or adding a new source of data should be minimally invasive
+- The scheduler should be replaceable - the reference endpoint picker implementation should support delegating scheduling decisions per pool to an alternative **replacement scheduler**
+
+## Non-Goals


> Something we should figure out, but not in this doc.

Something like Gateway API conformance levels?

nirrozenbaum · 2025-04-25T05:41:54Z

docs/proposals/006-scheduler/README.md

+-   [Proposal](#proposal)
+    -   [Personas](#personas)
+    -   [Requirements](#requirements)
+    -   [Design](#design)


Do we want to describe in this doc how to configure the scheduler at build time in a fork?
Or is that too much detail for this kind of doc?

k8s-ci-robot requested review from danehans and Jeffwan March 28, 2025 17:44

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 28, 2025

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 28, 2025

smarterclayton force-pushed the docs branch from 37eec8d to 5826cec Compare March 28, 2025 18:10

kfswain reviewed Mar 28, 2025

ahg-g reviewed Mar 28, 2025

shaneutt reviewed Apr 2, 2025

danehans reviewed Apr 9, 2025

hzxuzhonghu reviewed Apr 23, 2025

danehans reviewed Apr 24, 2025

nirrozenbaum reviewed Apr 25, 2025

