
# Initial Scheduler Subsystem interface #845


**Merged** · 4 commits · May 30, 2025
27 changes: 1 addition & 26 deletions docs/proposals/0683-epp-architecture-proposal/README.md
@@ -1,30 +1,9 @@
# Gateway API Inference Extension
# EPP Architecture Proposal

Author(s): @kfswain
## Proposal Status
***Draft***

## Table of Contents

<!-- toc -->

- [Summary](#summary)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Personas](#personas)
- [Inference Platform Admin](#inference-platform-admin)
- [Inference Workload Owner](#workload-owner)
- [Axioms](#axioms)
- [InferencePool](#inferencepool)
- [InferenceModel](#inferencemodel)
- [Spec](#spec)
- [Diagrams](#diagrams)
- [Alternatives](#alternatives)
- [Open Questions](#open-questions)

<!-- /toc -->

## Summary

This proposal seeks to standardize the implementation of an EPP (End-point Picker) for the Inference Gateway extension (also known as Gateway API Inference Extension). Additionally, this proposes to restructure the current implementation of the EPP to be more modular and approachable.
@@ -86,11 +65,7 @@ Due to the possibility of this becoming a bit of a dumping ground. The API will

The flow controller will consume resource regime data, and enforce proper resource sharing between workloads. This will primarily be done through a queuing mechanism [as described here](https://docs.google.com/document/d/1VZL7opFWuwgWquvgiOzLlXAJ633qZ9U-A0ZixGjBgaI/edit?usp=sharing).

#### Scheduling Layer

As the Scheduling Layer is the final interface to the entirety of the pool, all configuration will be at the _pool_ level. The default scheduling layer will be an experimentally-backed LB algorithm, with exposed config values.

The Scheduler will define a strong interface API, so that new scheduling algos may be plugged & dark-launched to test in production traffic without impacting said traffic. Extension is expected to adhere to the [Scheduler Subsystem definition](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/603)

### `Non-extensible`

79 changes: 79 additions & 0 deletions docs/proposals/0845-scheduler-architecture-proposal/README.md
@@ -0,0 +1,79 @@
**Note: This is a work-in-progress proposal; it is not in its final state.**

# Scheduling Subsystem Architecture

Author(s): @kfswain, @ahg-g, @nirrozenbaum
## Proposal Status
***Draft***

## Summary
The Scheduling Subsystem is a framework used to implement scheduling algorithms. A high-level definition is [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/006-scheduler) & the EPP Architecture is [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).

## Design Principles
- The scheduler framework should act as an independent library; there should be no dependency on EPP packages defined outside of the scheduler
- The *framework* should be agnostic to web protocols (such as HTTP), endpoint types (such as model servers), and K8s concepts.
**Contributor:**

Suggested change: `web protocols(such as HTTP)` → `web protocols (such as HTTP)`.

This is a good goal to strive for. Note that the use of HTTP headers to pass information around might not let us achieve it (from a "purist" point of view...).
Q: is "endpoint type" decoupled from vLLM as the engine (kind of, due to specifying MSP)? Do we have endpoint types which are not represented uniquely by an IP (e.g., a single IP with multiple engine instances, each serving a different GPU)?

**Collaborator (author):**

> Note that the use of HTTP headers to pass information around might not let us achieve it (from a "purist" point of view...).

I'm intentionally pushing any HTTP header context outside of the scheduler subsystem and into the request control (I need to update the arch diagram to include request control; won't be able to do so till I return from leave). But this should be very feasible within the scheduler.

> Q: is "endpoint type" decoupled from vLLM as the engine (kind of, due to specifying MSP)? Do we have endpoint types which are not represented uniquely by an IP (e.g., a single IP with multiple engine instances, each serving a different GPU)?

In the context of this subsystem, yes: an endpoint is entirely agnostic to what it maps to; it is up to the system calling the scheduler to care about that. In fact, I'm considering having the scheduler system not be aware of IP/port/networking metadata, as, for the scope of the scheduler, a simple uniquely identifying string suffices for selection. What is done with that selection & how it's used in routing is out of scope for this component. That's my thinking anyway.

**Contributor:**

we may need to know IP address (and port) to encode the P/D information. The selection of P and D targets is known inside the scheduler and I don't think it should "leak" outside. While the primary endpoint is an agreed output and thus can be encoded as part of the EPP protocol with the gateway implementation, we need to be able to encode the secondary (or other) selection as HTTP headers or find a way to communicate it from Scheduler to request control in a standard way.

- Opinions should be held by the plugins, not the framework
- The entry & exit points should be defined by the framework, acting as the API surface of the system
- Multiple scheduling 'profiles' should be able to be run for a single request.
  - They can be conditionally dependent on previous runs, or run in parallel
- Plugin state is managed by the plugin itself
**Contributor:**

At which point does "plugin state" become "system state"?
For example, if a plugin needs data not currently available in the system, is it responsible for collecting and storing it internally?

**Collaborator (author):**

Yeah, this will be something we will need to tune. But to answer:

> For example, if a plugin needs data not currently available in the system, is it responsible for collecting and storing it internally?

Yes, for now, the thinking is it's the responsibility of the plugin to collect unique data, as that will make out-of-tree plugins easier to support.

But I have also been toying with the argument that it is entirely not the scheduler subsystem's responsibility to collect & store data, and that it should be handled by another system. But I worry that would tightly couple a data collection system to plugins, as plugins need some form of data to function properly. A separate data store per plugin is also kind of an anti-pattern, imo.

Ultimately, I think this is a grey area; we probably need to support plugins having their own data store, but advise that a central datastore should be the way endpoint data is handled/stored.

**Contributor:**

> it is entirely not the scheduler subsystem's responsibility to collect & store data, and that should be handled by another system.

Agree.
Need to experiment to test out options and get a better feel for coupling and alternatives.

**Contributor:**

I think three types of state will be sufficient:

  1. State per request. This is managed by what we are calling CycleState and its lifecycle is tied to the request.
  2. State managed by the plugin struct itself. The lifecycle of this state is tied to the plugin, and since plugins will be instantiated once, it is a state that plugins can use across requests (like prefix-cache index).
  3. State managed by the data layer. I am thinking that each endpoint will be associated with state (currently metrics) that a data layer plugin can add to it. A data layer plugin could be one that extracts an additional metric for example.

**Contributor:**

++, agreed.
I think we also need to clarify that CycleState is also passed between profile runs during the lifecycle of a request.


## Definitions
- **Scheduling Framework** - The system created to allow for a pluggable scheduling algorithm.
- **Scheduling Profile** - A named, specific set of Filter(s), Scorer(s), & Picker used to select endpoints.
- **Scheduler** - An extensible implementation of a scheduling algorithm, including logic to select Scheduling Profiles, the Scheduling Profiles themselves, & logic to interpret the result.
- **Scheduling Cycle** - A single run of a Scheduler through the Scheduling Framework.
**Contributor:**

Scheduling Cycle - A single run of a Scheduler *Profile*

- **Plugin** - Implementation of framework-defined interface(s) to add or extend logic across the framework.

## Proposal

The Scheduling System draws inspiration from the kube-scheduler's pluggable system, though there are distinct differences in goals/usage.

The Scheduling System can loosely be defined into 3 sections:
- A *framework* to implement the system
- The *interfaces* that a consumer can use to extend the system
- A *configuration API* to define the Scheduler, Profile(s), & the plugins used within those profiles

A sketch of the System, with extension points is here:
<img src="./images/scheduler_subsystem.svg" alt="Scheduling Algorithm" width="1000" />

Describing the interface extension points & flow is the simplest way to convey the intent of what the framework should enable:

### PreSchedule
**Contributor:**

ProfileSelect? ProfilePicker? SelectProfiles?
This should be a descriptive name; see the last diagram I uploaded in the comments.


PreSchedule is the entry point into the scheduling cycle (called by the framework). PreSchedule selects profiles conditionally based on:

- Request data
- Results
**Contributor:**

Results of what?
Results of previously executed SchedulingProfile cycles

- Cycle State

PreSchedule will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call. Only a single PreSchedule function may be defined per scheduler.
**Contributor:**

replace continuously with iteratively.

**Contributor:**

Q: should it also return some context for running the profiles? For example, can profiles be run in parallel or not, should the scheduler call into the plugin or not, etc.
We can either make a statement about behavior (e.g., always run sequentially in the order returned, call repeatedly until no profiles are returned), or leave it for user control.

**Collaborator (author):**

> Q: should it also return some context for running the profiles? For example, can profiles be run in parallel or not, should the scheduler call into the plugin or not, etc.

Can update the readme to clarify, but the expectation would be that any profiles returned in a single PreSchedule call would be able to be run in parallel by the framework.

I originally had a bool in the function signature to tell the framework that PreSchedule should be called again. I removed it with the implicit assumption that PreSchedule would be called until no profiles were returned, but I actually think I prefer the bool method; it is more explicit & prevents an unnecessary final run of PreSchedule, which leaves that to user control, as you mention.

**Contributor:**

> can update the readme to clarify, but the expectation would be that any profiles returned in a single PreSchedule call would be able to be run in parallel by the framework.

I think this would lead to using one profile per call.

> but I actually think I prefer the bool method, it is more explicit & prevents an unnecessary final run of PreSchedule

I prefer the explicit option as well.
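To make the iterative PreSchedule contract concrete, here is a minimal, self-contained sketch of the driver loop the framework might run. All types and the toy selection logic are simplified stand-ins (not the actual framework API): the framework keeps calling PreSchedule with the results accumulated so far, runs every profile returned, and stops once no profiles come back.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the framework types.
type Endpoint struct {
	Name  string
	Score float64
}

type Profile struct {
	Name string
}

// runCycles sketches the iterative contract: call preSchedule with the
// accumulated results, run each returned profile, repeat until no profiles
// are returned.
func runCycles(
	preSchedule func(results map[string][]Endpoint) []Profile,
	runProfile func(p Profile) []Endpoint,
) map[string][]Endpoint {
	results := map[string][]Endpoint{}
	for {
		profiles := preSchedule(results)
		if len(profiles) == 0 {
			return results // no more profiles selected: scheduling is done
		}
		for _, p := range profiles {
			results[p.Name] = runProfile(p)
		}
	}
}

func main() {
	// Toy selection logic: run "prefill" first, then "decode", then stop.
	preSchedule := func(results map[string][]Endpoint) []Profile {
		if _, done := results["prefill"]; !done {
			return []Profile{{Name: "prefill"}}
		}
		if _, done := results["decode"]; !done {
			return []Profile{{Name: "decode"}}
		}
		return nil
	}
	runProfile := func(p Profile) []Endpoint {
		return []Endpoint{{Name: p.Name + "-pod-0", Score: 1.0}}
	}
	results := runCycles(preSchedule, runProfile)
	fmt.Println(len(results), results["decode"][0].Name) // 2 decode-pod-0
}
```

The bool-returning variant discussed above would simply replace the "empty slice means stop" convention with an explicit continue flag in the PreSchedule signature.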


### Profile Cycle

The profile cycle consists of 3 defined functions: `Filter`, `Score`, & `Pick`.

*Profile Constraints*
- A profile can have any number of `Filter` plugins registered (including zero)
- A profile can have any number of `Score` plugins registered (including zero)
- A profile MUST have exactly one `Pick` plugin registered


#### Filter
Filter runs before any scoring, and removes endpoints that are not fit for selection. The framework will return an error to the client if the endpoints are filtered to zero.
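As an illustration only, a filter along the lines of the `has-required-accelerator` name in the example config might look like this (simplified, self-contained types; the label key/value are made up):

```go
package main

import "fmt"

// Simplified stand-in for the framework's endpoint type.
type Endpoint struct {
	Name   string
	Labels map[string]string
	Score  float64
}

// filterByLabel keeps only endpoints carrying a required label — the shape of
// logic a Filter plugin might implement. Returning an empty slice is the case
// where the framework would surface an error to the client.
func filterByLabel(endpoints []Endpoint, key, value string) []Endpoint {
	kept := make([]Endpoint, 0, len(endpoints))
	for _, e := range endpoints {
		if e.Labels[key] == value {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	pool := []Endpoint{
		{Name: "pod-a", Labels: map[string]string{"accelerator": "gpu"}},
		{Name: "pod-b", Labels: map[string]string{"accelerator": "tpu"}},
	}
	kept := filterByLabel(pool, "accelerator", "gpu")
	fmt.Println(len(kept), kept[0].Name) // 1 pod-a
}
```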

#### Score
Score applies a score to each remaining endpoint provided. Scorers SHOULD keep their score values in a normalized range: [0-1]. Any weighting should be added at the SchedulingProfile configuration level.
**Contributor:**

Separate issue: weights and range.

Is it possible to define a scoring function that evaluates to (prefix coverage − load)?

  • can you have negative weights?
  • do higher scores signify a "better fit" of the Pod ?
  • any guidance with respect to weights (i.e., relative "strength" for scorers)?

I'd appreciate pointers to discussion/decisions relating to scoring.

**Collaborator (author):**

Great questions.

Many of these would be up to user implementation, since they can also implement the Pick() interface. But, as designed, I would say:

> can you have negative weights?

I would advise against negative weights, and instead use fractional scores (we would need to change weights to a float type to match).

> do higher scores signify a "better fit" of the Pod?

The default picker would expect a higher score to imply better fit, yes.

> any guidance with respect to weights (i.e., relative "strength" for scorers)?

Right, weights would be the way for a scheduling algo to give greater importance to certain scorers over others. Which is why normalized score values from the Scorer implementation are expected; leave it to scheduler-specific opinion to set the weight (level of significance) of a given scorer.

**Contributor:**

In kube-scheduler we simply used a range from 0-100, the higher the better the fit. The score evaluates the different backends from the perspective of a single scorer. Weights is the knob to determine how scores across plugins compare, this separation allows score plugins to implement their scoring algorithm independently from each other. Weights is a tunable parameter for the platform admin, but we need to offer defaults that work well out of the box (higher weights to prefix scores for example)
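One plausible way the framework could combine normalized scorer outputs with profile-level weights — a sketch under the assumptions above (scorers in [0,1], integer weights); the actual aggregation is not specified in this proposal:

```go
package main

import "fmt"

// weightedScore combines per-scorer normalized scores ([0,1]) for a single
// endpoint using profile-level integer weights: scorers stay normalized, and
// relative importance comes only from the weights. Dividing by the weight sum
// keeps the combined score within [0,1].
func weightedScore(scores map[string]float64, weights map[string]int) float64 {
	var total, weightSum float64
	for name, w := range weights {
		total += scores[name] * float64(w)
		weightSum += float64(w)
	}
	if weightSum == 0 {
		return 0 // no scorers registered: nothing to rank on
	}
	return total / weightSum
}

func main() {
	// Scorer names mirror the example config; values are invented.
	scores := map[string]float64{"prefix-cache": 1.0, "kv-cache-util": 0.5}
	weights := map[string]int{"prefix-cache": 3, "kv-cache-util": 5}
	fmt.Println(weightedScore(scores, weights)) // (3*1.0 + 5*0.5) / 8 = 0.6875
}
```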


#### Pick
Picker selects the endpoint(s) from the provided list of scored endpoints. Picker MUST return one endpoint at minimum.
**Contributor:**

What is the expected behavior (or - are there any assumptions) when more than one is returned?
This should be guidance to Picker implementations as well as the scheduler's exit plugins (running on the result of all cycles).

**Collaborator (author):**

Yeah, so that would be based on the PostSchedule/ResultAggregation implementation. We can document more of the expectations of the default implementation (i.e. they would be treated like fallback endpoints). But that might be documentation around that specific implementation, and not necessarily this interface. But open to other opinions here.
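A sketch of a "best-score"-style picker that always returns at least one endpoint, ordered best-first so any extras can serve as fallbacks (simplified types; the fallback interpretation belongs to the default PostSchedule implementation discussed above, not to this interface):

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified stand-in for the framework's endpoint type.
type Endpoint struct {
	Name  string
	Score float64
}

// pickBest returns the n highest-scored endpoints, best-first. It enforces
// the "MUST return one endpoint at minimum" rule by clamping n to at least 1.
func pickBest(endpoints []Endpoint, n int) []Endpoint {
	sorted := append([]Endpoint(nil), endpoints...) // copy; don't mutate input
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Score > sorted[j].Score })
	if n < 1 {
		n = 1 // a Picker MUST return at least one endpoint
	}
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}

func main() {
	pool := []Endpoint{{"a", 0.2}, {"b", 0.9}, {"c", 0.5}}
	picked := pickBest(pool, 2)
	fmt.Println(picked[0].Name, picked[1].Name) // b c
}
```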



### PostSchedule
**Contributor:**

Rename to a descriptive name, for example ProcessProfileResults (as specified in the diagram in my last comment).

PostSchedule receives the output of the result(s) of the scheduling cycle(s) and makes sense of the data to be consumed by the calling system.

**Contributor:**

I added in the diagram in my last comment a "true" PostSchedule that happens after the scheduler returns, per Abdullah's comment in one of the open conversations.

We can merge the two, OR we can keep two extension points: one, MultiProfilePlugin's ProcessProfileResults, which can merge ProfileResults, change the selection of a specific cycle to a more globally optimized result, decide to only log a specific cycle, etc.; and the PostSchedule (outside of the scheduler) that handles putting the headers in the request (or other custom logic). I'm not sure it should be the same plugin.

### PostResponse
**Contributor:**

This does not feel like a natural Scheduling plugin.
Most if not all of scheduling deals with a request, not a response (which happens post scheduling and executing the scheduling decision). Nothing prevents a module from registering plugin for the scheduler as well as other system extension points outside the scheduler.

**Collaborator (author):**

I agree with you. I really don't like including this in the scope of the scheduling system.

But based on feedback, there is a desire for per-plugin datastore management as discussed in other comments. I would be very happy to remove this, but that gets back into the fact that doing so may highly couple the data system to the scheduling system. Perhaps that is unavoidable.

PostResponse is a special-case extension that can optionally be implemented by a plugin that needs to augment its state based on response or request data. This should only be implemented for plugins that need to update state outside of the scheduling cycle. PostResponse is run at the time of processing a response.
Comment on lines +73 to +74
**Contributor:**

we should specify that this is outside of the scheduler.
it's not clear from the text as this doc is focused on Scheduler subsystem.


## ConfigurationAPI
TODO
@@ -0,0 +1,34 @@
# names are egregiously long, but attempting to describe custom logic within a name
profileSelection: disagg-token-length
schedulingResult: log-shadowbox-label-pd-result
profiles:
  prefill:
    preschedule:
      - decode-prefix-cache-check
    filter:
      - is-prefill
      - has-required-accelerator
    score:
      - prefix-cache: 3
      - latency-scorer: 2
    selection:
      - best-score
    postschedule:
      - log-full-scores
  decode:
    filter:
      - is-decode
    score:
      - prefix-cache: 3
      - kv-cache-util: 5
    selection:
      - random-top-3
  shadowbox-decode:
    filter:
      - is-decode
      - is-tpu
    score:
      - prefix-cache-v2: 4
      - kv-cache-util: 1
    selection:
      - random-top-3
images/scheduler_subsystem.svg (binary file not shown)
@@ -0,0 +1,103 @@
/*
Copyright 2025 The Kubernetes Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package framework

import (
	"context"

	scheduling "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types"
)

// READER NOTE: Currently CycleState is assumed to have appropriate request data rather than making a new object.
**Contributor:**

as far as I understood, both @ahg-g and myself agree this should be passed as an argument and not in cycle state.


// Plugin is the parent type for all the scheduling framework plugins.
type Plugin interface {
	Name() string
}

type Endpoint struct {
	State EndpointState
	Score float64
}

type EndpointState struct {
	// storage is per Scheduling Cycle, and so has no thread-safe concerns.
**Contributor:**

thread-safe as long as we run all filters/scorers/picker sequentially on the same goroutine.

**Collaborator (author):**

++, should make a note of that, that's how I would expect the framework to implement the scheduling cycle.

**Contributor:**

> ... that's how I would expect the framework to implement the scheduling cycle

Agree within a single cycle, but what happens when more than one profile is returned?

By returning more than one profile at a time you had specified that they can run in parallel. Nothing prevents a plugin instance (such as prefix) from being used on more than one cycle (e.g., it is used on both P and D). I would prefer that we opt for sequential run-to-completion of each profile cycle; when multiple are returned, they should execute sequentially in the order returned. I don't think there's much to gain from parallel runs.

**Contributor:**

Agreed, I don't think there is a case that requires running profiles in "parallel". My preference is to return a single profile and a flag to indicate whether or not to pick and run another profile.

	storage map[string]any
}

type SchedulingResult struct {
	results map[string][]Endpoint
}
Comment on lines +42 to +44
**Contributor:**

why do we need to wrap a map with a struct?
what are the downsides of returning a map?


// Scheduler is the implementation of a... scheduler.
// The scheduler object is created at startup using the provided configuration.
type Scheduler interface {
	// PreSchedule selects scheduling profiles through the implemented
	// logic, and returns:
	// - profiles - A subset of the registered scheduling profiles to be run
	PreSchedule(request map[string]any, data scheduling.CycleState, results map[string][]Endpoint) map[string]SchedulingProfile

	// PostSchedule receives the output of the result(s) of the scheduling cycle(s)
	// and makes sense of the data to be consumed by the calling system.
	// For example: suppose you have 2 profiles, ShadowBoxing Profile & Production Profile.
	// PostSchedule would know to simply log the result of the ShadowBoxing
	// profile, and do nothing else with it.
	PostSchedule(profileResults map[string][]Endpoint) SchedulingResult
}
Comment on lines +46 to +60
**Contributor:**

point (1) - I'd like to suggest using a more descriptive name. Additionally, this should be a plugin.
point (2) - I think it would be good to use a result struct rather than []Endpoint. we might find out that we need more than just an array of endpoints (e.g., maybe we need array of equally best endpoints and additional array of backup endpoints with very close lower scores?)

Suggested change (replacing the `Scheduler` interface above):

type Result struct {
	Endpoints []Endpoint // set of selected endpoints
}

// MultiProfilePlugin defines the interface for handling multi profile scheduling.
type MultiProfilePlugin interface {
	Plugin
	// SelectProfiles selects the SchedulingProfiles to run while taking into consideration the request properties
	// and the previously executed SchedulerProfile cycles along with their results.
	SelectProfiles(request map[string]any, data scheduling.CycleState, executionResults map[string]*types.Result) map[string]*SchedulerProfile
	// ProcessProfileResults handles the outcome of each selected profile.
	// It may aggregate results, log test profile outputs, or apply custom logic.
	ProcessProfileResults(request map[string]any, results map[string]*types.Result) map[string]*types.Result
}


// SchedulingProfile is used to describe a profile that will
// run for a given scheduling cycle.
type SchedulingProfile struct {
	// Name of the profile.
	Name string
	// Filters lists all Filter plugins associated with this Profile. Filters
	// are optional.
	Filters []Filter
	// Scorers lists all Score plugins associated with this Profile. Scorers
	// are optional.
	Scorers map[Scorer]int
	// Picker returns the function that picks the endpoint(s). Picker is required.
	Picker Picker
}

// Filter runs before any scoring, and removes endpoints that are not fit for
// selection. The framework will return an error to the client if the endpoints
// are filtered to zero.
type Filter interface {
	Plugin
	Filter(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
}

// Scorer applies a score to each remaining endpoint provided. Scorers SHOULD
// keep their score values in a normalized range: [0-1]. Any weighting should
// be added at the SchedulingProfile configuration level.
type Scorer interface {
	Plugin
	Score(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
}

// Picker selects the endpoint(s) from the provided list of scored endpoints.
// Picker MUST return one endpoint at minimum.
type Picker interface {
	Plugin
	Pick(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
}

type PostResponse interface {
	Plugin
	PostResponse(ctx context.Context, request map[string]any, response map[string]any)
}
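To illustrate how these pieces compose, here is a self-contained toy that mirrors the Filter → Score (weighted) → Pick flow of a SchedulingProfile. `context.Context` and `CycleState` are dropped to keep the sketch runnable, and every plugin implementation here is hypothetical:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Simplified, self-contained versions of the interfaces in this file.
type Endpoint struct {
	Name  string
	Score float64
}

type Filter interface{ Filter(eps []Endpoint) []Endpoint }
type Scorer interface{ Score(eps []Endpoint) []float64 }
type Picker interface{ Pick(eps []Endpoint) []Endpoint }

// prefixFilter keeps endpoints whose name has a given prefix (a toy filter).
type prefixFilter struct{ prefix string }

func (f prefixFilter) Filter(eps []Endpoint) []Endpoint {
	kept := []Endpoint{}
	for _, e := range eps {
		if strings.HasPrefix(e.Name, f.prefix) {
			kept = append(kept, e)
		}
	}
	return kept
}

// nameLengthScorer scores shorter names higher, normalized to (0,1] (a toy scorer).
type nameLengthScorer struct{}

func (nameLengthScorer) Score(eps []Endpoint) []float64 {
	scores := make([]float64, len(eps))
	for i, e := range eps {
		scores[i] = 1.0 / float64(len(e.Name))
	}
	return scores
}

// bestScorePicker returns the single highest-scored endpoint.
type bestScorePicker struct{}

func (bestScorePicker) Pick(eps []Endpoint) []Endpoint {
	sorted := append([]Endpoint(nil), eps...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Score > sorted[j].Score })
	return sorted[:1]
}

// Profile mirrors SchedulingProfile: filters, weighted scorers, one picker.
type Profile struct {
	Filters []Filter
	Scorers map[Scorer]int
	Picker  Picker
}

// Run applies filters, accumulates weighted scores, then picks.
func (p Profile) Run(eps []Endpoint) ([]Endpoint, error) {
	for _, f := range p.Filters {
		eps = f.Filter(eps)
	}
	if len(eps) == 0 {
		return nil, fmt.Errorf("all endpoints filtered out")
	}
	for s, w := range p.Scorers {
		for i, score := range s.Score(eps) {
			eps[i].Score += score * float64(w)
		}
	}
	return p.Picker.Pick(eps), nil
}

func main() {
	profile := Profile{
		Filters: []Filter{prefixFilter{prefix: "decode-"}},
		Scorers: map[Scorer]int{nameLengthScorer{}: 2},
		Picker:  bestScorePicker{},
	}
	pool := []Endpoint{{Name: "decode-a"}, {Name: "decode-longer"}, {Name: "prefill-a"}}
	picked, err := profile.Run(pool)
	if err != nil {
		panic(err)
	}
	fmt.Println(picked[0].Name) // decode-a (shortest matching name)
}
```

Note how weights live on the Profile (`map[Scorer]int`), not in the scorers, matching the proposal's separation of normalized scores from profile-level weighting.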