diff --git a/docs/proposals/0683-epp-architecture-proposal/README.md b/docs/proposals/0683-epp-architecture-proposal/README.md
index 48c7720fb..7bd688c73 100644
--- a/docs/proposals/0683-epp-architecture-proposal/README.md
+++ b/docs/proposals/0683-epp-architecture-proposal/README.md
@@ -1,30 +1,9 @@
-# Gateway API Inference Extension
+# EPP Architecture Proposal
 Author(s): @kfswain
 ## Proposal Status
  ***Draft***
-## Table of Contents
-
-
-
-- [Summary](#summary)
-- [Goals](#goals)
-- [Non-Goals](#non-goals)
-- [Proposal](#proposal)
- - [Personas](#personas)
- - [Inference Platform Admin](#inference-platform-admin)
- - [Inference Workload Owner](#workload-owner)
- - [Axioms](#axioms)
- - [InferencePool](#inferencepool)
- - [InferenceModel](#inferencemodel)
- - [Spec](#spec)
- - [Diagrams](#diagrams)
- - [Alternatives](#alternatives)
-- [Open Questions](#open-questions)
-
-
-
 ## Summary
 This proposal seeks to standardize the implementation of an EPP (End-point Picker) for the Inference Gateway extension (also known as Gateway API Inference Extension). Additionally, this proposes to restructure the current implementation of the EPP to be more modular, and approachable.
@@ -86,11 +65,7 @@ Due to the possibility of this becoming a bit of a dumping ground. The API will
 The flow controller will consume resource regime data, and enforce proper resource sharing between workloads. This will primarily be done through a queuing mechanism [as described here](https://docs.google.com/document/d/1VZL7opFWuwgWquvgiOzLlXAJ633qZ9U-A0ZixGjBgaI/edit?usp=sharing).
-#### Scheduling Layer
-
-As the Scheduling Layer is the final interface to the entirety of the pool, all configuration will be at the _pool_ level. The default scheduling layer will be an experimentally-backed LB algorithm, with exposed config values.
-The Scheduler will define a strong interface API, so that new scheduling algos may be plugged & dark-launched to test in production traffic without impacting said traffic.
Extension is expected to adhere to the [Scheduler Subsystem definition](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/603)
 ### `Non-extensible`
diff --git a/docs/proposals/0845-scheduler-architecture-proposal/README.md b/docs/proposals/0845-scheduler-architecture-proposal/README.md
new file mode 100644
index 000000000..33ba82e9d
--- /dev/null
+++ b/docs/proposals/0845-scheduler-architecture-proposal/README.md
@@ -0,0 +1,77 @@
+# Scheduling Subsystem Architecture
+
+Author(s): @kfswain, @ahg-g, @nirrozenbaum
+## Proposal Status
+ ***Draft***
+
+## Summary
+The Scheduling Subsystem is a framework used to implement scheduling algorithms. High level definition [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/006-scheduler) & EPP Architecture [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).
+
+## Design Principles
+- The scheduler framework should act as an independent library; there should be no dependency on EPP packages defined outside of the scheduler
+- The *framework* should be agnostic to web protocols (such as HTTP), endpoint types (such as model servers), and K8s concepts.
+  - Opinions should be held by the plugins, not the framework
+- The entry & exit points should be defined by the framework, acting as the API surface of the system
+- Multiple scheduling 'profiles' should be able to be run for a single request.
+  - They can be conditionally dependent on previous runs, or run in parallel
+- Plugin state is managed by the plugin itself
+
+## Definitions
+- **Scheduling Framework** - The system created to allow for a pluggable scheduling algorithm.
+- **Scheduling Profile** - A named, specific set of Filter(s), Scorer(s), & Picker used to select endpoints.
+- **Scheduler** - An extensible implementation of a scheduling algorithm, including logic to select Scheduling Profiles, the Scheduling Profiles themselves, & logic to interpret the result.
+- **Scheduling Cycle** - A single run of a Scheduler through the Scheduling Framework.
+- **Plugin** - Implementation of framework-defined interface(s) to add or extend logic across the framework.
+
+## Proposal
+
+The Scheduling System draws inspiration from the kube-scheduler's pluggable system, though there are distinct differences in goals/usage.
+
+The Scheduling System can loosely be divided into 3 sections:
+- A *framework* to implement the system
+- The *interfaces* that a consumer can use to extend the system
+- A *configuration API* to define the Scheduler, Profile(s), & the plugins used within those profiles
+
+A sketch of the System, with extension points, is here:
+Scheduling Algorithm
+
+Describing the interface extension points & flow is the simplest way to convey the intent of what the framework should enable:
+
+### PreSchedule
+
+PreSchedule is the entry point into the scheduling cycle (called by the framework). PreSchedule selects profiles conditionally based on:
+
+- Request data
+- Results
+- Cycle State
+
+PreSchedule will be continuously called so long as profiles are returned; multiple profiles may be returned in a single call. Only a single PreSchedule function may be defined per scheduler.
+
+### Profile Cycle
+
+The profile cycle consists of 3 defined functions: `Filter`, `Score`, & `Pick`
+
+*Profile Constraints*
+- A profile can have any number of `Filter` plugins registered (including zero)
+- A profile can have any number of `Score` plugins registered (including zero)
+- A profile MUST have exactly one `Pick` plugin registered
+
+
+#### Filter
+Filter runs before any scoring, and removes endpoints that are not fit for selection. The framework will return an error to the client if the endpoints are filtered to zero.
+
+#### Score
+Score applies a score to each remaining endpoint provided. Scorers SHOULD keep their score values in a normalized range: [0-1]. Any weighting should be added at the SchedulingProfile configuration level.
+
+#### Pick
+Picker selects the endpoint(s) from the provided list of scored endpoints. Picker MUST return at least one endpoint.
+
+
+### PostSchedule
+PostSchedule receives the output of the result(s) of the scheduling cycle(s) and makes sense of the data to be consumed by the calling system.
+
+### PostResponse
+PostResponse is a special case extension that can optionally be implemented by a plugin that needs to augment its state based on response or request data. This should only be implemented for plugins that need to update state outside of the scheduling cycle. PostResponse is run at the time of processing a response.
+
+## ConfigurationAPI
+TODO
\ No newline at end of file
diff --git a/docs/proposals/0845-scheduler-architecture-proposal/examples/example.yaml b/docs/proposals/0845-scheduler-architecture-proposal/examples/example.yaml
new file mode 100644
index 000000000..06725a981
--- /dev/null
+++ b/docs/proposals/0845-scheduler-architecture-proposal/examples/example.yaml
@@ -0,0 +1,34 @@
+# names are egregiously long, but attempting to describe custom logic within a name
+profileSelection: disagg-token-length
+schedulingResult: log-shadowbox-label-pd-result
+profiles:
+  prefill:
+    preschedule:
+      - decode-prefix-cache-check
+    filter:
+      - is-prefill
+      - has-required-accelerator
+    score:
+      - prefix-cache: 3
+      - latency-scorer: 2
+    selection:
+      - best-score
+    postschedule:
+      - log-full-scores
+  decode:
+    filter:
+      - is-decode
+    score:
+      - prefix-cache: 3
+      - kv-cache-util: 5
+    selection:
+      - random-top-3
+  shadowbox-decode:
+    filter:
+      - is-decode
+      - is-tpu
+    score:
+      - prefix-cache-v2: 4
+      - kv-cache-util: 1
+    selection:
+      - random-top-3
diff --git a/docs/proposals/0845-scheduler-architecture-proposal/images/scheduler_subsystem.svg b/docs/proposals/0845-scheduler-architecture-proposal/images/scheduler_subsystem.svg
new file mode 100644
index 000000000..3186c1695
--- /dev/null
+++ b/docs/proposals/0845-scheduler-architecture-proposal/images/scheduler_subsystem.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/docs/proposals/0845-scheduler-architecture-proposal/interfaces/interface.go b/docs/proposals/0845-scheduler-architecture-proposal/interfaces/interface.go
new file mode 100644
index 000000000..3adae83de
--- /dev/null
+++ b/docs/proposals/0845-scheduler-architecture-proposal/interfaces/interface.go
@@ -0,0 +1,103 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package framework
+
+import (
+	"context"
+
+	scheduling "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types"
+)
+
+// READER NOTE: Currently CycleState is assumed to have appropriate request data rather than making a new object.
+
+// Plugin is the parent type for all the scheduling framework plugins.
+type Plugin interface {
+	Name() string
+}
+
+type Endpoint struct {
+	State EndpointState
+	Score float64
+}
+
+type EndpointState struct {
+	// storage is per Scheduling Cycle, and so has no thread-safety concerns.
+	storage map[string]any
+}
+
+type SchedulingResult struct {
+	results map[string][]Endpoint
+}
+
+// Scheduler is the implementation of a... scheduler.
+// The scheduler object is created at startup using the provided configuration.
+type Scheduler interface {
+	// PreSchedule selects scheduling profiles through the implemented
+	// logic, and returns:
+	// - profiles - A subset of the registered scheduling profiles to be run
+	PreSchedule(request map[string]any, data scheduling.CycleState, results map[string][]Endpoint) map[string]SchedulingProfile
+
+	// PostSchedule receives the output of the result(s) of the scheduling cycle(s)
+	// and makes sense of the data to be consumed by the calling system.
+	// For example: suppose you have 2 profiles: ShadowBoxing Profile & Production Profile.
+	// PostSchedule would know to simply log the result of ShadowBoxing
+	// profile, and do nothing else with it.
+	PostSchedule(profileResults map[string][]Endpoint) SchedulingResult
+}
+
+// SchedulingProfile is used to describe a profile that will
+// run for a given scheduling cycle.
+type SchedulingProfile struct {
+	// Name of the profile.
+	Name string
+	// Filters lists all Filter plugins associated with this Profile. Filters
+	// are optional.
+	Filters []Filter
+	// Scorers lists all Score plugins associated with this Profile. Scorers
+	// are optional.
+	Scorers map[Scorer]int
+	// Picker is the plugin that picks the endpoint(s). Picker is required.
+	Picker Picker
+}
+
+// Filter runs before any scoring, and removes endpoints that are not fit for
+// selection. The framework will return an error to the client if the endpoints
+// are filtered to zero.
+type Filter interface {
+	Plugin
+	Filter(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
+}
+
+// Scorer applies a score to each remaining endpoint provided. Scorers SHOULD
+// keep their score values in a normalized range: [0-1]. Any weighting should
+// be added at the SchedulingProfile configuration level.
+type Scorer interface {
+	Plugin
+	Score(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
+}
+
+// Picker selects the endpoint(s) from the provided list of scored endpoints.
+// Picker MUST return at least one endpoint.
+type Picker interface {
+	Plugin
+	Pick(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
+}
+
+type PostResponse interface {
+	Plugin
+	PostResponse(ctx context.Context, request map[string]any, response map[string]any)
+}