
Refactor scheduler to make it more readable #645


Merged: 1 commit into kubernetes-sigs:main on Apr 7, 2025

Conversation

@liu-cong (Contributor) commented Apr 3, 2025

This refactor does the following:

  1. Explicitly define ActiveModels and WaitingModels, which map to the running and waiting adapters metrics. Previously we combined both as ActiveModels; this change makes the distinction clear.
  2. Create a scheduling.Context object that holds the contextual info for a request scheduling cycle. This has two benefits: a) any contextual info can be added to this struct instead of modifying the scheduler interface, making the scheduler easier to extend (we will soon need this for the prefix caching implementation); b) it creates a snapshot of the pod metrics for the scheduling cycle, which reduces concurrent access to the shared datastore and provides a consistent view of the pods and metrics during scheduling. This also makes debugging easier.
  3. Created a simpleFilter, which pairs a human-readable filter name with the filter function, making the filter chain composition much cleaner (see the sketch after this list).
  4. This is the only logic change. Previously we had two slightly different LoRA affinity filters: one is considered the "main" LoRA affinity filter, and the other is only used when all pods are significantly queued up (>128 waiting requests). As far as I can tell, there was no clear reason to keep both; the only reason was that the "main" one was added later to account for long queues. In this PR I consolidated on the "main" LoRA affinity filter, and benchmark results showed the refactored version performing almost identically to HEAD.
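
For readers skimming the diff, here is a minimal sketch of the shapes described in items 2 and 3. The type and field names are illustrative and may not match the PR exactly; LLMRequest and backendmetrics.PodMetrics are the existing types referenced in the diff.

```go
// Illustrative sketch only; names may differ from the actual PR.

// schedulingContext holds the contextual info for one scheduling cycle. New
// context (e.g. for prefix caching) can be added here without changing the
// scheduler interface. podsSnapshot is captured once per cycle so scheduling
// works against a consistent view instead of hitting the shared datastore.
type schedulingContext struct {
	req          *LLMRequest
	podsSnapshot []backendmetrics.PodMetrics
}

// simpleFilter pairs a human-readable name with a filter function, which keeps
// filter chain composition and debug logging clean.
type simpleFilter struct {
	name   string
	filter func(ctx *schedulingContext, pods []backendmetrics.PodMetrics) ([]backendmetrics.PodMetrics, error)
}
```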

Benchmarks

I ran benchmarks using the LPG tool, with 10 LoRA adapters sharing the same traffic and 8 vLLM replicas running on H100 with max-lora=4.

  • LPG reported data between baseline and refactor: (image)

  • EPP metrics and vLLM metrics between the baseline and refactor:

    • baseline: (images)

    • refactor: (images)

@k8s-ci-robot requested review from ahg-g and danehans on April 3, 2025 05:27
@k8s-ci-robot added the cncf-cla: yes and size/XL labels on Apr 3, 2025
@liu-cong (Contributor Author) commented Apr 3, 2025

cc @kaushikmitr @kfswain


netlify bot commented Apr 3, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 87162b4
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67f067432baed800080c0429
😎 Deploy Preview: https://deploy-preview-645--gateway-api-inference-extension.netlify.app

@liu-cong force-pushed the refactor branch 2 times, most recently from a5035ff to df2cfb5 on April 3, 2025 05:33
logger.V(logutil.DEBUG).Info(fmt.Sprintf("Scheduling a request. Metrics: %+v", sCtx.podsSnapshot))

var filter Filter
if req.Critical {
@liu-cong (Contributor Author):
It's much cleaner this way to show how we handle critical vs. sheddable requests differently than the previous filter structure did.
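
Roughly, the branch shown in the diff picks a different filter chain per criticality; the chain variable names below are placeholders for illustration, not quoted from the PR:

```go
// Placeholder names for illustration; the PR's actual filter chains may differ.
var filter Filter
if req.Critical {
	// Critical requests go through a chain that never sheds them.
	filter = criticalRequestFilterChain
} else {
	// Sheddable requests may be dropped when no pod has capacity.
	filter = sheddableRequestFilterChain
}
```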


if _, exists := pod.GetMetrics().ActiveModels[req.ResolvedTargetModel]; exists {
if active || waiting {
@liu-cong (Contributor Author):
This is where we consider both active and waiting adapters as LoRA affinity. The refactor just makes this explicit.
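
A minimal sketch of that check, assuming the WaitingModels map described in the PR summary (exact names may differ; affinityPods is a hypothetical accumulator):

```go
// A pod counts as having LoRA affinity if the requested adapter is either
// already running (ActiveModels) or queued to load (WaitingModels).
_, active := pod.GetMetrics().ActiveModels[req.ResolvedTargetModel]
_, waiting := pod.GetMetrics().WaitingModels[req.ResolvedTargetModel]
if active || waiting {
	affinityPods = append(affinityPods, pod) // hypothetical accumulator for illustration
}
```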

// spreading the load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to
// a single pod. This gave good performance in our initial benchmarking results in the scenario
// where # of lora slots > # of lora adapters.
func lowLoRACostPredicate(req *LLMRequest, pod backendmetrics.PodMetrics) bool {
@liu-cong (Contributor Author):
This is the original LoRA affinity filter. It considered a pod with an active LoRA adapter equally with a pod that has space to load a new LoRA.

Later on we introduced the loRASoftAffinityFilter below, which generally favors pods with active LoRAs, but with a small probability chooses a pod with space to load a new LoRA, to avoid hot spots.

The new affinity filter is considered better than the original one, and there is no clear reason to keep the original.
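
A minimal sketch of that soft-affinity behavior (the probability constant and helper name are made up for illustration; the real filter uses its own constant):

```go
// Sketch only: mostly prefer pods that already have the adapter active, but
// occasionally pick a pod with free LoRA slots so load can spread and new pods
// can warm up. Requires "math/rand".
const illustrativeAffinityProbability = 0.999

func pickLoRACandidates(affinityPods, availablePods []backendmetrics.PodMetrics) []backendmetrics.PodMetrics {
	if len(affinityPods) == 0 {
		return availablePods
	}
	if len(availablePods) > 0 && rand.Float64() > illustrativeAffinityProbability {
		return availablePods // small chance: route to a pod that can load the adapter
	}
	return affinityPods
}
```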

@kfswain (Collaborator) commented Apr 3, 2025

Ack! Will take a look later today, this looks awesome at a glance. Thanks!

@k8s-ci-robot added the size/XXL label and removed the size/XL label on Apr 4, 2025
@kaushikmitr (Contributor) commented:

If I read the first chart correctly the QPS goes all the way up to 3000? 😮

@liu-cong (Contributor Author) commented Apr 4, 2025

> If I read the first chart correctly the QPS goes all the way up to 3000? 😮

Not the real QPS. I set the QPS to 300, with 10 adapters, so that's how the 3000 is calculated. However, the LPG tool isn't able to actually send that high a QPS. If you look at the InferenceModel metrics on EPP, the real QPS was about 30 requests/s for each adapter.

@nirrozenbaum (Contributor) left a comment:

adding my two cents -
I think the current code is mixing the filter and scorer terms.
A filter should be defined as a hard requirement: if it cannot be met, the request is dropped.
A scorer should be defined as a soft requirement that we can use to score/rank pods.
With the above terminology - if a filter fails (ends up with 0 pods), the request should fail.

additionally, I think the current impl that uses a "decision tree" is very hard to follow and makes it very hard to understand what the pod selection logic is. I would flatten this structure and have a slice of filters and a slice of scorers (which could be defined differently based on the request and other properties, e.g., criticality).

the way I see it (using the above terminology) -
for each request, a set of filters should run sequentially (order doesn't matter because each is just a boolean condition).
if we're left with pods that can serve the request, we need to rank them by running scorers (each scorer is independent, so scorers can run in parallel). then we can calculate a weighted score for each pod based on the scorer configuration (the weight of each scorer).
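
As an illustrative sketch of that filter-then-score flow (all names below are invented for the sketch, not taken from the existing codebase):

```go
// Sketch of the proposed flow: hard filters first, then a weighted sum of
// independent scorer results per pod. Requires "errors" and "math".
type scorer struct {
	weight float64
	score  func(req *LLMRequest, pod backendmetrics.PodMetrics) float64
}

func selectPod(req *LLMRequest, pods []backendmetrics.PodMetrics,
	filters []func(*LLMRequest, backendmetrics.PodMetrics) bool, scorers []scorer) (backendmetrics.PodMetrics, error) {
	// Filters are hard requirements: drop any pod that fails one of them.
	var candidates []backendmetrics.PodMetrics
	for _, pod := range pods {
		ok := true
		for _, f := range filters {
			if !f(req, pod) {
				ok = false
				break
			}
		}
		if ok {
			candidates = append(candidates, pod)
		}
	}
	// If no pod meets the hard requirements, the request fails.
	if len(candidates) == 0 {
		var zero backendmetrics.PodMetrics
		return zero, errors.New("no pod can serve the request")
	}
	// Scorers are soft requirements: pick the pod with the highest weighted score.
	best, bestScore := candidates[0], math.Inf(-1)
	for _, pod := range candidates {
		total := 0.0
		for _, s := range scorers {
			total += s.weight * s.score(req, pod)
		}
		if total > bestScore {
			best, bestScore = pod, total
		}
	}
	return best, nil
}
```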

another important point I would consider - the current code makes it very hard for someone to fork the code and change the filters. we need to make that super easy to do by design.

@kfswain this is part of the discussion we started in the last community call and we probably need to continue.

@ahg-g (Contributor) commented Apr 6, 2025

I agree we need to move to a better structure that potentially mimics the pluggable framework of kube-scheduler, but I believe this is a good initial refactor; it is a net positive. Leaving lgtm to others.

/approve

@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Apr 6, 2025
@nirrozenbaum (Contributor) commented:

my comments were in general about the existing structure and not about this PR.
this PR is an improvement compared to the existing code.

agree with you, and I'm leaving the "better structuring" discussion out of the scope of this PR.

/lgtm

@k8s-ci-robot added the lgtm label on Apr 7, 2025
@k8s-ci-robot merged commit 264ee45 into kubernetes-sigs:main on Apr 7, 2025
8 checks passed
@kfswain (Collaborator) commented Apr 7, 2025

> additionally, I think the current impl that uses "decision tree" is very hard to follow and makes it very hard to understand what is the pod selection logic. I would flatten this structure and have a slice of filters and a slice of scorers (could be defined differently based on the request and other properties, e.g., criticality).

+1, the current implementation is rough to mentally follow. I've been thinking about how we can separate this logic.

Because our tooling has some gateway-like properties (like InferenceModel routing) and some LB-like properties (our scheduler), I think we can break this up.

We would have a Routing layer that handles any routing needs such as LoRA selection, prompt injection, and maybe even checks that reject a request before it ever needs to be scheduled (and that can be easily extended by a user), and then a Scheduling layer that just scores endpoints (I don't know if we need to do any filtering; perhaps I'm wrong, but I think we can indirectly filter via scoring), with that interface defined strongly.

We will also be introducing queuing at the EPP layer, which makes the logic even cleaner:

  1. Routing layer: modifies the request and validates that the request should even be sent, with an interface that makes extension simple
  2. Queuing mechanism: acts as flow control to the scheduler
  3. Scheduler: scoring mechanism to select an endpoint, also with a strong interface that allows custom implementations to be easily plugged in (a rough interface sketch follows below)
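
A very rough interface sketch of that layering (names invented for discussion, not the project's actual API):

```go
// Hypothetical layering, for discussion only.

// Router mutates the request (e.g. LoRA selection, prompt injection) and may
// reject it before it is ever scheduled.
type Router interface {
	Route(ctx context.Context, req *LLMRequest) (*LLMRequest, error)
}

// Queue provides flow control between routing and scheduling.
type Queue interface {
	Enqueue(ctx context.Context, req *LLMRequest) error
	Dequeue(ctx context.Context) (*LLMRequest, error)
}

// Scheduler scores endpoints and picks one; custom implementations plug in here.
type Scheduler interface {
	Schedule(ctx context.Context, req *LLMRequest, pods []backendmetrics.PodMetrics) (backendmetrics.PodMetrics, error)
}
```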

I'll have more to present at the Thursday meeting. I'll get something written up / a diagram / etc.

@liu-cong (Contributor Author) commented Apr 7, 2025

Agree that we should simplify the filter tree, and a promising approach is to have a chain of filter and score plugins like the kube scheduler.

I did some initial experiment with changing the filters to "scorers" in https://github.com/liu-cong/llm-instance-gateway/tree/score

My initial benchmark results didn't perform as well as the current implementation on LoRA affinity, though with some tuning we can perhaps do better. Will keep experimenting once I get some free cycles.

kfswain pushed 3 commits to kfswain/llm-instance-gateway that referenced this pull request on Apr 29, 2025