
Refactor scheduler to make it more readable #645


Merged: 1 commit into kubernetes-sigs:main on Apr 7, 2025

Conversation

@liu-cong (Contributor) commented Apr 3, 2025

This refactor does the following:

  1. Explicitly define ActiveModels and WaitingModels, which map to the running and waiting adapters metrics. Previously we combined both as ActiveModels; this change makes the distinction clear.
  2. Create a scheduling.Context object that holds the contextual info for a request scheduling cycle. This has two benefits: a) any contextual info can be added to this struct instead of modifying the scheduler interface, making the scheduler easier to extend (we will soon need this for the prefix caching implementation); b) it creates a snapshot of the pod metrics for the scheduling cycle, which reduces concurrent access to the shared datastore and provides a consistent view of the pods and metrics during scheduling. This also makes debugging easier.
  3. Created a simpleFilter, which pairs a human-readable filter name with the filter function, making the filter chain composition much cleaner (see the sketch after this list).
  4. This is the only logic change. Previously we had two slightly different LoRA affinity filters: one is considered the "main" LoRA affinity filter, and the other is only used when all pods are significantly queued up (>128 waiting requests). As far as I can tell, there was no clear reason to keep both; the only reason was that the "main" one was added later to account for long queues. In this PR I consolidated on the "main" LoRA affinity filter, and benchmark results showed the refactored version performing almost identically to HEAD.
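
For readers skimming the diff, here is a minimal sketch of the shapes described in items 2 and 3. The type and field names are illustrative and may not match the PR exactly; LLMRequest and backendmetrics.PodMetrics are the existing types referenced in the diff.

```go
// Illustrative sketch only; names may differ from the actual PR.

// schedulingContext holds the contextual info for one scheduling cycle. New
// context (e.g. for prefix caching) can be added here without changing the
// scheduler interface. podsSnapshot is captured once per cycle so scheduling
// works against a consistent view instead of hitting the shared datastore.
type schedulingContext struct {
	req          *LLMRequest
	podsSnapshot []backendmetrics.PodMetrics
}

// simpleFilter pairs a human-readable name with a filter function, which keeps
// filter chain composition and debug logging clean.
type simpleFilter struct {
	name   string
	filter func(ctx *schedulingContext, pods []backendmetrics.PodMetrics) ([]backendmetrics.PodMetrics, error)
}
```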

Benchmarks

I ran benchmarks using the LPG tool, with 10 LoRA adapters sharing the same traffic and 8 vLLM replicas running on H100 with max-lora=4.

  • LPG reported data between baseline and refactor: (image)

  • EPP metrics and vLLM metrics between the baseline and refactor:

    • baseline: (images)

    • refactor: (images)

@k8s-ci-robot requested review from ahg-g and danehans on April 3, 2025 05:27
@k8s-ci-robot added the cncf-cla: yes and size/XL labels on Apr 3, 2025
@liu-cong (Contributor Author) commented Apr 3, 2025

cc @kaushikmitr @kfswain


netlify bot commented Apr 3, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 87162b4
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67f067432baed800080c0429
😎 Deploy Preview: https://deploy-preview-645--gateway-api-inference-extension.netlify.app

@liu-cong force-pushed the refactor branch 2 times, most recently from a5035ff to df2cfb5 on April 3, 2025 05:33
logger.V(logutil.DEBUG).Info(fmt.Sprintf("Scheduling a request. Metrics: %+v", sCtx.podsSnapshot))

var filter Filter
if req.Critical {
@liu-cong (Contributor Author):
It's much cleaner this way to show how we handle critical vs. sheddable requests differently than the previous filter structure did.
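
Roughly, the branch shown in the diff picks a different filter chain per criticality; the chain variable names below are placeholders for illustration, not quoted from the PR:

```go
// Placeholder names for illustration; the PR's actual filter chains may differ.
var filter Filter
if req.Critical {
	// Critical requests go through a chain that never sheds them.
	filter = criticalRequestFilterChain
} else {
	// Sheddable requests may be dropped when no pod has capacity.
	filter = sheddableRequestFilterChain
}
```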


if _, exists := pod.GetMetrics().ActiveModels[req.ResolvedTargetModel]; exists {
if active || waiting {
@liu-cong (Contributor Author):
This is where we consider both active and waiting adapters as LoRA affinity. The refactor just makes this explicit.
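
A minimal sketch of that check, assuming the WaitingModels map described in the PR summary (exact names may differ; affinityPods is a hypothetical accumulator):

```go
// A pod counts as having LoRA affinity if the requested adapter is either
// already running (ActiveModels) or queued to load (WaitingModels).
_, active := pod.GetMetrics().ActiveModels[req.ResolvedTargetModel]
_, waiting := pod.GetMetrics().WaitingModels[req.ResolvedTargetModel]
if active || waiting {
	affinityPods = append(affinityPods, pod) // hypothetical accumulator for illustration
}
```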

// spreading the load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to
// a single pod. This gave good performance in our initial benchmarking results in the scenario
// where # of lora slots > # of lora adapters.
func lowLoRACostPredicate(req *LLMRequest, pod backendmetrics.PodMetrics) bool {
@liu-cong (Contributor Author):
This is the original LoRA affinity filter. It considered a pod with an active LoRA adapter equally with a pod that has space to load a new LoRA.

Later on we introduced the loRASoftAffinityFilter below, which generally favors pods with active LoRAs, but with a small probability chooses a pod with space to load a new LoRA, to avoid hot spots.

The new affinity filter is considered better than the original one, and there is no clear reason to keep the original.
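
A minimal sketch of that soft-affinity behavior (the probability constant and helper name are made up for illustration; the real filter uses its own constant):

```go
// Sketch only: mostly prefer pods that already have the adapter active, but
// occasionally pick a pod with free LoRA slots so load can spread and new pods
// can warm up. Requires "math/rand".
const illustrativeAffinityProbability = 0.999

func pickLoRACandidates(affinityPods, availablePods []backendmetrics.PodMetrics) []backendmetrics.PodMetrics {
	if len(affinityPods) == 0 {
		return availablePods
	}
	if len(availablePods) > 0 && rand.Float64() > illustrativeAffinityProbability {
		return availablePods // small chance: route to a pod that can load the adapter
	}
	return affinityPods
}
```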

@kfswain (Collaborator) commented Apr 3, 2025

Ack! Will take a look later today, this looks awesome at a glance. Thanks!

@k8s-ci-robot added the size/XXL label and removed the size/XL label on Apr 4, 2025
@kaushikmitr (Contributor) commented:

If I read the first chart correctly the QPS goes all the way up to 3000? 😮

@liu-cong (Contributor Author) commented Apr 4, 2025

> If I read the first chart correctly the QPS goes all the way up to 3000? 😮

Not the real QPS. I set the QPS to 300, with 10 adapters, so that's how the 3000 is calculated. However, the LPG tool isn't able to actually send that high a QPS. If you look at the InferenceModel metrics on EPP, the real QPS was about 30 requests/s for each adapter.

@nirrozenbaum (Contributor) left a comment:

adding my two cents -
I think the current code is mixing the filter and scorer terms.
A filter should be defined as a hard requirement: if it cannot be met, the request is dropped.
A scorer should be defined as a soft requirement that we can use to score/rank pods.
With the above terminology - if a filter fails (ends up with 0 pods), the request should fail.

additionally, I think the current impl that uses a "decision tree" is very hard to follow and makes it very hard to understand what the pod selection logic is. I would flatten this structure and have a slice of filters and a slice of scorers (which could be defined differently based on the request and other properties, e.g., criticality).

the way I see it (using the above terminology) -
for each request, a set of filters should run sequentially (order doesn't matter because each is just a boolean condition).
if we're left with pods that can serve the request, we need to rank them by running scorers (each scorer is independent, so scorers can run in parallel). then we can calculate a weighted score for each pod based on the scorer configuration (the weight of each scorer).
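
As an illustrative sketch of that filter-then-score flow (all names below are invented for the sketch, not taken from the existing codebase):

```go
// Sketch of the proposed flow: hard filters first, then a weighted sum of
// independent scorer results per pod. Requires "errors" and "math".
type scorer struct {
	weight float64
	score  func(req *LLMRequest, pod backendmetrics.PodMetrics) float64
}

func selectPod(req *LLMRequest, pods []backendmetrics.PodMetrics,
	filters []func(*LLMRequest, backendmetrics.PodMetrics) bool, scorers []scorer) (backendmetrics.PodMetrics, error) {
	// Filters are hard requirements: drop any pod that fails one of them.
	var candidates []backendmetrics.PodMetrics
	for _, pod := range pods {
		ok := true
		for _, f := range filters {
			if !f(req, pod) {
				ok = false
				break
			}
		}
		if ok {
			candidates = append(candidates, pod)
		}
	}
	// If no pod meets the hard requirements, the request fails.
	if len(candidates) == 0 {
		var zero backendmetrics.PodMetrics
		return zero, errors.New("no pod can serve the request")
	}
	// Scorers are soft requirements: pick the pod with the highest weighted score.
	best, bestScore := candidates[0], math.Inf(-1)
	for _, pod := range candidates {
		total := 0.0
		for _, s := range scorers {
			total += s.weight * s.score(req, pod)
		}
		if total > bestScore {
			best, bestScore = pod, total
		}
	}
	return best, nil
}
```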

another important point I would consider - the current code makes it very hard for someone to fork the code and change the filters. we need to make that super easy to do by design.

@kfswain this is part of the discussion we started in the last community call and we probably need to continue.

@ahg-g (Contributor) commented Apr 6, 2025

I agree we need to move to a better structure that potentially mimics the pluggable framework of kube-scheduler, but I believe this is a good initial refactor; it is a net positive. Leaving lgtm to others.

/approve

@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, liu-cong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Apr 6, 2025
@nirrozenbaum (Contributor) commented:

my comments were in general about the existing structure and not about this PR.
this PR is an improvement compared to the existing code.

agree with you, and I'm leaving the "better structuring" discussion out of the scope of this PR.

/lgtm

@k8s-ci-robot added the lgtm label on Apr 7, 2025
@k8s-ci-robot merged commit 264ee45 into kubernetes-sigs:main on Apr 7, 2025
8 checks passed
@kfswain (Collaborator) commented Apr 7, 2025

> additionally, I think the current impl that uses "decision tree" is very hard to follow and makes it very hard to understand what is the pod selection logic. I would flatten this structure and have a slice of filters and a slice of scorers (could be defined differently based on the request and other properties, e.g., criticality).

+1, the current implementation is rough to mentally follow. I've been thinking about how we can separate this logic.

Because our tooling has some gateway-like properties (like InferenceModel routing) and some LB-like properties (our scheduler), I think we can break this up.

We would have a Routing layer that handles any routing needs such as LoRA selection, prompt injection, and maybe even checks that reject a request before it ever needs to be scheduled (and that can be easily extended by a user), and then a Scheduling layer that just scores endpoints (I don't know if we need to do any filtering; perhaps I'm wrong, but I think we can indirectly filter via scoring), with that interface defined strongly.

We will also be introducing queuing at the EPP layer, which makes the logic even cleaner:

  1. Routing layer: modifies the request and validates that the request should even be sent, with an interface that makes extension simple
  2. Queuing mechanism: acts as flow control to the scheduler
  3. Scheduler: scoring mechanism to select an endpoint, also with a strong interface that allows custom implementations to be easily plugged in (a rough interface sketch follows below)
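
A very rough interface sketch of that layering (names invented for discussion, not the project's actual API):

```go
// Hypothetical layering, for discussion only.

// Router mutates the request (e.g. LoRA selection, prompt injection) and may
// reject it before it is ever scheduled.
type Router interface {
	Route(ctx context.Context, req *LLMRequest) (*LLMRequest, error)
}

// Queue provides flow control between routing and scheduling.
type Queue interface {
	Enqueue(ctx context.Context, req *LLMRequest) error
	Dequeue(ctx context.Context) (*LLMRequest, error)
}

// Scheduler scores endpoints and picks one; custom implementations plug in here.
type Scheduler interface {
	Schedule(ctx context.Context, req *LLMRequest, pods []backendmetrics.PodMetrics) (backendmetrics.PodMetrics, error)
}
```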

I'll have more to present at the Thursday meeting. I'll get something written up / a diagram / etc.

@liu-cong (Contributor Author) commented Apr 7, 2025

Agree that we should simplify the filter tree, and a promising approach is to have a chain of filter and score plugins like the kube scheduler.

I did some initial experiment with changing the filters to "scorers" in https://github.com/liu-cong/llm-instance-gateway/tree/score

My initial benchmark results didn't perform as well as the current implementation on LoRA affinity, though with some tuning we can perhaps do better. Will keep experimenting once I get some free cycles.

kfswain pushed 3 commits to kfswain/llm-instance-gateway that referenced this pull request on Apr 29, 2025