Refactor scheduler to make it more readable #645
Conversation
logger.V(logutil.DEBUG).Info(fmt.Sprintf("Scheduling a request. Metrics: %+v", sCtx.podsSnapshot))

var filter Filter
if req.Critical {
It's much cleaner this way to show how we handle critical vs. sheddable requests differently than with the previous filter.
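To illustrate the pattern the branch makes explicit, here is a minimal sketch with hypothetical names (not the PR's actual filter types): requests are routed to one of two filter chains up front based on criticality.

```go
package scheduling

// Request and Filter are minimal stand-ins for the PR's real types.
type Request struct{ Critical bool }
type Filter func(req Request, pods []string) []string

// chooseFilter makes the critical/sheddable split explicit: the critical
// chain always keeps at least one pod, while the sheddable chain may return
// zero pods, in which case the request is dropped under load.
func chooseFilter(req Request, criticalChain, sheddableChain Filter) Filter {
	if req.Critical {
		return criticalChain
	}
	return sheddableChain
}
```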
if _, exists := pod.GetMetrics().ActiveModels[req.ResolvedTargetModel]; exists {
if active || waiting {
This is where we consider both active and waiting adapters as LoRA affinity. This refactor just makes it super clear.
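A rough sketch of that check, assuming the split ActiveModels/WaitingModels maps from the PR description (field names are approximations, not the exact code):

```go
package scheduling

// podMetrics is a minimal stand-in; the refactor splits the running and
// waiting adapters into two maps instead of merging them into ActiveModels.
type podMetrics struct {
	ActiveModels  map[string]int
	WaitingModels map[string]int
}

// hasLoRAAffinity treats an adapter that is either already loaded or queued
// to load as "affine" to the pod.
func hasLoRAAffinity(m podMetrics, adapter string) bool {
	_, active := m.ActiveModels[adapter]
	_, waiting := m.WaitingModels[adapter]
	return active || waiting
}
```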
// spreading the load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to
// a single pod. This gave good performance in our initial benchmarking results in the scenario
// where # of lora slots > # of lora adapters.
func lowLoRACostPredicate(req *LLMRequest, pod backendmetrics.PodMetrics) bool { |
This is the original LoRA affinity filter. It considered a pod with an active LoRA adapter equally with a pod that has space to load a new LoRA.
Later on we introduced the loRASoftAffinityFilter below, which generally favors pods with active LoRAs but, with some probability, chooses a pod with space to load a new LoRA, to avoid hot spots.
The new affinity filter is considered better than the original one, so there is no clear reason to keep the original.
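For illustration, a minimal sketch of that soft-affinity idea; the threshold value and names here are made up and this is not the actual loRASoftAffinityFilter implementation:

```go
package scheduling

import "math/rand"

// loraAffinityThreshold is a hypothetical knob: with this probability we
// prefer pods that already have the adapter active or waiting; otherwise we
// pick from pods that merely have a free LoRA slot, to avoid hot-spotting.
const loraAffinityThreshold = 0.8

// pickCandidates returns the pool of pods to choose from for one request.
func pickCandidates(affine, hasFreeSlot []string) []string {
	if len(affine) > 0 && len(hasFreeSlot) > 0 {
		if rand.Float64() < loraAffinityThreshold {
			return affine
		}
		return hasFreeSlot
	}
	if len(affine) > 0 {
		return affine
	}
	return hasFreeSlot
}
```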
Ack! Will take a look later today, this looks awesome at a glance. Thanks!
If I read the first chart correctly, the QPS goes all the way up to 3000? 😮
Not the real QPS ... I set the QPS to 300, with 10 adapters, so that's how the 3000 is calculated. However, the LPG tool isn't able to send that high a QPS. If you look at the ...
Adding my two cents - I think the current code is mixing the `filter` and `scorer` terms.
A filter should be defined as a hard requirement: if it cannot be met, the request is dropped.
A scorer should be defined as a soft requirement that we can use to score/rank pods.
With the above terminology, if a filter fails (ends up with 0 pods), the request should fail.
Additionally, I think the current implementation that uses a "decision tree" is very hard to follow and makes it very hard to understand the pod selection logic. I would flatten this structure and have a slice of filters and a slice of scorers (which could be defined differently based on the request and other properties, e.g., criticality).
The way I see it (using the above terminology): for each request, a set of filters runs sequentially (the order carries no meaning since each is just a boolean condition).
If we're left with pods that can serve the request, we need to rank them by running scorers (each scorer is independent, so scorers can run in parallel), then calculate a weighted score for each pod based on the scorer configuration (the weight of each) - see the sketch below.
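For example, the flattened flow could look roughly like this (a hedged sketch with hypothetical types and weights, not a proposal for the exact API):

```go
package scheduling

// Pod stands in for a pod plus its metrics snapshot.
type Pod struct{ Name string }

// Request carries whatever properties filters and scorers need.
type Request struct{ Critical bool }

// Filter is a hard requirement: pods that fail it are removed, and an empty
// result means the request cannot be served.
type Filter func(req Request, p Pod) bool

// Scorer is a soft requirement: higher is better, combined by weight.
type Scorer struct {
	Weight float64
	Score  func(req Request, p Pod) float64
}

func weightedScore(req Request, p Pod, scorers []Scorer) float64 {
	total := 0.0
	for _, s := range scorers {
		total += s.Weight * s.Score(req, p)
	}
	return total
}

// schedule runs all filters sequentially, fails if nothing is left, then
// picks the pod with the highest weighted score.
func schedule(req Request, pods []Pod, filters []Filter, scorers []Scorer) (Pod, bool) {
	candidates := pods
	for _, f := range filters {
		var kept []Pod
		for _, p := range candidates {
			if f(req, p) {
				kept = append(kept, p)
			}
		}
		candidates = kept
	}
	if len(candidates) == 0 {
		return Pod{}, false // a failed filter chain means the request is dropped
	}
	best, bestScore := candidates[0], weightedScore(req, candidates[0], scorers)
	for _, p := range candidates[1:] {
		if s := weightedScore(req, p, scorers); s > bestScore {
			best, bestScore = p, s
		}
	}
	return best, true
}
```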
Another important point I would consider: the current code makes it very hard for someone to fork the code and change filters. We need to make that super easy to do by design.
@kfswain this is part of the discussion we started in the last community call and we probably need to continue.
I agree we need to move to a better structure that potentially mimics the pluggable framework of kube-scheduler, but I believe this is a good initial refactor; it is net positive. Leaving lgtm to others. /approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahg-g, liu-cong. The full list of commands accepted by this bot can be found here. The pull request process is described here.
My comments were in general about the existing structure and not about this PR. I agree with you and am leaving the "better structuring" discussion out of the scope of this PR. /lgtm
+1, the current implementation is rough to mentally follow. I've been thinking about how we can separate this logic. B/C our tooling has some gateway-like properties (like ...), we have a ... We will also be introducing queuing at the EPP layer, so that makes the logic even more clean:
I'll have more to present at the Thursday meeting. I'll get something written up/diagrammed/etc.
Agree that we should simplify the filter tree, and a promising approach is to have a chain of ... I did some initial experiments with changing the filters to "scorers" in https://github.com/liu-cong/llm-instance-gateway/tree/score. My initial benchmark results didn't perform as well as the current implementation on LoRA affinity, though with some tuning we can perhaps perform better. Will keep experimenting once I get some free cycles.
This refactor does the following:

- Separate `ActiveModels` and `WaitingModels`, which map to the running and waiting adapters metrics. Previously we combined both as `ActiveModels`. This change makes it clear.
- Introduce a `scheduling.Context` object that holds the contextual info during a request scheduling cycle. This has 2 benefits: a) Any contextual info can be added to this struct instead of modifying the scheduler interface, making the scheduler easier to extend (we will soon need this for the prefix caching implementation). b) Create a snapshot of the pod metrics for the scheduling cycle, to reduce concurrent access to the shared datastore and to provide a consistent view of the pods and metrics during scheduling. This makes debugging easier.
- Introduce a `simpleFilter` which contains the user-readable filter name plus the filter function, making the filter chain composition much cleaner (see the sketch below).
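A rough sketch of the shapes described above, using hypothetical field names (the actual types in the PR may differ):

```go
package scheduling

// LLMRequest and PodMetrics are trimmed placeholders so the sketch stands alone.
type LLMRequest struct {
	ResolvedTargetModel string
	Critical            bool
}

type PodMetrics struct {
	ActiveModels  map[string]int // adapters currently loaded
	WaitingModels map[string]int // adapters queued to be loaded
}

// Context carries per-cycle state so the scheduler interface does not have to
// change whenever new contextual info is needed (e.g. prefix cache state later
// on). The pod metrics are snapshotted once, giving the whole cycle a
// consistent view and avoiding repeated reads of the shared datastore.
type Context struct {
	req          *LLMRequest
	podsSnapshot []PodMetrics
}

// simpleFilter pairs a human-readable name with the filter function, which
// keeps composed filter chains readable and easy to log.
type simpleFilter struct {
	name   string
	filter func(ctx *Context, pods []PodMetrics) []PodMetrics
}
```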
Benchmarks
I ran benchmarks using the LPG tool, with 10 LoRA adapters sharing the same traffic, and 8 vLLM replicas running on H100 with max-lora=4.
EPP metrics and vLLM metrics between the baseline and refactor
Baseline
(EPP and vLLM metrics screenshots)
Refactor

