Weighted scorers #737

Merged
5 commits merged into kubernetes-sigs:main from the weighted-scorers branch on Apr 28, 2025

Conversation

@nirrozenbaum nirrozenbaum commented Apr 24, 2025

This PR continues the refactoring of the scheduler and introduces new functionality.
At a very high level:

  • The Scorer interface changed slightly: the Score function now receives all pods instead of being called for each pod one by one. This was raised as a requirement from IBM for scorers that use an external service rather than in-memory calculations (e.g., sending a request to Redis to calculate KV cache). There is a big benefit in sending all pods in one batch rather than one by one. Additionally, it fits the scheduling flow naturally:
filter([]pod) returns []pod // filtered pods
scorer([]pod) returns map[pod]float // a map from a pod to its score
picker(map[pod]float) returns the picked pod

so every step in the flow uses the output of the previous step.

  • Weighted scorers were added. Every scorer is initialized with a weight; the weighted score of a scorer is calculated by multiplying its score by (scorer_weight / sum_of_all_scorer_weights). See the sketch after this list.

  • Unit tests were added to cover all these changes.

  • Continued handling some comments from the Refactor scheduler to run plugins #677 review, such as removing noopPlugin.
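
A minimal sketch of the flow and the weighting described above, using simplified stand-in types rather than the actual interfaces in pkg/epp/scheduling (names and signatures here are illustrative assumptions):

package scheduling

// Pod is a simplified stand-in for the scheduler's pod snapshot type.
type Pod struct{ Name string }

// Filter narrows down the candidate pods.
type Filter interface {
    Filter(pods []Pod) []Pod
}

// Scorer scores all candidate pods in a single call, so scorers backed by
// an external service (e.g. Redis for KV-cache data) can batch one request.
type Scorer interface {
    Score(pods []Pod) map[Pod]float64
}

// Picker chooses the target pod from the weighted scores.
type Picker interface {
    Pick(scores map[Pod]float64) Pod
}

// weightedScores accumulates each scorer's result, scaled by
// scorer_weight / sum_of_all_scorer_weights.
func weightedScores(pods []Pod, scorers map[Scorer]int) map[Pod]float64 {
    sum := 0
    for _, w := range scorers {
        sum += w
    }
    weighted := make(map[Pod]float64, len(pods))
    for _, pod := range pods {
        weighted[pod] = 0 // every candidate keeps an entry even with zero scorers
    }
    for scorer, weight := range scorers {
        for pod, score := range scorer.Score(pods) {
            weighted[pod] += score * float64(weight) / float64(sum)
        }
    }
    return weighted
}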

Out of scope of this PR, to be added in follow-up PRs:

  • Scorer scores have to be normalized. There is a TODO in the scheduler file to implement the NormalizeScores function. Until we have Cong's prefix-aware scorers, the scorers are empty, so this has no implications.
  • The refactoring needs to continue, e.g., moving each filter to its own file. The whole idea of this refactoring is extensibility at build time; for extensibility and flexibility we need to be able to isolate each filter.
  • pkg/epp/scheduling/config/config.go has to be removed. These are values read from env vars for specific filters, which cannot work if one wants to change plugins at build time. Instead, each filter should read the relevant information it needs in its NewFilter creation function, or alternatively the values should be read by the NewFilter caller and passed as arguments.
    It definitely shouldn't be mandatory, as filters/scorers may change. A sketch of this approach follows this list.
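
A minimal sketch of that last point, with a filter that reads the value it needs in its own constructor; the env var name, default value, and constructor signature are assumptions for illustration:

package filters

import (
    "os"
    "strconv"
)

// LowQueueFilter keeps pods whose queue depth is below a threshold.
type LowQueueFilter struct {
    threshold int
}

// NewLowQueueFilter reads its own configuration at construction time
// (with a fallback default), so no shared config.go is needed and forks
// can swap filters without touching a central env-var file.
func NewLowQueueFilter() *LowQueueFilter {
    threshold := 128 // illustrative default
    if v, ok := os.LookupEnv("LOW_QUEUE_FILTER_THRESHOLD"); ok {
        if parsed, err := strconv.Atoi(v); err == nil {
            threshold = parsed
        }
    }
    return &LowQueueFilter{threshold: threshold}
}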

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 24, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 24, 2025

netlify bot commented Apr 24, 2025

Deploy Preview for gateway-api-inference-extension ready!

Latest commit: 6e68563
Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/680dcdfaa1fb0b0008fa4a8f
Deploy Preview: https://deploy-preview-737--gateway-api-inference-extension.netlify.app

// When the scheduler is initialized with NewScheduler function, this config will be used as default.
// it's possible to call NewSchedulerWithConfig to pass a different argument.

// For build time plugins changes, it's recommended to change the defaultConfig variable in this file.
Contributor Author

@nirrozenbaum nirrozenbaum Apr 24, 2025

PAY ATTENTION.
When someone wants to change the GIE default filters/scorers/picker/etc. by forking the repo, the flow is the following:

  • fork the repo.
  • add their own plugins in new files.
  • change ONLY THIS FILE to include their set of plugins using the defaultConfig variable.
  • that's it. Build the EPP.

The idea behind this small file is to let such users keep their fork synced without drifting from upstream.
Since they have only this single-file change, they can sync the rest of the code to get bug fixes and such.
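
A rough sketch of what that single-file change could look like; the config fields and plugin constructors here are placeholders, not the actual GIE API:

package scheduling

// Simplified stand-ins for the plugin interfaces defined elsewhere.
type (
    Filter interface{ Name() string }
    Scorer interface{ Name() string }
    Picker interface{ Name() string }
)

// SchedulerConfig is a simplified stand-in for the real config struct.
type SchedulerConfig struct {
    filters []Filter
    scorers map[Scorer]int // scorer -> weight
    picker  Picker
}

// defaultConfig is the single variable a fork edits to plug in its own
// filters/scorers/picker; the rest of the code stays in sync with upstream.
var defaultConfig = &SchedulerConfig{
    filters: []Filter{myFilter{}},
    scorers: map[Scorer]int{myScorer{}: 1},
    picker:  myPicker{},
}

// Hypothetical fork-local plugins living in their own new files.
type (
    myFilter struct{}
    myScorer struct{}
    myPicker struct{}
)

func (myFilter) Name() string { return "my-filter" }
func (myScorer) Name() string { return "my-scorer" }
func (myPicker) Name() string { return "my-picker" }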

Contributor

Why not manage two separate files:
one for the SchedulerConfig definition, which would not be changed in a personal fork (since it could be changed upstream),
and another file which defines the config instance with all required filters/scorers/picker/...

Contributor Author

If SchedulerConfig has changed upstream (e.g., includes new fields), that means the Scheduler itself changed upstream (because the scheduler uses the fields from the config).
In such a case it's not possible to keep the previous config in a personal fork and still use the scheduler from upstream.

@nirrozenbaum
Contributor Author

CC @kfswain @liu-cong

type PodMetrics struct {
score float64
Contributor

I would keep the score here. It's clean, extensible, and also makes sorting pods based on score easier than a map from pods to scores.

Contributor Author

@nirrozenbaum nirrozenbaum Apr 25, 2025

score is not a property of the PodMetrics snapshot, but a result of calling the Score function.
PodMetrics is a snapshot of metrics at a given point in time and should be a read-only interface that helps plugins with their calculations during the scheduling cycle (e.g., a filter/scorer may use metrics).

We may add a new, different struct, ScoredPod, with the fields Pod and Score, and use that instead of the map, which would allow the easy sorting:

type ScoredPod struct {
    Pod types.Pod
    Score float64
}

LMKWYT

Contributor

Sg, I prefer []ScoredPod over the map, since it makes both random picking and sorting based on score easier. Another option is to have a ScoredPods type which contains both a list and a map, if the map is needed for any reason in the future.

Contributor Author

I added the ScoredPod struct and updated the code to use it.
Please note that the Score function still defines the return value as map[pod]float, which is exactly the definition of a scorer: giving a score to each of the candidate pods.
This map is also very useful for accumulating the weighted score per pod in a fast way (hash lookups).

Before passing the pods with their weighted scores to the picker, this is converted into []ScoredPod.

The way I see it, the picker needs to be able to iterate over the pods AND to sort pods based on score.
There is no use for both a slice and a map at this point, but if that becomes relevant we can add it.
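
A small sketch of that conversion, assuming the scheduler ends up with a map of accumulated weighted scores (types here are simplified stand-ins):

package scheduling

import "sort"

// Pod and ScoredPod are simplified stand-ins for the real types.
type Pod struct{ Name string }

type ScoredPod struct {
    Pod   Pod
    Score float64
}

// toScoredPods flattens the weighted-score map into a slice the picker
// can iterate, here sorted descending by score.
func toScoredPods(weighted map[Pod]float64) []ScoredPod {
    scored := make([]ScoredPod, 0, len(weighted))
    for pod, score := range weighted {
        scored = append(scored, ScoredPod{Pod: pod, Score: score})
    }
    sort.Slice(scored, func(i, j int) bool {
        return scored[i].Score > scored[j].Score
    })
    return scored
}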

@@ -106,21 +112,21 @@ func (s *Scheduler) Schedule(ctx context.Context, req *types.LLMRequest) (*types
// 1. Reduce concurrent access to the datastore.
// 2. Ensure consistent data during the scheduling operation of a request.
sCtx := types.NewSchedulingContext(ctx, req, types.ToSchedulerPodMetrics(s.datastore.PodGetAll()))
loggerDebug.Info(fmt.Sprintf("Scheduling a request. Metrics: %+v", sCtx.PodsSnapshot))
loggerDebug.Info(fmt.Sprintf("Scheduling a request, Metrics: %+v", sCtx.PodsSnapshot))

s.runPreSchedulePlugins(sCtx)

pods := s.runFilterPlugins(sCtx)
if len(pods) == 0 {
return nil, errutil.Error{Code: errutil.InferencePoolResourceExhausted, Msg: "failed to find a target pod"}
Collaborator

I don't think it's in scope of this PR, but we should think about changing this error code. There's no guarantee that resources are exhausted when there are no pods, just that the filtering for the specific request came up with nothing. I worry that this sends incorrect signals to the reader.

Contributor

@liu-cong liu-cong Apr 25, 2025

This is a regression from b24f948, I believe. Previously this error was returned by the sheddableRequestFilter, which knows this is a resource-exhausted error.

I would argue for keeping the error returns from the Filters and Scorers, even if we don't need them today. This is more future-proof, and addresses issues like this one. @nirrozenbaum, what do you think about bringing the errors back?
Not asking to do it in this PR; it can be a follow-up.

Contributor Author

@nirrozenbaum nirrozenbaum Apr 27, 2025

As of today, we have the following filters in main:
LowQueueFilter, LoRAAffinityFilter, LeastQueueFilter, LeastKVCacheFilter, HasCapacityFilter. In addition to those, we're currently implementing a few more filters and scorers at IBM that will be added to GIE when ready.
None of the above uses errors.
I think this should give us some hints about the usage of errors in filters/scorers/picker.

I'm in favor of being flexible, but I think we should be use-case driven and not try to be future-proof; otherwise we may end up implementing things that are not used and/or not relevant.

For example, we changed the Score function to get all pods. This is flexible because it allows batching pods to an external scorer while at the same time allowing each pod to be scored separately.
But this change was use-case driven: it is required since we are developing a KVCache scorer at IBM which uses Redis as an external service, and @kfswain also showed in his py-go that it's useful.

Contributor

DropRequest is a use case for returning an error. How do we solve that without an error?
My point about being future-proof is that it will take a lot of refactoring effort to add it back later, once we have many more plugins.

Contributor Author

We started this discussion on #677. I'll put my interpretation of this question here:
From my understanding, DropRequestFilter == FilterAllPods (please correct me if I'm wrong).
In other words, this is a filter whose result is "no pods passed this filter".
This is not unique to this filter and can happen in any filter (the scheduler is extensible now; new filters might be written and used).

First point: we can argue whether a filter whose result is an empty list is an error. I claim there was no error in the filter itself, as the filter worked correctly and filtered according to its definition. The result is that we cannot serve the request, but this is not an error with the filter (the scheduler returns an error if len(filtered_pods) == 0).

Second point, and I'll refer to the code as it was in #677 (since it has changed since then): this error was used in two places in the code:

To summarize, IMO the DropRequest that was previously used is not a good example of error usage, since this is not an error. It was used (I think incorrectly) to mark that no pods were left. For this check, we can easily check the length of the returned filtered pods.

Contributor

You have a lot of great points!

If a filter results in 0 pods, this will effectively lead to an error for the scheduler. While it's debatable whether it's an error from the filter perspective, we need to communicate this to the scheduler. Currently the scheduler simply returns a ResourceExhausted error, which seems too broad. IMO the filter should communicate a reason why it filtered out all pods.

How about adding another "filterReason" return value (perhaps an enum), if an error is too strong, so the scheduler can interpret the empty pod list and return a proper error to consumers?

Contributor Author

Yes, I agree that ResourceExhausted is not the right error to use here and that it should be changed.
We're in an intermediate phase and I'm trying to break the scheduler refactoring down into smaller PRs, to allow easier reviews and merges. I will surely change this in the next PR.
I tend to say this discussion is only about filters. Once there is at least 1 pod left after the filter part, I really don't think we should abort a scheduling cycle if there is a temporary issue with one of the scorers; that scorer can return a zero value for all candidate pods until the issue is resolved (and write the issue to the log). This is very different from the kube-scheduler framework in this aspect.

"filterReason" is another word for error :) if there is no error we won't use the reason.
So if we decide on adding such a thing, it's better to use an error, which can be wrapped in upper-layer errors.

Can you give an example of a "filterReason" you can think of?
I think writing in the returned error which filter is the one that filtered out all pods is useful. This is what I planned for my next PR (maybe I can push it in a separate PR today).
But what reason could there be other than "the candidate pods don't meet the filter criteria"?
At the end, it's a boolean condition.
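
One possible shape for that direction (the scheduler reporting which filter eliminated the last candidates); this is only a sketch of the idea discussed here, not code from the PR:

package scheduling

import "fmt"

// Pod and Filter are simplified stand-ins for the real types.
type Pod struct{ Name string }

type Filter interface {
    Name() string
    Filter(pods []Pod) []Pod
}

// runFilters applies the filters in order; when a filter leaves zero
// candidates, the returned error names the responsible filter instead of
// a generic ResourceExhausted.
func runFilters(filters []Filter, pods []Pod) ([]Pod, error) {
    for _, f := range filters {
        pods = f.Filter(pods)
        if len(pods) == 0 {
            return nil, fmt.Errorf("filter %q filtered out all candidate pods", f.Name())
        }
    }
    return pods, nil
}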

@kfswain
Collaborator

kfswain commented Apr 25, 2025

LGTM, left a few comments, will leave to others to give the final LGTM

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain, nirrozenbaum

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 25, 2025
Signed-off-by: Nir Rozenbaum <[email protected]>
Contributor

@liu-cong liu-cong left a comment

/lgtm

/hold in case you want to address the nits. Feel free to unhold

String() string
}

type ScoredPod struct {
Pod Pod
Contributor

nit: I think embedding the Pod is better.

Contributor Author

Good idea.
Will update in a follow-up PR so as not to delay this one.
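
For reference, the embedding suggestion would look roughly like this (a sketch of the nit, not the actual follow-up change):

package types

// Pod is a simplified stand-in for the real Pod interface.
type Pod interface {
    String() string
}

// ScoredPod embeds Pod, so Pod's methods are promoted and a ScoredPod can
// be used wherever a Pod is expected, while still carrying its score.
type ScoredPod struct {
    Pod
    Score float64
}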

for _, pod := range pods {
score := s.runScorersForPod(ctx, pod)
pod.SetScore(score)
weightedScorePerPod[pod] = float64(0) // initialize weighted score per pod with 0 value
Contributor

nit: I don't think this is necessary.

Contributor Author

If no scorers are configured (as is the case today in main), removing this line will cause errors.
Try removing it locally and running the unit tests to see the error.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 28, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 28, 2025
@nirrozenbaum
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 28, 2025
@k8s-ci-robot k8s-ci-robot merged commit 855436e into kubernetes-sigs:main Apr 28, 2025
8 checks passed
@nirrozenbaum nirrozenbaum deleted the weighted-scorers branch April 28, 2025 09:30
rlakhtakia pushed a commit to rlakhtakia/gateway-api-inference-extension that referenced this pull request Apr 28, 2025
* removed unused noop plugin

Signed-off-by: Nir Rozenbaum <[email protected]>

* more scheduler refactoring

Signed-off-by: Nir Rozenbaum <[email protected]>

* more refactoring

Signed-off-by: Nir Rozenbaum <[email protected]>

* added weights to scorers and calculating weighted score

Signed-off-by: Nir Rozenbaum <[email protected]>

* addressed code review comments

Signed-off-by: Nir Rozenbaum <[email protected]>

---------

Signed-off-by: Nir Rozenbaum <[email protected]>