Weighted scorers #737

Merged
5 commits merged into kubernetes-sigs:main from the weighted-scorers branch on Apr 28, 2025

Conversation

@nirrozenbaum nirrozenbaum commented Apr 24, 2025

This PR continues the refactoring of the scheduler and introduces new functionality.
At a very high level:

  • The Scorer interface changed slightly: the Score function now receives all pods instead of being called for each pod one by one. This was raised as a requirement from IBM for scorers that use an external service rather than in-memory calculations (e.g., sending a request to Redis to calculate KV cache). There is a big benefit in sending all pods in one batch rather than one by one. Additionally, it fits the scheduling flow naturally:
filter([]pod) returns []pod // filtered pods
scorer([]pod) returns map[pod]float // a map from a pod to its score
picker(map[pod]float) returns the picked pod

so every step in the flow uses the output of the previous step.

  • Weighted scorers were added. Every scorer is initialized with a weight; the weighted score of a scorer is calculated by multiplying its score by (scorer_weight / sum_of_all_scorer_weights). See the sketch after this list.

  • Unit tests were added to cover all these changes.

  • Continued handling some comments from the Refactor scheduler to run plugins #677 review, such as removing noopPlugin.
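
A minimal sketch of the flow and the weighting described above, using simplified stand-in types rather than the actual interfaces in pkg/epp/scheduling (names and signatures here are illustrative assumptions):

package scheduling

// Pod is a simplified stand-in for the scheduler's pod snapshot type.
type Pod struct{ Name string }

// Filter narrows down the candidate pods.
type Filter interface {
    Filter(pods []Pod) []Pod
}

// Scorer scores all candidate pods in a single call, so scorers backed by
// an external service (e.g. Redis for KV-cache data) can batch one request.
type Scorer interface {
    Score(pods []Pod) map[Pod]float64
}

// Picker chooses the target pod from the weighted scores.
type Picker interface {
    Pick(scores map[Pod]float64) Pod
}

// weightedScores accumulates each scorer's result, scaled by
// scorer_weight / sum_of_all_scorer_weights.
func weightedScores(pods []Pod, scorers map[Scorer]int) map[Pod]float64 {
    sum := 0
    for _, w := range scorers {
        sum += w
    }
    weighted := make(map[Pod]float64, len(pods))
    for _, pod := range pods {
        weighted[pod] = 0 // every candidate keeps an entry even with zero scorers
    }
    for scorer, weight := range scorers {
        for pod, score := range scorer.Score(pods) {
            weighted[pod] += score * float64(weight) / float64(sum)
        }
    }
    return weighted
}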

Out of scope of this PR, to be added in follow-up PRs:

  • Scorer scores have to be normalized. There is a TODO in the scheduler file to implement the NormalizeScores function. Until we have Cong's prefix-aware scorers, the scorers are empty, so this has no implications.
  • The refactoring needs to continue, e.g., moving each filter to its own file. The whole idea of this refactoring is extensibility at build time; for extensibility and flexibility we need to be able to isolate each filter.
  • pkg/epp/scheduling/config/config.go has to be removed. These are values read from env vars for specific filters, which cannot work if one wants to change plugins at build time. Instead, each filter should read the relevant information it needs in its NewFilter creation function, or alternatively the values should be read by the NewFilter caller and passed as arguments.
    It definitely shouldn't be mandatory, as filters/scorers may change. A sketch of this approach follows this list.
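
A minimal sketch of that last point, with a filter that reads the value it needs in its own constructor; the env var name, default value, and constructor signature are assumptions for illustration:

package filters

import (
    "os"
    "strconv"
)

// LowQueueFilter keeps pods whose queue depth is below a threshold.
type LowQueueFilter struct {
    threshold int
}

// NewLowQueueFilter reads its own configuration at construction time
// (with a fallback default), so no shared config.go is needed and forks
// can swap filters without touching a central env-var file.
func NewLowQueueFilter() *LowQueueFilter {
    threshold := 128 // illustrative default
    if v, ok := os.LookupEnv("LOW_QUEUE_FILTER_THRESHOLD"); ok {
        if parsed, err := strconv.Atoi(v); err == nil {
            threshold = parsed
        }
    }
    return &LowQueueFilter{threshold: threshold}
}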

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 24, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 24, 2025

netlify bot commented Apr 24, 2025

Deploy Preview for gateway-api-inference-extension ready!

Latest commit: 6e68563
Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/680dcdfaa1fb0b0008fa4a8f
Deploy Preview: https://deploy-preview-737--gateway-api-inference-extension.netlify.app

// When the scheduler is initialized with NewScheduler function, this config will be used as default.
// it's possible to call NewSchedulerWithConfig to pass a different argument.

// For build time plugins changes, it's recommended to change the defaultConfig variable in this file.
Contributor Author

@nirrozenbaum nirrozenbaum Apr 24, 2025

PAY ATTENTION.
When someone wants to change the GIE default filters/scorers/picker/etc. by forking the repo, the flow is the following:

  • fork the repo.
  • add their own plugins in new files.
  • change ONLY THIS FILE to include their set of plugins using the defaultConfig variable.
  • that's it. Build the EPP.

The idea behind this small file is to let such users keep their fork synced without drifting from upstream.
Since they have only this single-file change, they can sync the rest of the code to get bug fixes and such.
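
A rough sketch of what that single-file change could look like; the config fields and plugin constructors here are placeholders, not the actual GIE API:

package scheduling

// Simplified stand-ins for the plugin interfaces defined elsewhere.
type (
    Filter interface{ Name() string }
    Scorer interface{ Name() string }
    Picker interface{ Name() string }
)

// SchedulerConfig is a simplified stand-in for the real config struct.
type SchedulerConfig struct {
    filters []Filter
    scorers map[Scorer]int // scorer -> weight
    picker  Picker
}

// defaultConfig is the single variable a fork edits to plug in its own
// filters/scorers/picker; the rest of the code stays in sync with upstream.
var defaultConfig = &SchedulerConfig{
    filters: []Filter{myFilter{}},
    scorers: map[Scorer]int{myScorer{}: 1},
    picker:  myPicker{},
}

// Hypothetical fork-local plugins living in their own new files.
type (
    myFilter struct{}
    myScorer struct{}
    myPicker struct{}
)

func (myFilter) Name() string { return "my-filter" }
func (myScorer) Name() string { return "my-scorer" }
func (myPicker) Name() string { return "my-picker" }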

Contributor

Why not manage two separate files:
one for the SchedulerConfig definition, which would not be changed in a personal fork (since it could be changed upstream),
and another file which defines the config instance with all required filters/scorers/picker/...

Contributor Author

If SchedulerConfig has changed upstream (e.g., includes new fields), that means the Scheduler itself changed upstream (because the scheduler uses the fields from the config).
In such a case it's not possible to keep the previous config in a personal fork and still use the scheduler from upstream.

@nirrozenbaum
Contributor Author

CC @kfswain @liu-cong

type PodMetrics struct {
score float64
Contributor

I would keep the score here. It's clean, extensible, and also makes sorting pods based on score easier than a map from pods to scores.

Contributor Author

@nirrozenbaum nirrozenbaum Apr 25, 2025

score is not a property of the PodMetrics snapshot, but a result of calling the Score function.
PodMetrics is a snapshot of metrics at a given point in time and should be a read-only interface that helps plugins with their calculations during the scheduling cycle (e.g., a filter/scorer may use metrics).

We may add a new, different struct, ScoredPod, with the fields Pod and Score, and use that instead of the map, which would allow the easy sorting:

type ScoredPod struct {
    Pod types.Pod
    Score float64
}

LMKWYT

Contributor

Sg, I prefer []ScoredPod over the map, since it makes both random picking and sorting based on score easier. Another option is to have a ScoredPods type which contains both a list and a map, if the map is needed for any reason in the future.

Contributor Author

I added the ScoredPod struct and updated the code to use it.
Please note that the Score function still defines the return value as map[pod]float, which is exactly the definition of a scorer: giving a score to each of the candidate pods.
This map is also very useful for accumulating the weighted score per pod in a fast way (hash lookups).

Before passing the pods with their weighted scores to the picker, this is converted into []ScoredPod.

The way I see it, the picker needs to be able to iterate over the pods AND to sort pods based on score.
There is no use for both a slice and a map at this point, but if that becomes relevant we can add it.
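
A small sketch of that conversion, assuming the scheduler ends up with a map of accumulated weighted scores (types here are simplified stand-ins):

package scheduling

import "sort"

// Pod and ScoredPod are simplified stand-ins for the real types.
type Pod struct{ Name string }

type ScoredPod struct {
    Pod   Pod
    Score float64
}

// toScoredPods flattens the weighted-score map into a slice the picker
// can iterate, here sorted descending by score.
func toScoredPods(weighted map[Pod]float64) []ScoredPod {
    scored := make([]ScoredPod, 0, len(weighted))
    for pod, score := range weighted {
        scored = append(scored, ScoredPod{Pod: pod, Score: score})
    }
    sort.Slice(scored, func(i, j int) bool {
        return scored[i].Score > scored[j].Score
    })
    return scored
}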

@@ -106,21 +112,21 @@ func (s *Scheduler) Schedule(ctx context.Context, req *types.LLMRequest) (*types
// 1. Reduce concurrent access to the datastore.
// 2. Ensure consistent data during the scheduling operation of a request.
sCtx := types.NewSchedulingContext(ctx, req, types.ToSchedulerPodMetrics(s.datastore.PodGetAll()))
loggerDebug.Info(fmt.Sprintf("Scheduling a request. Metrics: %+v", sCtx.PodsSnapshot))
loggerDebug.Info(fmt.Sprintf("Scheduling a request, Metrics: %+v", sCtx.PodsSnapshot))

s.runPreSchedulePlugins(sCtx)

pods := s.runFilterPlugins(sCtx)
if len(pods) == 0 {
return nil, errutil.Error{Code: errutil.InferencePoolResourceExhausted, Msg: "failed to find a target pod"}
Collaborator

I don't think it's in scope of this PR, but we should think about changing this error code. There's no guarantee that resources are exhausted when there are no pods, just that the filtering for the specific request came up with nothing. I worry that this sends incorrect signals to the reader.

Contributor

@liu-cong liu-cong Apr 25, 2025

This is a regression from b24f948, I believe. Previously this error was returned by the sheddableRequestFilter, which knows this is a resource-exhausted error.

I would argue for keeping the error returns from the Filters and Scorers, even if we don't need them today. This is more future-proof, and addresses issues like this one. @nirrozenbaum, what do you think about bringing the errors back?
Not asking to do it in this PR; it can be a follow-up.

Contributor Author

@nirrozenbaum nirrozenbaum Apr 27, 2025

As of today, we have the following filters in main:
LowQueueFilter, LoRAAffinityFilter, LeastQueueFilter, LeastKVCacheFilter, HasCapacityFilter. In addition to those, we're currently implementing a few more filters and scorers at IBM that will be added to GIE when ready.
None of the above uses errors.
I think this should give us some hints about the usage of errors in filters/scorers/picker.

I'm in favor of being flexible, but I think we should be use-case driven and not try to be future-proof; otherwise we may end up implementing things that are not used and/or not relevant.

For example, we changed the Score function to get all pods. This is flexible because it allows batching pods to an external scorer while at the same time allowing each pod to be scored separately.
But this change was use-case driven: it is required since we are developing a KVCache scorer at IBM which uses Redis as an external service, and @kfswain also showed in his py-go that it's useful.

Contributor

DropRequest is a use case for returning an error. How do we solve that without an error?
My point about being future-proof is that it will take a lot of refactoring effort to add it back later, once we have many more plugins.

Contributor Author

We started this discussion on #677. I'll put my interpretation of this question here:
From my understanding, DropRequestFilter == FilterAllPods (please correct me if I'm wrong).
In other words, this is a filter whose result is "no pods passed this filter".
This is not unique to this filter and can happen in any filter (the scheduler is extensible now; new filters might be written and used).

First point: we can argue whether a filter whose result is an empty list is an error. I claim there was no error in the filter itself, as the filter worked correctly and filtered according to its definition. The result is that we cannot serve the request, but this is not an error with the filter (the scheduler returns an error if len(filtered_pods) == 0).

Second point, and I'll refer to the code as it was in #677 (since it has changed since then): this error was used in two places in the code:

To summarize, IMO the DropRequest that was previously used is not a good example of error usage, since this is not an error. It was used (I think incorrectly) to mark that no pods were left. For this check, we can easily check the length of the returned filtered pods.

Contributor

You have a lot of great points!

If a filter results in 0 pods, this will effectively lead to an error for the scheduler. While it's debatable whether it's an error from the filter perspective, we need to communicate this to the scheduler. Currently the scheduler simply returns a ResourceExhausted error, which seems too broad. IMO the filter should communicate a reason why it filtered out all pods.

How about adding another "filterReason" return value (perhaps an enum), if an error is too strong, so the scheduler can interpret the empty pod list and return a proper error to consumers?

Contributor Author

Yes, I agree that ResourceExhausted is not the right error to use here and that it should be changed.
We're in an intermediate phase and I'm trying to break the scheduler refactoring down into smaller PRs, to allow easier reviews and merges. I will surely change this in the next PR.
I tend to say this discussion is only about filters. Once there is at least 1 pod left after the filter part, I really don't think we should abort a scheduling cycle if there is a temporary issue with one of the scorers; that scorer can return a zero value for all candidate pods until the issue is resolved (and write the issue to the log). This is very different from the kube-scheduler framework in this aspect.

"filterReason" is another word for error :) if there is no error we won't use the reason.
So if we decide on adding such a thing, it's better to use an error, which can be wrapped in upper-layer errors.

Can you give an example of a "filterReason" you can think of?
I think writing in the returned error which filter is the one that filtered out all pods is useful. This is what I planned for my next PR (maybe I can push it in a separate PR today).
But what reason could there be other than "the candidate pods don't meet the filter criteria"?
At the end, it's a boolean condition.
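
One possible shape for that direction (the scheduler reporting which filter eliminated the last candidates); this is only a sketch of the idea discussed here, not code from the PR:

package scheduling

import "fmt"

// Pod and Filter are simplified stand-ins for the real types.
type Pod struct{ Name string }

type Filter interface {
    Name() string
    Filter(pods []Pod) []Pod
}

// runFilters applies the filters in order; when a filter leaves zero
// candidates, the returned error names the responsible filter instead of
// a generic ResourceExhausted.
func runFilters(filters []Filter, pods []Pod) ([]Pod, error) {
    for _, f := range filters {
        pods = f.Filter(pods)
        if len(pods) == 0 {
            return nil, fmt.Errorf("filter %q filtered out all candidate pods", f.Name())
        }
    }
    return pods, nil
}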

@kfswain
Collaborator

kfswain commented Apr 25, 2025

LGTM, left a few comments, will leave to others to give the final LGTM

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain, nirrozenbaum

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 25, 2025
Signed-off-by: Nir Rozenbaum <[email protected]>
Contributor

@liu-cong liu-cong left a comment

/lgtm

/hold in case you want to address the nits. Feel free to unhold

String() string
}

type ScoredPod struct {
Pod Pod
Contributor

nit: I think embedding the Pod is better.

Contributor Author

Good idea.
Will update in a follow-up PR so as not to delay this one.
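
For reference, the embedding suggestion would look roughly like this (a sketch of the nit, not the actual follow-up change):

package types

// Pod is a simplified stand-in for the real Pod interface.
type Pod interface {
    String() string
}

// ScoredPod embeds Pod, so Pod's methods are promoted and a ScoredPod can
// be used wherever a Pod is expected, while still carrying its score.
type ScoredPod struct {
    Pod
    Score float64
}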

for _, pod := range pods {
score := s.runScorersForPod(ctx, pod)
pod.SetScore(score)
weightedScorePerPod[pod] = float64(0) // initialize weighted score per pod with 0 value
Contributor

nit: I don't think this is necessary.

Contributor Author

If no scorers are configured (as is the case today in main), removing this line will cause errors.
Try removing it locally and running the unit tests to see the error.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 28, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 28, 2025
@nirrozenbaum
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 28, 2025
@k8s-ci-robot k8s-ci-robot merged commit 855436e into kubernetes-sigs:main Apr 28, 2025
8 checks passed
@nirrozenbaum nirrozenbaum deleted the weighted-scorers branch April 28, 2025 09:30
rlakhtakia pushed a commit to rlakhtakia/gateway-api-inference-extension that referenced this pull request Apr 28, 2025
* removed unused noop plugin

Signed-off-by: Nir Rozenbaum <[email protected]>

* more scheduler refactoring

Signed-off-by: Nir Rozenbaum <[email protected]>

* more refactoring

Signed-off-by: Nir Rozenbaum <[email protected]>

* added weights to scorers and calculating weighted score

Signed-off-by: Nir Rozenbaum <[email protected]>

* addressed code review comments

Signed-off-by: Nir Rozenbaum <[email protected]>

---------

Signed-off-by: Nir Rozenbaum <[email protected]>