Weighted scorers #737
Conversation
Signed-off-by: Nir Rozenbaum <[email protected]>
// When the scheduler is initialized with NewScheduler function, this config will be used as default.
// it's possible to call NewSchedulerWithConfig to pass a different argument.

// For build time plugins changes, it's recommended to change the defaultConfig variable in this file.
PAY ATTENTION.
when someone wants to change the GIE default filters/scorers/picker/etc. by forking the repo, the flow is the following:
- fork the repo
- add their own plugins in new files
- change ONLY THIS FILE to include their set of plugins using the defaultConfig variable
- that's it. build the EPP.

the idea behind this small file is for such users to be able to keep their fork synced without drifting from upstream.
since they have only this single file changed, they can sync the rest of the code to get bug fixes and such.
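To make the flow concrete, here is a minimal, self-contained sketch of the idea; the types, field names, and plugin constructors below are illustrative stand-ins, not the actual GIE API:

```go
package main

import "fmt"

// Minimal stand-ins for the real plugin interfaces (illustrative only).
type Filter interface{ Name() string }
type Scorer interface{ Name() string }

// Hypothetical fork-local plugins; in a real fork these live in new files.
type myFilter struct{}

func (myFilter) Name() string { return "my-custom-filter" }

type myScorer struct{}

func (myScorer) Name() string { return "my-custom-scorer" }

// schedulerConfig mimics the role of defaultConfig: the only thing a fork
// edits is this variable, swapping in its own filters/scorers (with weights).
type schedulerConfig struct {
	filters       []Filter
	scorerWeights map[Scorer]int
}

var defaultConfig = schedulerConfig{
	filters:       []Filter{myFilter{}},
	scorerWeights: map[Scorer]int{myScorer{}: 1},
}

func main() {
	fmt.Printf("filters=%d, scorers=%d\n", len(defaultConfig.filters), len(defaultConfig.scorerWeights))
}
```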
why not manage two separate files:
one for the SchedulerConfig definition, which will not be changed in a personal fork (since it could be changed in the upstream),
and another file which defines the config instance with all the required filters/scorers/picker/...
if SchedulerConfig has changed in upstream (e.g., includes new fields), that means the Scheduler itself changed in upstream (because the scheduler uses the fields from the config).
in such a case it's not possible to keep the previous config in a personal fork and use the scheduler from upstream.
type PodMetrics struct {
	score float64
I will keep the score here. It's clean, extensible, and also makes sorting pods based on score easier than a map from pods to scores.
score is not a property of the PodMetrics snapshot, but a result of calling the Score function.
PodMetrics is a snapshot of metrics at a given point in time and should be a read-only interface that helps in plugin calculations during the scheduling cycle (e.g., a filter/scorer may use metrics).
we may add a new, different struct, ScoredPod, with the fields Pod and Score, and use that instead of the map, which will allow the easy sorting:

type ScoredPod struct {
	Pod   types.Pod
	Score float64
}

LMKWYT
Sg, I prefer []ScoredPod over the map, since it makes either random picking or sorting based on score easier. Another option is to have a ScoredPods type which can contain both a list and a map, if the map is needed for any reason in the future.
I added the ScoredPod struct and updated the code to use it.
please pay attention that the Score function still defines the return value as map[pod]float64, which is exactly the definition of a scorer: giving a score to each of the candidate pods.
this map is also very useful for accumulating the weighted score per pod in a fast way (hash).
before passing the pods with the weighted scores to the picker, this is converted into []ScoredPod.
the way I see it, the picker needs to be able to iterate over the pods AND to be able to sort pods based on score.
there is no use for both slice and map at this point, but if it becomes relevant at some point we can add it.
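A rough, self-contained sketch of that flow (names here are illustrative, not the actual GIE code): each scorer returns a map of pod to score, the scheduler weights and accumulates those into one map, and only then converts the result into []ScoredPod for the picker.

```go
package main

import (
	"fmt"
	"sort"
)

type Pod struct{ Name string }

type ScoredPod struct {
	Pod   Pod
	Score float64
}

// accumulateWeightedScores: each entry in scorerResults is the map[pod]score
// returned by one scorer; weights[i] is that scorer's weight. Scores are
// accumulated per pod using weight/sumOfWeights, then converted to a sortable
// []ScoredPod for the picker.
func accumulateWeightedScores(pods []Pod, scorerResults []map[Pod]float64, weights []int) []ScoredPod {
	sumWeights := 0
	for _, w := range weights {
		sumWeights += w
	}
	weighted := make(map[Pod]float64, len(pods))
	for _, pod := range pods {
		weighted[pod] = 0 // every candidate pod gets an entry, even if no scorer is configured
	}
	for i, result := range scorerResults {
		for pod, score := range result {
			weighted[pod] += score * float64(weights[i]) / float64(sumWeights)
		}
	}
	scored := make([]ScoredPod, 0, len(weighted))
	for pod, score := range weighted {
		scored = append(scored, ScoredPod{Pod: pod, Score: score})
	}
	sort.Slice(scored, func(i, j int) bool { return scored[i].Score > scored[j].Score })
	return scored
}

func main() {
	pods := []Pod{{"pod-a"}, {"pod-b"}}
	results := []map[Pod]float64{
		{pods[0]: 0.9, pods[1]: 0.2}, // scorer with weight 2
		{pods[0]: 0.1, pods[1]: 0.8}, // scorer with weight 1
	}
	fmt.Println(accumulateWeightedScores(pods, results, []int{2, 1}))
}
```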
@@ -106,21 +112,21 @@ func (s *Scheduler) Schedule(ctx context.Context, req *types.LLMRequest) (*types
	// 1. Reduce concurrent access to the datastore.
	// 2. Ensure consistent data during the scheduling operation of a request.
	sCtx := types.NewSchedulingContext(ctx, req, types.ToSchedulerPodMetrics(s.datastore.PodGetAll()))
-	loggerDebug.Info(fmt.Sprintf("Scheduling a request. Metrics: %+v", sCtx.PodsSnapshot))
+	loggerDebug.Info(fmt.Sprintf("Scheduling a request, Metrics: %+v", sCtx.PodsSnapshot))

	s.runPreSchedulePlugins(sCtx)

	pods := s.runFilterPlugins(sCtx)
	if len(pods) == 0 {
		return nil, errutil.Error{Code: errutil.InferencePoolResourceExhausted, Msg: "failed to find a target pod"}
I don't think it's in scope of this PR, but we should think about changing this error code. There's no guarantee that resources are exhausted when there are no pods, just that the filter for the specific request came up with nothing. I worry that sends incorrect signals to the reader.
This is a regression from b24f948, I believe. Previously this error was returned by the sheddableRequestFilter, which knows this is a resource-exhausted error.
I would argue for keeping the errors from the Filters and Scorers, even if we don't need them today. This is more future proof, and addresses issues like this. @nirrozenbaum What do you think about bringing the errors back?
Not asking to do it in this PR. Can be a follow up.
as of today, we have in main the following filters: LowQueueFilter, LoRAAffinityFilter, LeastQueueFilter, LeastKVCacheFilter, HasCapacityFilter. in addition to those, we're currently implementing a few more filters and scorers in IBM that will be added to GIE when ready.
none of the above uses an error.
I think this should give us some hints about the usage of errors in filters/scorers/picker.
I'm in favor of being flexible, but I think we should be use-case driven and not try to be future proof, otherwise we may end up implementing things that are not used and/or not relevant.
for example, we changed the Score function to get all pods. this is flexible because it allows batching pods to an external scorer and at the same time it also allows scoring each pod separately.
but this change was use-case driven - it is required since we are developing in IBM a KVCache scorer which uses Redis as an external service, and @kfswain also showed in his py-go that it's useful.
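For reference, this is the shape of the batched Score call being described; the types here are stand-ins, not the actual GIE interface:

```go
package scheduling

// Pod and SchedulingContext are stand-ins for the real types.
type Pod struct{ Name string }
type SchedulingContext struct{}

// Scorer sketches the batched form of Score: all candidate pods arrive in a
// single call, so a scorer backed by an external service (e.g., an external
// KV-cache index) can issue one batched lookup instead of one call per pod,
// while an in-memory scorer can still score each pod individually in a loop.
type Scorer interface {
	Name() string
	Score(ctx *SchedulingContext, pods []Pod) map[Pod]float64
}
```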
The DropRequest is a use case for returning an error. How do we solve that without an error?
My point about being future proof is that it will take a lot of refactoring effort to add it back later, once we have many more plugins in the future.
we started this discussion on #677. I'll put here my interpretation for this question:
from my understanding, DropRequestFilter == FilterAllPods (please correct me if I'm wrong).
in other words, this is a filter which results in an answer of "no pods passed this filter".
this is not unique to this filter and can happen in any filter (the scheduler is extensible now, new filters might be written and used).
first point: we can argue whether a filter whose result is an empty list is an error. I claim that there was no error in the filter itself, as the filter worked correctly and filtered according to its definition. the result is that we cannot serve the request, but this is not an error with the filter (the scheduler returns an error if len(filtered_pods) == 0).
second point, and I'll refer to the code as it was in #677 (since it has changed since then) - this error was used in two places in the code:
- https://github.com/liu-cong/llm-instance-gateway/blob/83589ebc702ce65c43bf88efcb46d9286ae41bd0/pkg/epp/scheduling/plugins/filter.go#L119-L121 - here, the conversion to a filter func checks if no pods were left after the filtering, and translates that into an error.
- https://github.com/liu-cong/llm-instance-gateway/blob/83589ebc702ce65c43bf88efcb46d9286ae41bd0/pkg/epp/scheduling/plugins/filter.go#L81 - here, in the decision tree, there is a check if err == nil && len(filtered) > 0. since the only case where the error was used is to "mark" that no pods were left (see the previous point), this err == nil condition can be removed and checking the second condition is sufficient.
to summarize, IMO the DropRequest that was previously used is not a good example of an error usage, since this is not an error. it was used (I think incorrectly) to mark that no pods were left. for this check, we can easily check the length of the returned filtered pods.
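A small, self-contained sketch of the simplification described in the second point (the types and helper are hypothetical): once no error is used to mark "all pods filtered out", the length check alone drives the decision tree.

```go
package main

import "fmt"

type pod struct{ name string }

// filterFunc is a hypothetical filter: it returns the subset of pods that pass.
type filterFunc func(pods []pod) []pod

// applyFilter runs one filter and reports whether this branch of the decision
// tree can continue; len(filtered) > 0 replaces the old err == nil && len(...) check.
func applyFilter(f filterFunc, pods []pod) ([]pod, bool) {
	filtered := f(pods)
	return filtered, len(filtered) > 0
}

func main() {
	dropAll := filterFunc(func([]pod) []pod { return nil })
	_, ok := applyFilter(dropAll, []pod{{"pod-a"}})
	fmt.Println("continue this branch:", ok) // false - no error needed to signal "no pods left"
}
```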
You have a lot of great points!
If a filter results in 0 pods, this will effectively lead to an error for the scheduler. While it's debatable whether it's an error from the filter perspective, we need to communicate this to the scheduler. Currently the scheduler simply returns a ResourceExhausted error, which seems too broad. IMO the filter should communicate a reason why it filtered out all pods.
How about adding another "filterReason" return value (perhaps an enum), if an error is too strong, so the scheduler can interpret the empty pods and return a proper error to consumers?
yes, I agree that ResourceExhausted is not the right error to use here and that it should be changed.
we're in some intermediate phase and I'm trying to break down the scheduler refactoring into smaller PRs, to allow easier reviews and merges. I will surely change this in the next PR.
I tend to say this discussion is only around filters. once there is at least one pod left after the filter step, I really don't think we should abort a scheduling cycle if there is a temporary issue with one of the scorers; that scorer can return a zero value for all candidate pods until the issue is resolved (and write the issue to the log). this is very different from the kube-scheduler framework in this aspect.
"filterReason" is another word for saying error :) if there is no error we won't use the reason.
so if we decide on adding such a thing, it's better to use an error, which can be wrapped in upper-layer errors.
can you give an example of a "filterReason" you can think of?
I think writing in the returned error which filter is the one that filtered out all pods is useful. this is what I planned for my next PR (maybe I can push this in a separate PR today).
but what reason could there be other than "the candidate pods don't meet the filter criteria"?
at the end, it's a boolean condition.
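A sketch of the kind of error wrapping being proposed here (names are hypothetical): the scheduler's error names the filter that left zero pods, and upper layers can still wrap or match it.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoTargetPod is a hypothetical sentinel for "no pod survived filtering".
var errNoTargetPod = errors.New("failed to find a target pod")

// errAllPodsFiltered wraps the sentinel with the name of the filter that
// filtered out all candidate pods, so callers see which filter was responsible.
func errAllPodsFiltered(filterName string) error {
	return fmt.Errorf("filter %q filtered out all candidate pods: %w", filterName, errNoTargetPod)
}

func main() {
	err := errAllPodsFiltered("LowQueueFilter")
	fmt.Println(err)                            // includes the filter name
	fmt.Println(errors.Is(err, errNoTargetPod)) // true - still matchable by upper layers
}
```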
LGTM, left a few comments, will leave it to others to give the final LGTM /approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kfswain, nirrozenbaum. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Signed-off-by: Nir Rozenbaum <[email protected]>
/lgtm
/hold in case you want to address the nits. Feel free to unhold
	String() string
}

type ScoredPod struct {
	Pod Pod
nit: I think embedding the Pod is better.
good idea.
will update in a follow-up PR so as not to delay this one.
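For clarity, this is what the embedding nit refers to (Pod here is a stand-in for the interface shown in the diff above): with Pod embedded, ScoredPod exposes the interface's methods directly.

```go
package scheduling

// Pod stands in for the Pod interface from the diff above.
type Pod interface {
	String() string
}

// ScoredPod with the Pod interface embedded rather than held in a named field:
// callers can call scored.String() directly instead of scored.Pod.String().
type ScoredPod struct {
	Pod
	Score float64
}
```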
for _, pod := range pods {
	score := s.runScorersForPod(ctx, pod)
	pod.SetScore(score)
	weightedScorePerPod[pod] = float64(0) // initialize weighted score per pod with 0 value
nit: I don't think this is necessary.
if no scorers are configured (for example, like today in main), removing this line will cause errors.
try removing it locally and running the unit tests to see the error.
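A self-contained sketch of why that initialization matters (names are illustrative): with zero scorers configured the accumulation loop never runs, so without the explicit zero values the weighted-score map stays empty and the picker sees no candidates even though pods passed the filters.

```go
package main

import "fmt"

type pod struct{ name string }

// buildWeightedScores mirrors the accumulation step: the initialization loop
// guarantees every candidate pod has an entry even when scorerResults is empty.
func buildWeightedScores(pods []pod, scorerResults []map[pod]float64) map[pod]float64 {
	weighted := make(map[pod]float64, len(pods))
	for _, p := range pods {
		weighted[p] = 0 // without this line, zero scorers => empty map => nothing to pick from
	}
	for _, result := range scorerResults {
		for p, s := range result {
			weighted[p] += s
		}
	}
	return weighted
}

func main() {
	pods := []pod{{"pod-a"}, {"pod-b"}}
	fmt.Println(len(buildWeightedScores(pods, nil))) // 2 candidates, not 0
}
```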
/unhold
* removed unused noop plugin
  Signed-off-by: Nir Rozenbaum <[email protected]>
* more scheduler refactoring
  Signed-off-by: Nir Rozenbaum <[email protected]>
* more refactoring
  Signed-off-by: Nir Rozenbaum <[email protected]>
* added weights to scorers and calculating weighted score
  Signed-off-by: Nir Rozenbaum <[email protected]>
* addressed code review comments
  Signed-off-by: Nir Rozenbaum <[email protected]>
---------
Signed-off-by: Nir Rozenbaum <[email protected]>
This PR continues the refactoring of the scheduler and introduces new functionality.
At a very high level:
- The Score function now gets all pods instead of one by one. This was raised as a requirement from IBM for scorers that use an external service rather than in-memory calculations (e.g., sending a request to Redis to calculate KV cache). There is a big benefit in sending all pods in one batch and not one by one. Additionally, when following the schedule flow it makes much sense, in that every step in the flow uses the output of the previous step.
- Weighted scorers were added. Every scorer is initialized with a weight. The weighted score of a scorer is calculated by multiplying its score by (scorer_weight / sum_of_all_scorers_weights). A short worked example follows at the end of this description.
- Unit tests were added to cover all these changes.
- Continued handling some comments from the Refactor scheduler to run plugins #677 review, such as removing noopPlugin.

Out of scope of this PR, to be addressed in a follow-up PR:
- pkg/epp/scheduling/config/config.go has to be removed. These are values read from env vars for specific filters, which cannot work if one wants to change plugins at build time. Instead, each filter should read the relevant information it needs in its NewFilter creation function, or alternatively, the NewFilter caller should read it and pass it as an argument. It definitely shouldn't be mandatory, as filters/scorers may change.
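A short worked example of the weighting above (the numbers are made up): with two scorers of weights 2 and 1, a pod scored 0.9 by the first and 0.3 by the second gets a weighted score of 0.9 * (2/3) + 0.3 * (1/3) = 0.6 + 0.1 = 0.7.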