API Proposal #5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Changes from all commits (24 commits):
- 3590f78 Big ol proposal commit (kfswain)
- b5d0c05 1st round of addressing comments (kfswain)
- 1dabf98 Partial glossary implementation (kfswain)
- 6ef1add API updates (kfswain)
- 620c834 glossary additions (kfswain)
- ecc4015 adding axioms and faq (kfswain)
- 9f24103 Addressing review comments (kfswain)
- 8d26e3b more proposal updates (kfswain)
- 1f3abf2 clarifying plural modelgroup (kfswain)
- 8418672 editing comment to max col 80 len (kfswain)
- 3e1a5bf More explicit documentation on plural usecases (kfswain)
- 97131ef Adding examples and word clarification (kfswain)
- 7fc7879 Adding persona summary to ModelGroup obj (kfswain)
- c3bb56b Changing Constants to adhere to style guide (kfswain)
- 5ad7e8f Grammatical fixes (kfswain)
- 6bacaf1 Changing BackendPool acronym (kfswain)
- e6e4360 CUJ description clarification (kfswain)
- d385c80 Adding back BackendPool with description (kfswain)
- 439e7ef Updating names to LLM Service and LLMServerPool (kfswain)
- 2285b69 Typos, rewording, and small fixes (kfswain)
- 649d2c3 another review pass (kfswain)
- 063a80d link fixes (kfswain)
- 54d0543 fixing wording, removing duplication (kfswain)
- 75861c2 shortining targetmodel name field (kfswain)
# Glossary

This glossary attempts to more thoroughly explain the terms used within the API proposal, in an effort to give context to API decisions.

<!-- toc -->
- [API Terms](#api)
- [LLMServerPool](#llmserverpool)
- [LLMService](#llmservice)
- [Capacity Constrained Routing](#capacity-constrained-routing)
- [Priority](#priority)
- [Fairness](#fairness)
- [General Routing](#general-routing)
- [Latency Based Routing](#latency-based-routing)
- [Lora Affinity](#lora-affinity)
<!-- /toc -->

## API
This is a very brief description of the terms used to describe API objects, included for completeness.

### LLMServerPool
A grouping of model servers that serve the same set of fine-tunes (LoRA adapters as a primary example).

Shortened to: `LSP`

### LLMService
An LLM workload that is defined on, and runs on, an LLMServerPool alongside other use cases.
# Capacity Constrained Routing

## Priority

### Summary
Priority specifies the importance of an LLMService relative to other services within an LLMServerPool.

### Description

For our purposes, priority can be thought of in two classes:
- Critical
- Non-Critical

The primary difference is that, in the face of resource scarcity, requests to non-critical LLMServices will be rejected in favor of critical ones.

Example:

Your current request load is using 80 Arbitrary Compute Units (ACU) of your pool's total capacity of 100 ACU: 40 ACU are critical workload requests and 40 are non-critical. If you were to lose 30 ACU of capacity to an unforeseen outage, priority would dictate that the 10 ACU of surplus requests to be rejected would come entirely from the _non-critical_ requests.
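The arithmetic above can be sketched as a simple load-shedding pass. This is only an illustration of the priority semantics; the function and request names are hypothetical and not part of the proposal's API:

```python
def shed_requests(requests, capacity_acu):
    """Admit requests in priority order; reject what no longer fits.

    `requests` is a list of (name, acu_cost, is_critical) tuples.
    Illustrative sketch only: real admission control would be
    per-request and continuous, not a single batch pass.
    """
    admitted, rejected = [], []
    # Critical requests are considered first (False sorts before True).
    for name, cost, critical in sorted(requests, key=lambda r: not r[2]):
        if cost <= capacity_acu:
            admitted.append(name)
            capacity_acu -= cost
        else:
            rejected.append(name)
    return admitted, rejected

# The example above: 40 ACU of critical requests, 40 ACU of non-critical
# requests (modeled as 4 x 10 ACU), and capacity dropping from 100 to 70.
reqs = [("crit-1", 40, True)] + [(f"noncrit-{i}", 10, False) for i in range(4)]
admitted, rejected = shed_requests(reqs, 70)
# The 10 surplus ACU rejected come entirely from the non-critical requests.
```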
## Fairness

### Summary
Fairness specifies how resources are shared among different LLMServices, in a way that is most acceptable to the user.

### Description

Fairness, like priority, is only used during resource scarcity events.

Fairness comes into play when requests of the same priority class need to be rejected or queued. There are many dimensions that could be weighed when sharing resources. To name a few:
- KV-cache utilization
- Total request count
- SLO adherence

For the v1 MVP, the only objective a user can specify is the SLO they would like to meet, so, following that pattern, fairness in the MVP will be based solely on SLO adherence. SLO adherence is only considered over a rolling time window of data.

The TTL we are currently assuming is: `5 min`
### Example

**Assumption:** Services have equally weighted fairness for this example.

- Service A has met its SLO for 98% of the requests made in the time window; Service B has met its SLO 94% of the time.

- A request for Service A and a request for Service B come in at the same time, and there is only capacity to start a single new request in the LSP; this capacity would meet the SLO for either service. The other request would be queued (potentially causing that request to miss its SLO).

- To fairly share these resources, Service B *must* be selected to begin its request immediately, as Service A has had its SLO met a larger percentage of the time.
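Under those assumptions, the selection rule reduces to "pick the service with the lowest windowed SLO adherence". A minimal sketch, with hypothetical names:

```python
def pick_next(adherence):
    """Given same-priority services contending for one free slot, start
    the request for the service whose SLO has been met least often over
    the rolling window. `adherence` maps service name to a 0-1 fraction.
    Illustrative sketch; assumes equally weighted fairness.
    """
    return min(adherence, key=adherence.get)

# Service A met its SLO on 98% of windowed requests, Service B on 94%,
# so Service B gets the available capacity.
winner = pick_next({"A": 0.98, "B": 0.94})
```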
# General Routing
Unlike the previous definitions, these terms describe methods of routing that are always active, and that seek to better utilize compute resources so as to avoid capacity constraints as much as possible.

## Latency Based Routing

### Summary
Latency Based Routing uses data to ensure LLMServices meet their specified SLO.

### Description
Data collected from the model servers, combined with data collected from the request, is used to predict the time a request will take on a *specific* model server, and to route in the way that best satisfies the SLO of the incoming requests.
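One way to read that description is as a per-request argmin over predicted latencies. The sketch below is an assumption about the general shape of such logic, not the proposal's actual algorithm; server names and values are made up:

```python
def route(predicted_latency_ms):
    """Pick the model server with the lowest predicted latency for this
    request. A real implementation would also weigh SLO targets and
    priority; this is a minimal illustrative sketch.

    `predicted_latency_ms` maps server name to a latency prediction
    built from model-server metrics plus request features
    (e.g. prompt length).
    """
    return min(predicted_latency_ms, key=predicted_latency_ms.get)

server = route({"server-1": 120.0, "server-2": 80.0, "server-3": 95.0})
```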
## Lora Affinity

### Summary
LoRA Affinity describes the routing strategy shown in the [demo](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458), used to better utilize model servers within the LSP.

### Description
Model servers that support multi-LoRA handle requests on a FCFS (first-come, first-served) basis. By utilizing the data provided by the model server (the state of loaded LoRA adapters), a routing system can route requests for a given LoRA adapter to a model server that already has that adapter loaded. This creates larger batches than naive routing would, which better utilizes the model server hardware.
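A minimal sketch of that affinity rule, assuming the router sees each server's loaded-adapter set and current load (all names here are hypothetical):

```python
def route_lora(adapter, servers):
    """Prefer a server that already has `adapter` loaded (larger batches,
    better hardware utilization); otherwise fall back to the least-loaded
    server, which would load the adapter on demand. Illustrative sketch.

    `servers` maps server name -> {"adapters": set, "load": int}.
    """
    with_adapter = [s for s, st in servers.items() if adapter in st["adapters"]]
    if with_adapter:
        # Among affinity candidates, pick the least loaded.
        return min(with_adapter, key=lambda s: servers[s]["load"])
    return min(servers, key=lambda s: servers[s]["load"])

servers = {
    "pod-a": {"adapters": {"lora-1"}, "load": 3},
    "pod-b": {"adapters": {"lora-2"}, "load": 1},
}
# Requests for lora-1 stick to pod-a even though pod-b is less loaded;
# requests for an unloaded adapter go to the least-loaded server.
```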
I think latency based routing has to do with priority. As far as I'm concerned, the useCases relevant here are the ones with an Objective (Critical). It's more about how, under the hood, we prioritize critical useCases within a BackendPool: which useCase should be routed in priority to the best available Backend.
Let me know what you think.
It does; I try to describe Priority up above. This section was meant just to describe what latency based routing means when we reference it, to help explain the SLO field. On the SLO field we mention that priority is implicitly added.
LMK if that suffices or if you feel we should go into further detail here. Thanks!