This is a glossary that takes a deeper look at terms used within the API proposal, in an effort to give context to the API decisions.
This is a very brief description of the terms used to describe API objects; it is included only in case the glossary is the first doc you are reading.
A BackendPool is a grouping of model servers that serve the same set of fine-tunes (LoRA being the primary example).
A UseCase is an LLM workload that is defined on, and runs on, a BackendPool alongside other UseCases.
Priority specifies the importance of a UseCase relative to other UseCases within a BackendPool.
For our purposes, priority can be thought of in two classes:
- Critical
- Non-Critical
The primary difference is that, in the face of resource scarcity, requests from non-critical UseCases will be rejected in favor of requests from Critical UseCases.
Example:
Your current request load is using 80 Arbitrary Compute Units (ACU) of your pool's total capacity of 100 ACU. Of that load, 40 ACU are critical workload requests and 40 ACU are non-critical. Suppose you lose 30 ACU of capacity due to an unforeseen outage, leaving 70 ACU to serve 80 ACU of load. Priority dictates that the 10 surplus ACU to be rejected would come entirely from the non-critical requests.
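The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the priority rule, not part of the proposal: the function name `shed_load` and its parameters are hypothetical, and it assumes the 80 ACU load splits evenly into 40 ACU critical and 40 ACU non-critical.

```python
def shed_load(critical_acu, non_critical_acu, capacity_acu):
    """Return (critical_rejected, non_critical_rejected), both in ACU.

    Hypothetical helper: non-critical load is rejected first, and critical
    load is rejected only if the non-critical load is fully exhausted.
    """
    # Load that no longer fits within the remaining capacity.
    surplus = max(0, critical_acu + non_critical_acu - capacity_acu)
    # Take as much of the surplus as possible from non-critical requests.
    non_critical_rejected = min(surplus, non_critical_acu)
    # Whatever is left over (if anything) must come from critical requests.
    critical_rejected = surplus - non_critical_rejected
    return critical_rejected, non_critical_rejected


# Pool of 100 ACU loses 30 ACU; load is 40 critical + 40 non-critical.
print(shed_load(40, 40, 100 - 30))  # (0, 10)
```

With a much larger outage (say only 30 ACU of capacity remaining), the same rule would reject all 40 ACU of non-critical load before touching any critical requests.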