# Glossary

This is a glossary that attempts to explain, in more depth, the terms used within the API proposal, in an effort to give context to API decisions.

<!-- toc -->
- [API Terms](#api)
- [LLMServerPool](#llmserverpool)
- [LLMService](#llmservice)
- [Capacity Constrained Routing](#capacity-constrained-routing)
- [Priority](#priority)
- [Fairness](#fairness)
- [General Routing](#general-routing)
- [Latency Based Routing](#latency-based-routing)
- [LoRA Affinity](#lora-affinity)

<!-- /toc -->

## API
This is a very brief description of terms used to describe API objects, included for completeness.

### LLMServerPool
A grouping of model servers that serve the same set of fine-tunes (LoRA as a primary example).

Shortened to: `LSP`

### LLMService
An LLM workload that is defined and runs on an LLMServerPool alongside other use cases.

# Capacity Constrained Routing

## Priority

### Summary
Priority specifies the importance of an LLMService relative to other services within an LLMServerPool.

### Description

For our purposes, priority can be thought of in two classes:
- Critical
- Non-Critical

The primary difference is that requests to non-critical LLMServices will be rejected in favor of critical LLMServices in the face of resource scarcity.

Example:

Your current request load is consuming 80 Arbitrary Compute Units (ACU) of your pool's total capacity of 100 ACU: 40 ACU from critical workload requests and 40 ACU from non-critical requests. If you were to lose 30 ACU of capacity due to an unforeseen outage, 10 ACU of requests would have to be rejected, and priority would dictate that the entirety of those rejections come from the _non-critical_ requests.
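
To make the mechanics concrete, here is a minimal sketch of priority-based load shedding in Go. It is illustrative only, not the proposal's implementation; the `Request` and `shed` names, and the idea of modeling load as integer ACU costs, are assumptions made for this example.

```go
package main

import "fmt"

// Priority classes as described above. These types are illustrative,
// not the proposal's actual API objects.
type Priority int

const (
	NonCritical Priority = iota
	Critical
)

type Request struct {
	Service  string
	Priority Priority
	CostACU  int // load this request contributes, in Arbitrary Compute Units
}

// shed rejects just enough requests to fit the load within capacityACU,
// always shedding non-critical requests before touching critical ones.
func shed(reqs []Request, capacityACU int) (kept, rejected []Request) {
	load := 0
	for _, r := range reqs {
		load += r.CostACU
	}
	drop := make([]bool, len(reqs))
	// Two passes: non-critical requests are considered for rejection
	// first; critical requests only if the pool is still over capacity.
	for _, class := range []Priority{NonCritical, Critical} {
		for i, r := range reqs {
			if load <= capacityACU {
				break
			}
			if r.Priority == class && !drop[i] {
				drop[i] = true
				load -= r.CostACU
			}
		}
	}
	for i, r := range reqs {
		if drop[i] {
			rejected = append(rejected, r)
		} else {
			kept = append(kept, r)
		}
	}
	return kept, rejected
}

func main() {
	// The outage example above: 80 ACU of load (40 critical, 40
	// non-critical) against a capacity that just dropped to 70 ACU.
	reqs := []Request{
		{Service: "critical-svc", Priority: Critical, CostACU: 40},
		{Service: "best-effort-svc", Priority: NonCritical, CostACU: 10},
		{Service: "best-effort-svc", Priority: NonCritical, CostACU: 30},
	}
	kept, rejected := shed(reqs, 70)
	fmt.Println("kept:", kept)         // critical 40 ACU + non-critical 30 ACU
	fmt.Println("rejected:", rejected) // 10 ACU, all non-critical
}
```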

## Fairness

### Summary
Fairness specifies how resources are shared among different LLMServices, in a way that is most acceptable to the user.

### Description

Fairness, like priority, is only used during resource scarcity events.

Fairness is utilized when requests of the same priority class need to be rejected or queued. There are many dimensions that could be considered when sharing resources. To name a few:
- KV-cache utilization
- Total request count
- SLO adherence

For the v1 MVP, the only objective a user can specify is the SLO they would like to meet. So, following that pattern, fairness in the MVP will be considered only in terms of SLO adherence. SLO adherence is measured over a rolling time window of data.

The TTL we are currently assuming is: `5 min`

### Example

**Assumption:** Services have equally weighted fairness for this example.

- Service A has met its SLO for 98% of the requests made in the time window, and Service B has met its SLO for 94%.

- Requests for both Service A and Service B come in at the same time, and there is only capacity to start a single new request in the LSP; this capacity would meet the SLO for either service. The other request would be queued (potentially causing that request to miss its SLO).

- To fairly share these resources, Service B *must* be selected to begin its request immediately, as Service A has had its SLO met a larger percentage of the time. A sketch of this selection logic follows.
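
The following Go snippet sketches that selection: track per-service SLO outcomes over the rolling window and pick the service with the lowest adherence when only one request can start. All names here (`adherenceTracker`, `record`, `pickNext`) are hypothetical; this is an illustration, not the proposal's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// outcome records whether a single request met its SLO, and when.
type outcome struct {
	at     time.Time
	metSLO bool
}

// adherenceTracker keeps per-service request outcomes over a rolling
// window (the 5 min TTL above). Hypothetical, for illustration only.
type adherenceTracker struct {
	window   time.Duration
	outcomes map[string][]outcome
}

func newTracker(window time.Duration) *adherenceTracker {
	return &adherenceTracker{window: window, outcomes: map[string][]outcome{}}
}

func (t *adherenceTracker) record(service string, metSLO bool) {
	t.outcomes[service] = append(t.outcomes[service], outcome{time.Now(), metSLO})
}

// adherence is the fraction of a service's requests inside the window
// that met their SLO.
func (t *adherenceTracker) adherence(service string, now time.Time) float64 {
	met, total := 0, 0
	for _, o := range t.outcomes[service] {
		if now.Sub(o.at) <= t.window {
			total++
			if o.metSLO {
				met++
			}
		}
	}
	if total == 0 {
		return 1.0 // no recent data: treat the service as fully adherent
	}
	return float64(met) / float64(total)
}

// pickNext chooses which service's queued request starts first: the one
// with the lowest recent SLO adherence, as in the example above.
func (t *adherenceTracker) pickNext(services []string) string {
	now := time.Now()
	best := services[0]
	for _, s := range services[1:] {
		if t.adherence(s, now) < t.adherence(best, now) {
			best = s
		}
	}
	return best
}

func main() {
	t := newTracker(5 * time.Minute)
	// Seed the example: A met its SLO on 98 of 100 requests, B on 94.
	for i := 0; i < 100; i++ {
		t.record("A", i < 98)
		t.record("B", i < 94)
	}
	fmt.Println("starts first:", t.pickNext([]string{"A", "B"})) // B
}
```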

# General Routing
Different from the previous definitions, these terms describe methods of routing that are always in effect, and that seek to better utilize compute resources in order to avoid capacity constraints as much as possible.

## Latency Based Routing

### Summary
Latency Based Routing uses data to ensure LLMServices meet their specified SLO.

### Description
Data collected from the model servers, together with data collected from the request itself, is used to predict the time a request will take on a *specific* model server, and to route in a way that will best satisfy the SLOs of the incoming requests.
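
As a minimal sketch of the routing decision itself, assume a latency prediction already exists for each server (how that prediction is produced is out of scope here). The `server` and `route` names, and the pick-lowest-predicted-latency rule, are assumptions for this example, not the proposal's API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// server pairs a backend with a predicted latency for the incoming
// request. The prediction itself (from queue depth, KV-cache use,
// prompt length, ...) is assumed to exist; it is not modeled here.
type server struct {
	name             string
	predictedLatency time.Duration
}

// route picks the server with the lowest predicted latency among those
// predicted to meet the request's SLO.
func route(servers []server, slo time.Duration) (server, error) {
	var best server
	found := false
	for _, s := range servers {
		if s.predictedLatency > slo {
			continue // predicted to miss the SLO; skip this server
		}
		if !found || s.predictedLatency < best.predictedLatency {
			best, found = s, true
		}
	}
	if !found {
		return server{}, errors.New("no server is predicted to meet the SLO")
	}
	return best, nil
}

func main() {
	pool := []server{
		{"pod-a", 900 * time.Millisecond},
		{"pod-b", 400 * time.Millisecond},
		{"pod-c", 2 * time.Second},
	}
	if s, err := route(pool, time.Second); err == nil {
		fmt.Println("routing to", s.name) // pod-b
	}
}
```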

## LoRA Affinity

### Summary
LoRA Affinity describes the routing strategy displayed in the [demo](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458), which better utilizes model servers within the LSP.

### Description
Model servers that support multi-LoRA handle requests on a FCFS (first-come, first-served) basis. By utilizing the data provided by the model server (the state of its loaded LoRA adapters), a routing system can route requests for a given LoRA adapter to a model server that already has that adapter loaded. This creates larger batches than naive routing would, which better utilizes the model server hardware.
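
The affinity rule can be sketched in a few lines of Go. The `poolState` and `pickServer` names are hypothetical, and a real scheduler would also weigh server load; this shows only the prefer-already-loaded behavior.

```go
package main

import "fmt"

// poolState maps each model server to the set of LoRA adapters it
// currently reports as loaded. Hypothetical shape, for illustration.
type poolState map[string]map[string]bool

// pickServer prefers a server that already has the requested adapter
// loaded, so requests for that adapter batch together; otherwise it
// falls back to any server, which will trigger an adapter load there.
func pickServer(state poolState, adapter string) string {
	var fallback string
	for srv, loaded := range state {
		if loaded[adapter] {
			return srv // affinity hit: larger batch, no adapter-load latency
		}
		fallback = srv
	}
	return fallback
}

func main() {
	state := poolState{
		"pod-a": {"sql-lora": true},
		"pod-b": {"chat-lora": true},
	}
	fmt.Println(pickServer(state, "chat-lora")) // pod-b
}
```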
