Commit ad98af2

update experiment result
Signed-off-by: Kuromesi <[email protected]>
1 parent db0e132 commit ad98af2

1 file changed: +66 -0 lines changed

docs/experiments.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
# Experiment Settings

The experiments run on a Kubernetes cluster with 5 nodes; each node has 16 vCPUs, 60 GB of RAM, and an A10 GPU. 10 pods that have loaded the LoRA Llama2 model are running in this cluster.

# Filter chains

+ baseline: LEAST_REQUEST load balancing algorithm
+ maxium: low latency -> lora-affinity -> least queueing -> least-kv-cache
+ simple-queue: least queueing (with minimum queue size)
+ simple-kvcache: least-kv-cache (with minimum kv-cache size)
+ queue+kvcache: least queueing -> least-kv-cache

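To make the chain notation concrete, here is a minimal sketch of how a chain such as `queue+kvcache` (least queueing -> least-kv-cache) might narrow a set of candidate pods. The `Pod` type, its field names, and the reading of "minimum" as "keep only the pods tied for the lowest value" are all assumptions made for illustration; they are not taken from the implementation under test.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Pod:
    # Hypothetical per-pod metrics; the real field names are not given in this document.
    name: str
    queue_size: int        # number of requests waiting on the pod
    kv_cache_usage: float  # fraction of the KV cache currently in use


Filter = Callable[[List[Pod]], List[Pod]]


def least_queueing(pods: List[Pod]) -> List[Pod]:
    """Keep the pods tied for the smallest queue size (assumed semantics)."""
    lowest = min(p.queue_size for p in pods)
    return [p for p in pods if p.queue_size == lowest]


def least_kv_cache(pods: List[Pod]) -> List[Pod]:
    """Keep the pods tied for the lowest KV-cache usage (assumed semantics)."""
    lowest = min(p.kv_cache_usage for p in pods)
    return [p for p in pods if p.kv_cache_usage == lowest]


def run_chain(pods: List[Pod], chain: List[Filter]) -> List[Pod]:
    """Apply each filter in order, stopping once at most one candidate remains."""
    candidates = list(pods)
    for step in chain:
        if len(candidates) <= 1:
            break
        candidates = step(candidates)
    return candidates


pods = [
    Pod("pod-a", queue_size=2, kv_cache_usage=0.40),
    Pod("pod-b", queue_size=0, kv_cache_usage=0.75),
    Pod("pod-c", queue_size=0, kv_cache_usage=0.30),
]
chosen = run_chain(pods, [least_queueing, least_kv_cache])
print([p.name for p in chosen])
```

In this toy input, the queue filter leaves `pod-b` and `pod-c`, and the kv-cache filter then breaks the tie. Ordering the filters differently would generally select a different pod, which is why the chains listed above are compared separately.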
# Test Cases
## base-model

All requests target the base llama2 model only.

### Performance

| | baseline | maxium | simple-queue | simple-kvcache | queue+kvcache |
| --- | --- | --- | --- | --- | --- |
| requests per minute | 80.09 | 91.68<br/>88.87<br/>93.35<br/>89.70<br/>85.18 | 80.81<br/>80.26 | 89.32<br/>90.10 | 93.21 |
| average time to first token (ttft) | 2.99 | 0.29<br/>0.62<br/>0.87<br/>0.75<br/>0.39 | 1.42<br/>1.10 | 0.85<br/>0.71 | 0.8 |
| ttft P95 | 17.07 | 0.43<br/>1.38<br/>2.21<br/>1.49<br/>0.47 | 5.9<br/>4.52 | 1.78<br/>0.69 | 1.9 |
| average inter-token latency | 0.15 | 0.05<br/>0.05<br/>0.07<br/>0.06<br/>0.05 | 0.09<br/>0.08 | 0.06<br/>0.07 | 0.06 |
| average token throughput per second | 16.95 | 18.20<br/>17.33<br/>16.90<br/>17.01<br/>18.04 | 16.25<br/>17.34 | 16.69<br/>18.31 | 16.61 |
| average end-to-end latency | 25.09 | 21.21<br/>21.85<br/>21.40<br/>21.91<br/>23.27 | 23.69<br/>24.99 | 21.83<br/>22.72 | 20.45 |
| cache utilization | | 73.6% | 74.7% | 77.1% | |
| queue time | | 38.8ms | 236ms | 129ms | |
| preemptions total | | 1047 | 1452 | 784 | |

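Several cells in the base-model table hold more than one value separated by `<br/>`. Assuming these are repeated runs of the same chain (the document does not say so explicitly), a small sketch like the following averages them and expresses the change relative to the baseline. The helper name `parse_cell` is made up for this example; the numbers are copied from the "average time to first token (ttft)" row.

```python
from statistics import mean


def parse_cell(cell: str) -> list[float]:
    """Split a table cell such as '0.29<br/>0.62' into a list of floats."""
    return [float(part) for part in cell.split("<br/>") if part.strip()]


# Values copied from the base-model "average time to first token (ttft)" row.
baseline_ttft = parse_cell("2.99")
maxium_ttft = parse_cell("0.29<br/>0.62<br/>0.87<br/>0.75<br/>0.39")

baseline_mean = mean(baseline_ttft)
maxium_mean = mean(maxium_ttft)

# Fraction by which the mean ttft drops relative to the baseline run.
reduction = 1 - maxium_mean / baseline_mean

print(f"baseline mean ttft: {baseline_mean:.3f}")
print(f"maxium mean ttft:   {maxium_mean:.3f}")
print(f"relative reduction: {reduction:.1%}")
```

With only one to five runs per chain, the mean is a rough summary at best; the spread between runs (for example 0.29 to 0.87 for `maxium`) is worth keeping in view when comparing chains.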
## multi-lora

Requests to the llama2 model with 10 LoRA adapters, using 40 connections to different LoRA adapters.

| | baseline | maxium | simple-queue | simple-kvcache | queue+kvcache | lora+queue | lora+kvcache |
| --- | --- | --- | --- | --- | --- | --- | --- |
| requests per minute | 54.28<br/>50.554 | 64.29<br/>57.76<br/>57.77 | 49.49 | 46.13<br/>32.68 | 39.14 | 40.84 | 39.88 |
| average time to first token (ttft) | 13.55<br/>11.60 | 4.43<br/>4.49<br/>5.89 | 12.49 | 15.13<br/>14.57 | 11.18 | 5.94 | 10.99 |
| ttft P95 | 56.64<br/>47.87 | 24.60<br/>23.43<br/>33.39 | 43.62 | 96.25<br/>165.83 | 62.17 | 27.39 | 50.76 |
| average inter-token latency | 0.39<br/>0.40 | 0.13<br/>0.13<br/>0.19 | 0.38 | 0.25<br/>0.28 | 0.34 | 0.15 | 0.36 |
| average token throughput per second | 14.30<br/>14.63 | 15.78<br/>15.65<br/>15.20 | 13.49 | 16.50<br/>15.87 | 16.24 | 14.89 | 15.98 |
| average end-to-end latency | 38.65<br/>37.04 | 32.30<br/>30.23<br/>31.58 | 36.39 | 41.81<br/>43.13 | 37.22 | 36.00 | 43.26 |
| average cache utilization | 57.9%<br/>55.5% | 72.8%<br/>61.1%<br/>67.3% | 51.2% | 53.1%<br/>44.8% | 49.9% | 51.0% | 54.9% |
| average queue time | 2.47s<br/>1.91s | 776ms<br/>775ms<br/>457ms | 1.91s | 1.23s<br/>859ms | 863ms | 394ms | 481ms |
| preemptions total | | 720 | | | | 880 | 824 |

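The "average queue time" row in the multi-lora table mixes seconds and milliseconds, which makes columns hard to compare at a glance. The sketch below normalizes such cells to milliseconds before averaging. The helper names `to_ms` and `mean_ms` are hypothetical, and treating the `<br/>`-separated values as repeated runs is an assumption.

```python
from statistics import mean


def to_ms(value: str) -> float:
    """Convert a duration string such as '2.47s' or '776ms' to milliseconds."""
    value = value.strip()
    # Check the 'ms' suffix first, since 'ms' also ends with 's'.
    if value.endswith("ms"):
        return float(value[:-2])
    if value.endswith("s"):
        return float(value[:-1]) * 1000.0
    raise ValueError(f"unrecognized duration: {value!r}")


def mean_ms(cell: str) -> float:
    """Average all '<br/>'-separated durations in a table cell, in milliseconds."""
    return mean(to_ms(part) for part in cell.split("<br/>") if part.strip())


# Values copied from the multi-lora "average queue time" row.
baseline_queue_ms = mean_ms("2.47s<br/>1.91s")
maxium_queue_ms = mean_ms("776ms<br/>775ms<br/>457ms")

print(f"baseline: {baseline_queue_ms:.0f} ms")
print(f"maxium:   {maxium_queue_ms:.0f} ms")
```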
## lite-multi-lora

Requests to the llama2 model with 4 LoRA adapters, using 40 connections to different LoRA adapters.

| | baseline | maxium | simple-queue | simple-kvcache | queue+kvcache | lora+queue | lora+kvcache |
| --- | --- | --- | --- | --- | --- | --- | --- |
| requests per minute | 96.34 | 95.77 | 107.34 | 109.50 | 112.34 | 99.27 | 107.55 |
| average time to first token (ttft) | 3.01 | 1.48 | 0.96 | 0.49 | 0.35 | 2.33 | 1.65 |
| ttft P95 | 18.36 | 8.64 | 16.39 | 0.63 | 0.57 | 13.38 | 9.64 |
| average inter-token latency | 0.12 | 0.11 | 0.08 | 0.05 | 0.05 | 0.13 | 0.09 |
| average token throughput per second | 16.75 | 17.09 | 16.93 | 17.94 | 17.89 | 14.74 | 16.37 |
| average end-to-end latency | 20.19 | 19.46 | 18.33 | 17.01 | 17.15 | 20.54 | 18.58 |
| average cache utilization | 59.2% | 68.3% | 72.4% | 72.7% | 74.7% | 65.9% | 65.8% |
| average queue time | 898ms | 464ms | 279ms | 74.3ms | 39ms | 700ms | 501ms |
| preemptions total | 832 | 959 | 923 | 416 | 524 | 1276 | 928 |
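Since each lite-multi-lora cell holds a single value, the chains can be ranked directly per metric. As a small illustration, the sketch below orders them by average time to first token, where lower is better. The dictionary is just a hand-copied view of that one table row, not an artifact of the experiment tooling.

```python
# Values copied from the lite-multi-lora "average time to first token (ttft)" row.
avg_ttft = {
    "baseline": 3.01,
    "maxium": 1.48,
    "simple-queue": 0.96,
    "simple-kvcache": 0.49,
    "queue+kvcache": 0.35,
    "lora+queue": 2.33,
    "lora+kvcache": 1.65,
}

# Sort chain names by their ttft, lowest (best) first.
ranking = sorted(avg_ttft, key=avg_ttft.get)

for rank, chain in enumerate(ranking, start=1):
    print(f"{rank}. {chain}: {avg_ttft[chain]:.2f}")
```

A ranking on one metric does not settle the comparison: the same chain can lead on ttft and trail on preemptions, so each row is best read alongside the others.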
