Skip to content

Files

Latest commit

Jan 20, 2025
ad98af2 · Jan 20, 2025

History

History
66 lines (49 loc) · 3.77 KB

experiments.md

File metadata and controls

66 lines (49 loc) · 3.77 KB

Experiment Settings

A kubernetes cluster with 5 nodes, and each node has 16 vCPUs, 60 GB of RAM, and an A10 GPU. 10 pods that loaded the Lora Llama2 model are running in this cluster.

Filter chains

  • baseline:LEAST_REQUEST load balancing algorithm
  • maxium:low latency -> lora-affinity -> least queueing -> least-kv-cache
  • simple-queue:least queueing (with minimum queue size)
  • simple-kvcache:least-kv-cache (with minimum kv-cache size)
  • queue + kvcache:least queueing -> least-kv-cache

Test Cases

base-model

Only request llama2 model.

Performance

baseline maxium simple-queue simple-kvcache queue+kvcache
requests per minute 80.09 91.68
88.87
93.35
89.70
85.18
80.81
80.26
89.32
90.10
93.21
average time to first token (ttft) 2.99 0.29
0.62
0.87
0.75
0.39
1.42
1.10
0.85
0.71
0.8
ttft P95 17.07 0.43
1.38
2.21
1.49
0.47
5.9
4.52
1.78
0.69
1.9
Average inter token latency 0.15 0.05
0.05
0.07
0.06
0.05
0.09
0.08
0.06
0.07
0.06
average token throughoutput per second 16.95 18.20
17.33
16.90
17.01
18.04
16.25
17.34
16.69
18.31
16.61
average end to end latency 25.09 21.21
21.85
21.40
21.91
23.27
23.69
24.99
21.83
22.72
20.45
cache utilization 73.6% 74.7% 77.1%
queue time 38.8ms 236ms 129ms
preemptions total 1047 1452 784

multi-lora

Requests to llama2 model with 10 lora, with 40 connections to different lora

baseline maxium simple-queue simple-kvcache queue+kvcache lora+queue lora+kvcache
requests per minute 54.28
50.554
64.29
57.76
57.77
49.49 46.13
32.68
39.14 40.84 39.88
average time to first token (ttft) 13.55
11.60
4.43
4.49
5.89
12.49 15.13
14.57
11.18 5.94 10.99
ttft P95 56.64
47.87
24.60
23.43
33.39
43.62 96.25
165.83
62.17 27.39 50.76
Average inter token latency 0.39
0.40
0.13
0.13
0.19
0.38 0.25
0.28
0.34 0.15 0.36
average token throughoutput per second 14.30
14.63
15.78
15.65
15.20
13.49 16.50
15.87
16.24 14.89 15.98
average end to end latency 38.65
37.04
32.30
30.23
31.58
36.39 41.81
43.13
37.22 36.00 43.26
average cache utilization 57.9%
55.5%
72.8%
61.1%
67.3%
51.2% 53.1%
44.8%
49.9% 51.0% 54.9%
average queue time 2.47s
1.91s
776ms
775ms
457ms
1.91s 1.23s
859ms
863ms 394ms 481ms
preemptions total 720 880 824

lite-multi-lora

Requests to llama2 model with 4 lora, with 40 connections to different lora

baseline maxium simple-queue simple-kvcache queue+kvcache lora+queue lora+kvcache
requests per minute 96.34 95.77 107.34 109.50 112.34 99.27 107.55
average time to first token (ttft) 3.01 1.48 0.96 0.49 0.35 2.33 1.65
ttft P95 18.36 8.64 16.39 0.63 0.57 13.38 9.64
Average inter token latency 0.12 0.11 0.08 0.05 0.05 0.13 0.09
average token throughoutput per second 16.75 17.09 16.93 17.94 17.89 14.74 16.37
average end to end latency 20.19 19.46 18.33 17.01 17.15 20.54 18.58
average cache utilization 59.2% 68.3% 72.4% 72.7% 74.7% 65.9% 65.8%
average queue time 898ms 464ms 279ms 74.3ms 39ms 700ms 501ms
preemptions total 832 959 923 416 524 1276 928