
Commit 08d038b

Merge pull request #4 from achandrasekar/design
Add design document to the repo
2 parents db97980 + 4165fe9 commit 08d038b

File tree

2 files changed: +76 −0

Diff for: docs/design.md

# Design

This document describes the high-level design for the tool. It includes the
following components.

## Dataset Preprocessor

The Dataset Preprocessor takes a known dataset like ShareGPT or OpenOrca as
input and pre-processes it, making sure the prompt length and generation
length are aligned with the user input. This supports options like fixed
input/output length tests and variable-length tests (larger input / smaller
output and vice versa), which lets us cover different GenAI use cases such as
chat completion, summarization, and code completion, depending on the dataset
and the benchmarking user's inputs.

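As an illustration, here is a minimal preprocessing sketch in Python (the
`Sample` shape, the `load_sharegpt` helper, and the length bounds are
assumptions for illustration, not part of the design):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    prompt_tokens: int
    output_tokens: int

def preprocess(samples: list[Sample], min_in: int, max_in: int,
               min_out: int, max_out: int) -> list[Sample]:
    """Keep only samples whose input/output lengths match the test shape."""
    return [
        s for s in samples
        if min_in <= s.prompt_tokens <= max_in
        and min_out <= s.output_tokens <= max_out
    ]

# e.g. a "larger input / smaller output" summarization-style test:
# filtered = preprocess(load_sharegpt(), 1000, 2000, 50, 200)
```
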
## Load Generator

The Load Generator is the component that generates different traffic patterns
based on user input. This can include a fixed-RPS test for a predetermined
amount of time, as well as ways to generate bursts in traffic or other traffic
patterns desired for autoscaling and other use cases.

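For example, a fixed-RPS pattern could be generated with asyncio along these
lines (a sketch; `send_request` stands in for whatever async callable the
Request Processor exposes):

```python
import asyncio
import time

async def run_fixed_rps(send_request, rps: float, duration_s: float) -> None:
    """Fire one request every 1/rps seconds, without waiting for responses."""
    interval = 1.0 / rps
    deadline = time.monotonic() + duration_s
    tasks = []
    while time.monotonic() < deadline:
        tasks.append(asyncio.create_task(send_request()))
        await asyncio.sleep(interval)
    await asyncio.gather(*tasks)

# A burst pattern could reuse the same loop with a time-varying interval.
```
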
## Request Processor

The Request Processor provides a way to support different model servers and
their corresponding request payloads with different configurable parameters.
This makes our tool model server agnostic and provides a generic way to
benchmark different model servers and produce apples-to-apples comparisons
between them. This component will also support different protocols like HTTP
and gRPC, as well as options like request streaming, which is important for
producing the time to first token (TTFT) metric.

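A sketch of what server-agnostic payload construction might look like (the
payload shapes follow the publicly documented vLLM OpenAI-compatible and TGI
HTTP APIs, but the mapping itself is illustrative):

```python
def build_payload(server: str, model: str, prompt: str, max_tokens: int,
                  stream: bool = True) -> dict:
    """Map one generic benchmark request onto a server-specific payload."""
    if server == "vllm":  # OpenAI-compatible completions API
        return {"model": model, "prompt": prompt,
                "max_tokens": max_tokens, "stream": stream}
    if server == "tgi":   # Hugging Face Text Generation Inference
        return {"inputs": prompt,
                "parameters": {"max_new_tokens": max_tokens}}
    raise ValueError(f"unsupported model server: {server}")
```

Enabling streaming in the payload is what lets the Response Processor observe
when the first token arrives and therefore measure TTFT.
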
## Response Processor / Data Collector

The Response Processor / Data Collector component allows us to process the
responses and measure the actual performance of the model server in terms of
request latency, TPOT, TTFT, and throughput.

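For a streamed response, TTFT and TPOT can be derived from per-token arrival
timestamps, roughly as follows (a sketch; the client is assumed to record
these timestamps as chunks arrive):

```python
def compute_latency_metrics(start: float, token_times: list[float]) -> dict:
    """start: request send time; token_times: arrival time of each token."""
    ttft = token_times[0] - start       # time to first token
    e2e = token_times[-1] - start       # end-to-end request latency
    n = len(token_times)
    tpot = (e2e - ttft) / (n - 1) if n > 1 else 0.0  # time per output token
    return {"ttft_s": ttft, "tpot_s": tpot,
            "latency_s": e2e, "output_tokens": n}
```
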
## Report Generator / Metrics Exporter

The Report Generator / Metrics Exporter generates a report based on the data
collected during benchmarking. It can also export the metrics collected during
benchmarking into Prometheus, where they can be consumed by other monitoring
or visualization solutions.

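A minimal export sketch using the prometheus_client library (the metric names
are placeholders, not a fixed schema):

```python
from prometheus_client import Histogram, start_http_server

# Hypothetical metric names; the real schema is up to the tool.
TTFT = Histogram("benchmark_ttft_seconds", "Time to first token")
LATENCY = Histogram("benchmark_request_latency_seconds", "End-to-end latency")

def record(result: dict) -> None:
    TTFT.observe(result["ttft_s"])
    LATENCY.observe(result["latency_s"])

start_http_server(9090)  # expose /metrics for Prometheus to scrape
```
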
![benchmarking-tool-architecture](./images/design.png)
## Metrics to Collect

The following are the essential metrics that we want to collect using the
benchmarking tool; a sketch of the percentile summaries follows the list.

* Throughput
  * Output tokens / second
  * Input tokens / second
  * Requests / second
* Latency at different percentiles (mean, median, p90, p99)
  * Time per output token (TPOT)
  * Inter-token latency (ITL)
  * Time to first token (TTFT)
  * Time per request
* Request metrics (mean, median, p90, p99)
  * Prompt tokens
  * Output tokens

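As noted above, the (mean, median, p90, p99) summaries for each of these
metrics might be computed like this (a sketch using numpy):

```python
import numpy as np

def summarize(values: list[float]) -> dict:
    """Mean/median/p90/p99 summary for one latency or request metric."""
    a = np.asarray(values)
    return {"mean": float(a.mean()),
            "median": float(np.percentile(a, 50)),
            "p90": float(np.percentile(a, 90)),
            "p99": float(np.percentile(a, 99))}
```
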
Optionally, we also want to collect specific accelerator and model server
metrics; a collection sketch follows the list below.

* Accelerator metrics (mean, median, p90, p99)
  * Accelerator utilization (duty cycle)
  * Accelerator memory utilization
  * Accelerator memory bandwidth utilization
  * Accelerator power usage
* Model server metrics (mean, median, p90, p99)
  * Batch size
  * Queue size
  * KV cache usage
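
As mentioned above, one way to sample accelerator metrics on NVIDIA GPUs is to
poll nvidia-smi (a sketch; other accelerators and the model server metrics
would need their own collectors):

```python
import subprocess

def sample_gpu_metrics() -> dict:
    """Take one utilization/memory/power sample via nvidia-smi."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu,utilization.memory,power.draw",
        "--format=csv,noheader,nounits",
    ], text=True)
    util, mem_util, power = out.strip().splitlines()[0].split(", ")
    return {"gpu_util_pct": float(util),
            "mem_util_pct": float(mem_util),
            "power_w": float(power)}
```
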

Diff for: docs/images/design.png

52.6 KB

0 commit comments
