# Design

This document describes the high-level design of the tool, which consists of
the following components.

## Dataset Preprocessor

The Dataset Preprocessor takes a known dataset such as ShareGPT or OpenOrca as
input and pre-processes it so that prompt and generation lengths match the
user's configuration. This supports options such as fixed input/output length
tests and variable length tests (larger input with smaller output, and vice
versa), which lets the tool cover different GenAI use cases like chat
completion, summarization, and code completion, depending on the dataset and
the benchmarking user's inputs.

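To make the length alignment concrete, here is a minimal sketch that filters a
ShareGPT-style file against user-configured input/output token bounds. The
field names, the `LengthConfig` shape, and the whitespace-based token count are
illustrative assumptions rather than the tool's actual implementation (a real
run would use the model's tokenizer).

```python
import json
from dataclasses import dataclass


@dataclass
class LengthConfig:
    # Hypothetical user-facing knobs for fixed or variable length tests.
    min_input_tokens: int = 4
    max_input_tokens: int = 1024
    min_output_tokens: int = 4
    max_output_tokens: int = 1024


def approx_tokens(text: str) -> int:
    # Stand-in for a real tokenizer; good enough to illustrate the filtering.
    return len(text.split())


def preprocess(path: str, cfg: LengthConfig) -> list[dict]:
    """Keep only samples whose prompt/output lengths fall inside the bounds."""
    with open(path) as f:
        conversations = json.load(f)
    samples = []
    for conv in conversations:
        turns = conv.get("conversations", [])
        if len(turns) < 2:
            continue
        prompt, completion = turns[0]["value"], turns[1]["value"]
        in_len, out_len = approx_tokens(prompt), approx_tokens(completion)
        if (cfg.min_input_tokens <= in_len <= cfg.max_input_tokens
                and cfg.min_output_tokens <= out_len <= cfg.max_output_tokens):
            samples.append({"prompt": prompt, "max_output_tokens": out_len})
    return samples
```
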
## Load Generator

The Load Generator produces different traffic patterns based on user input.
This can be a fixed RPS test for a predetermined duration, bursts of traffic,
or other patterns needed for autoscaling and similar use cases.

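As an illustration, a fixed-RPS schedule can be driven by an asynchronous loop
that fires requests at constant intervals without waiting for responses, while
exponentially distributed inter-arrival times approximate bursty (Poisson)
traffic at the same average rate. This is a sketch that assumes an async
`send_request` callable; it is not the tool's actual scheduler.

```python
import asyncio
import random
import time


async def run_load(send_request, rps: float, duration_s: float,
                   poisson: bool = False) -> None:
    """Fire requests at roughly `rps` for `duration_s` seconds."""
    start = time.monotonic()
    in_flight = []
    while time.monotonic() - start < duration_s:
        # Do not await the request itself; response latency must not throttle the load.
        in_flight.append(asyncio.create_task(send_request()))
        interval = random.expovariate(rps) if poisson else 1.0 / rps
        await asyncio.sleep(interval)
    await asyncio.gather(*in_flight, return_exceptions=True)
```
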
## Request Processor

The Request Processor supports different model servers and their corresponding
request payloads with configurable parameters. This makes the tool model-server
agnostic and provides a generic way to benchmark different model servers and
produce an apples-to-apples comparison between them. The component also
supports different protocols such as HTTP and gRPC, as well as options like
request streaming, which is required to measure the time to first token (TTFT)
metric.

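One way to keep the rest of the tool server-agnostic is to hide each server's
request format behind a small adapter. The sketch below is illustrative: the
payload shapes approximate an OpenAI-compatible completions endpoint (e.g.
vLLM) and Hugging Face TGI, and the server names are assumptions rather than
the tool's actual adapter interface.

```python
def build_payload(server: str, prompt: str, max_tokens: int,
                  stream: bool = True) -> dict:
    """Map one benchmark request onto a server-specific request body."""
    if server == "openai-completions":
        # OpenAI-compatible /v1/completions style body (e.g. vLLM).
        return {"prompt": prompt, "max_tokens": max_tokens, "stream": stream}
    if server == "tgi":
        # Text Generation Inference /generate or /generate_stream style body.
        return {"inputs": prompt, "parameters": {"max_new_tokens": max_tokens}}
    raise ValueError(f"unsupported model server: {server}")
```
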
## Response Processor / Data Collector

The Response Processor / Data Collector processes responses and measures the
actual performance of the model server in terms of request latency, time per
output token (TPOT), TTFT, and throughput.

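With streaming enabled, per-request metrics can be derived from the timestamps
at which output tokens arrive: TTFT is the gap from request start to the first
token, and TPOT is the remaining latency averaged over the remaining tokens. A
minimal sketch, assuming per-token arrival timestamps are recorded by the
client:

```python
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    ttft_s: float       # time to first token
    tpot_s: float       # average time per output token after the first
    latency_s: float    # end-to-end request latency
    output_tokens: int


def compute_metrics(request_start: float, token_times: list[float]) -> RequestMetrics:
    """Derive latency metrics from monotonic timestamps of streamed tokens."""
    ttft = token_times[0] - request_start
    latency = token_times[-1] - request_start
    n = len(token_times)
    tpot = (latency - ttft) / (n - 1) if n > 1 else 0.0
    return RequestMetrics(ttft_s=ttft, tpot_s=tpot, latency_s=latency,
                          output_tokens=n)
```
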
## Report Generator / Metrics Exporter

The Report Generator / Metrics Exporter generates a report from the data
collected during benchmarking. It can also export the collected metrics to
Prometheus, where they can be consumed by other monitoring or visualization
solutions.

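For example, the per-request results could be published on a scrape endpoint
with the `prometheus_client` library. The metric names and bucket boundaries
below are illustrative assumptions, not the tool's actual metric schema.

```python
from prometheus_client import Counter, Histogram, start_http_server

TTFT_SECONDS = Histogram("benchmark_ttft_seconds", "Time to first token",
                         buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0))
REQUEST_LATENCY = Histogram("benchmark_request_latency_seconds",
                            "End-to-end request latency")
OUTPUT_TOKENS = Counter("benchmark_output_tokens_total",
                        "Total output tokens generated")


def start_exporter(port: int = 8000) -> None:
    # Expose /metrics for Prometheus to scrape while the benchmark runs.
    start_http_server(port)


def record(metrics) -> None:
    # `metrics` is a per-request result such as the RequestMetrics above.
    TTFT_SECONDS.observe(metrics.ttft_s)
    REQUEST_LATENCY.observe(metrics.latency_s)
    OUTPUT_TOKENS.inc(metrics.output_tokens)
```
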
## Metrics to Collect

The following are the essential metrics that we want to collect using the
benchmarking tool; a sketch of the percentile summary used for these metrics
follows the lists below.

* Throughput
    * Output tokens / second
    * Input tokens / second
    * Requests / second
* Latency at different percentiles (mean, median, p90, p99)
    * Time per output token (TPOT)
    * Inter-token latency (ITL)
    * Time to first token (TTFT)
    * Time per request
* Request metrics (mean, median, p90, p99)
    * Prompt tokens
    * Output tokens

Optionally, we also want to collect specific accelerator and model server
metrics.

* Accelerator metrics (mean, median, p90, p99)
    * Accelerator utilization (duty cycle)
    * Accelerator memory utilization
    * Accelerator memory bandwidth utilization
    * Accelerator power usage
* Model server metrics (mean, median, p90, p99)
    * Batch size
    * Queue size
    * KV cache usage

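Each latency, request, accelerator, and model server metric above is reported
as a mean/median/p90/p99 summary over the collected samples. A minimal sketch
of that summary using the Python standard library (the function name is
illustrative):

```python
import statistics


def summarize(samples: list[float]) -> dict[str, float]:
    """Mean/median/p90/p99 summary over raw per-request samples (needs >= 2 samples)."""
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points; cuts[89] ~ p90
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p90": cuts[89],
        "p99": cuts[98],
    }
```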