Commit e1dd5b6 (parent 7e51e02): update benchmark index.md
1 file changed: 23 additions & 38 deletions

# Benchmark

This user guide shows how to run benchmarks against a vLLM deployment, using both the Gateway API inference extension and a Kubernetes service as the load-balancing strategy. The benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG) tool to generate load and collect results.

## Prerequisites

### Deploy the inference extension and sample model server

Follow the [getting started guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/) to deploy the sample vLLM application and the inference extension.

### [Optional] Scale the sample vLLM deployment

You are more likely to see the benefits of the inference extension when there is a decent number of replicas to choose from when making the optimal routing decision.

```bash
kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
```

### Expose the model server via a k8s service

As the baseline, also expose the vLLM deployment as a k8s service:

```bash
kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer
```

## Run benchmark

The LPG benchmark tool works by sending traffic to the specified target IP and port and collecting the results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances to run benchmarks in parallel against different targets.
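
The core of what a load generator does can be sketched in a few lines. The snippet below is a local simulation only: the latencies are random stand-ins and no real requests are sent, so it is not part of the LPG tool, just the measurement pattern.

```shell
# Conceptual sketch of a load generator's measurement loop: issue
# requests and record per-request latency. The latencies here are
# random stand-ins; LPG sends real requests to the target IP:port.
python3 - <<'EOF'
import random, statistics
random.seed(0)
latencies = [random.uniform(0.05, 0.15) for _ in range(20)]
p50 = statistics.median(latencies)
print(f"requests={len(latencies)} p50={p50*1000:.0f}ms")
EOF
```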

1. Check out the repo.

    ```bash
    git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
    cd gateway-api-inference-extension
    ```

1. Get the target IP. The examples below show how to get the IP of a gateway or of a LoadBalancer k8s service.

    ```bash
    # Get gateway IP
    # ...
    echo $SVC_IP
    ```

1. Update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to your target IP. Feel free to adjust other parameters, such as `request_rates`, as well. For a complete list of LPG configurations, refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).

1. Start the benchmark tool: `kubectl apply -f ./config/manifests/benchmark/benchmark.yaml`

1. Wait for the benchmark to finish and download the results. Use the `benchmark_id` environment variable to specify what this benchmark is for, for instance `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it prints a log line `LPG_FINISHED`; the script below watches for that log line and then starts downloading the results.

    ```bash
    benchmark_id='my-benchmark' ./tools/benchmark/download-benchmark-results.bash
    ```

1. After the script finishes, you should see the benchmark results under the `./tools/benchmark/output/default-run/my-benchmark/results/json` folder. Here is a [sample json file](./sample.json).
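
The "watch for `LPG_FINISHED`" behavior can be sketched as a small polling loop. This is an illustrative simulation: a local file and a background writer stand in for the LPG pod's log stream, which the real download script would read instead (e.g. via `kubectl logs`).

```shell
# Simulated watch-for-completion loop. In the real script the log
# source would be the LPG pod's logs, not a local file.
logfile=$(mktemp)
( sleep 1; echo "LPG_FINISHED" >> "$logfile" ) &   # stand-in log writer
until grep -q "LPG_FINISHED" "$logfile" 2>/dev/null; do
  sleep 0.2                                        # poll interval
done
echo "benchmark finished; starting download"
```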

### Tips

* You can specify the `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script. This is useful when you run benchmarks multiple times to get more statistically meaningful results and want to group them accordingly.
* Update the `request_rates` to best suit your benchmark environment.
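
Results grouped under different `run_id`s can then be summarized with a short script. The sketch below uses a temporary directory and a made-up `throughput` field, so the directory names and field name are illustrative assumptions, not the real LPG output schema; only the aggregation pattern is the point.

```shell
# Hypothetical aggregation across repeated runs. Directory names and
# the "throughput" field are illustrative, not the real LPG schema.
base=$(mktemp -d)
mkdir -p "$base/run1" "$base/run2"
echo '{"throughput": 10.0}' > "$base/run1/result.json"
echo '{"throughput": 14.0}' > "$base/run2/result.json"
python3 - "$base" <<'EOF'
import glob, json, statistics, sys
vals = [json.load(open(p))["throughput"]
        for p in sorted(glob.glob(f"{sys.argv[1]}/*/result.json"))]
print(f"runs={len(vals)} mean_throughput={statistics.mean(vals)}")
EOF
```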

### Advanced Benchmark Configurations

Refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a detailed list of configuration knobs.

## Analyze the results

This guide shows how to run the Jupyter notebook using VS Code.

1. Create a Python virtual environment.

    ...

1. Open the notebook `./tools/benchmark/benchmark.ipynb` and run each cell. In the last cell, update the benchmark ids to `inference-extension` and `k8s-svc`. At the end you should see a bar chart like the one below, where **"ie"** represents the inference extension. This chart was generated using this benchmarking tool with 6 vLLM (v1) model servers (H100 80 GB), [llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) and the [ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json).

![alt text](example-bar-chart.png)
