Add instructions to run benchmarks #480
@@ -0,0 +1,110 @@
# Benchmark

This user guide shows how to run benchmarks against a vLLM deployment, using both the Gateway API
inference extension and a plain Kubernetes Service as the load balancing strategy. The
benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG)
tool to generate load and collect results.

## Prerequisites

### Deploy the inference extension and sample model server

Follow the user guide at https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the
sample vLLM application and the inference extension.

### [Optional] Scale the sample vLLM deployment

You are more likely to see the benefits of the inference extension when there are enough replicas for it to make optimal routing decisions.

```bash
kubectl scale deployment my-pool --replicas=8
```

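To confirm the scale-out has completed before you start benchmarking, you can wait for the rollout (a minimal check, assuming the `my-pool` deployment name from the step above):

```bash
kubectl rollout status deployment/my-pool
```
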
### Expose the model server via a k8s service

As the baseline, expose the vLLM deployment as a Kubernetes service by applying the manifest:

```bash
kubectl apply -f ./manifests/ModelServerService.yaml
```

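Because the service is of type `LoadBalancer`, it can take a minute or two for an external IP to be assigned. A quick way to check (assuming the service name from the manifest) is:

```bash
kubectl get service my-pool-service --watch
```
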
## Run benchmark

### Run benchmark using the inference extension as the load balancing strategy

1. Get the gateway IP:

```bash
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
echo "Update the <gateway-ip> in ./manifests/BenchmarkInferenceExtension.yaml to: $IP"
```

1. Update the `<gateway-ip>` in `./manifests/BenchmarkInferenceExtension.yaml` to the IP
of the gateway. Feel free to adjust other parameters such as `request_rates` as well; a substitution sketch follows below.

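If you prefer not to edit the file by hand, the placeholder can be substituted in place (a minimal sketch, assuming GNU sed and that `<gateway-ip>` appears verbatim in the manifest):

```bash
sed -i "s/<gateway-ip>/${IP}/" ./manifests/BenchmarkInferenceExtension.yaml
```
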
1. Start the benchmark tool: `kubectl apply -f ./manifests/BenchmarkInferenceExtension.yaml`. A quick status check follows below.

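To verify the tool's pod is up before waiting on results (the `app=benchmark-tool` label comes from the benchmark manifest):

```bash
kubectl get pods -l app=benchmark-tool
```
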
1. Wait for the benchmark to finish and download the results. Use the `benchmark_id` environment variable
to specify what this benchmark is for. In this case, the result is for the `inference-extension`; you
can use any id you like. When the LPG tool finishes benchmarking, it prints a log line `LPG_FINISHED`.
The script below watches for that log line and then starts downloading results.

```bash
benchmark_id='inference-extension' ./download-benchmark-results.bash
```

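While the script waits, you can also check progress yourself by looking for the same marker in the tool's logs (this assumes the default namespace; add `-n <namespace>` otherwise):

```bash
kubectl logs deployment/benchmark-tool | grep LPG_FINISHED
```
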
1. After the script finishes, you should see benchmark results under the `./output/default-run/inference-extension/results/json` folder.

### Run benchmark using k8s service as the load balancing strategy

1. Get the service IP:

```bash
IP=$(kubectl get service/my-pool-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Update the <svc-ip> in ./manifests/BenchmarkK8sService.yaml to: $IP"
```

1. Update the `<svc-ip>` in `./manifests/BenchmarkK8sService.yaml` to the IP
of the service. Feel free to adjust other parameters such as `request_rates` as well; a substitution sketch follows below.

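As with the gateway manifest, the placeholder can be substituted in place rather than edited by hand (again a sketch assuming GNU sed and that `<svc-ip>` appears verbatim):

```bash
sed -i "s/<svc-ip>/${IP}/" ./manifests/BenchmarkK8sService.yaml
```
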
1. Start the benchmark tool: `kubectl apply -f ./manifests/BenchmarkK8sService.yaml`

1. Wait for the benchmark to finish and download the results.

```bash
benchmark_id='k8s-svc' ./download-benchmark-results.bash
```

1. After the script finishes, you should see benchmark results under the `./output/default-run/k8s-svc/results/json` folder.

### Tips

* You can set the `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script.
  This is useful when you run benchmarks multiple times and want to group the results accordingly; see the example below.
* Update the `request_rates` to values that best suit your benchmark environment.

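For example, to keep a second round of runs separate from the default one (both variables are read by the download script):

```bash
run_id='run2' benchmark_id='inference-extension' ./download-benchmark-results.bash
```
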
## Advanced Benchmark Configurations

Please refer to https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark for a detailed list of configuration knobs.

## Analyze the results

This guide shows how to run the Jupyter notebook using VS Code.

1. Create a Python virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
```

1. Install the dependencies:

```bash
pip install -r requirements.txt
```

1. Open the notebook `Inference_Extension_Benchmark.ipynb` and run each cell. At the end you should
see a bar chart like the one below:

*(figure: bar chart of benchmark results)*
@@ -0,0 +1,29 @@
#!/bin/bash

# Downloads the benchmark result files from the benchmark tool pod.
download_benchmark_results() {
  # Wait until the benchmark tool logs the LPG_FINISHED marker.
  until kubectl logs deployment/benchmark-tool -n "${namespace}" | grep -q -m 1 "LPG_FINISHED"; do sleep 30; done
  benchmark_pod=$(kubectl get pods -l app=benchmark-tool -n "${namespace}" -o jsonpath="{.items[0].metadata.name}")
  echo "Downloading JSON results from pod ${benchmark_pod}"
  # Remove the prompt dataset so it is not picked up as a result file.
  kubectl exec "${benchmark_pod}" -n "${namespace}" -- rm -f ShareGPT_V3_unfiltered_cleaned_split.json
  # Ensure the local output directory exists before copying.
  mkdir -p "${benchmark_output_dir}/results/json"
  for f in $(kubectl exec "${benchmark_pod}" -n "${namespace}" -- /bin/sh -c 'ls' | grep json); do
    echo "Downloading json file ${f}"
    kubectl cp -n "${namespace}" "${benchmark_pod}:${f}" "${benchmark_output_dir}/results/json/${f}"
  done
}

# Env vars to be passed when calling this script.
# The id of the benchmark. This is needed to identify what the benchmark is for.
# It decides the filepath to save the results, which is later used by the jupyter notebook to assign
# the benchmark_id as data labels for plotting.
benchmark_id=${benchmark_id:-"inference-extension"}
# run_id can be used to group different runs of the same benchmarks for comparison.
run_id=${run_id:-"default-run"}
namespace=${namespace:-"default"}
output_dir=${output_dir:-'output'}

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
benchmark_output_dir=${SCRIPT_DIR}/${output_dir}/${run_id}/${benchmark_id}

echo "Saving benchmark results to ${benchmark_output_dir}/results/json/"
download_benchmark_results
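
Since the script reads all of its settings from environment variables, a run against a non-default namespace could look like the following (the `benchmarking` namespace here is purely illustrative):

```bash
namespace='benchmarking' benchmark_id='inference-extension' ./download-benchmark-results.bash
```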
@@ -0,0 +1,60 @@
apiVersion: apps/v1
kind: Deployment

> *Review thread on this line (using a Deployment vs. a Job):*
>
> * @achandrasekar How would one start another run? Should we use a Job here instead, something that runs to completion?
> * I thought about this as well. A deployment is convenient in that it keeps the pods running so we can download the result files from the pod; otherwise we would need to set up some persistent storage such as S3 or GCS, and not every user has access to those. This also aligns with the user guide of the LPG tool.
> * You can give users the option to export the result to S3 or GCS in the job.
> * I think the pod/job/files stay around after it completes, so you should still be able to download the results?
> * I took the approach that requires minimal dependencies. Yes, using a persistent volume such as S3 works as well, but it requires additional configuration. We can add that option later.
> * You will need some persistent volume. I updated the download-benchmark-results.bash script to tear down the deployment after it downloads the results.

metadata:
  labels:
    app: benchmark-tool
  name: benchmark-tool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: benchmark-tool
  template:
    metadata:
      labels:
        app: benchmark-tool
    spec:
      containers:
      - image: 'us-docker.pkg.dev/cloud-tpu-images/inference/inference-benchmark@sha256:1c100b0cc949c7df7a2db814ae349c790f034b4b373aaad145e77e815e838438'
        imagePullPolicy: Always
        name: benchmark-tool
        command:
        - bash
        - -c
        - ./latency_throughput_curve.sh
        env:
        - name: IP
          value: '<gateway-ip>'
          # value: 'envoy-default-inference-gateway-6454a873.envoy-gateway-system.svc.cluster.local'
        - name: REQUEST_RATES
          value: '40,80,120,160,200'
        - name: BENCHMARK_TIME_SECONDS
          value: '60'
        - name: TOKENIZER
          value: 'meta-llama/Llama-2-7b-hf'
        - name: MODELS
          value: 'meta-llama/Llama-2-7b-hf'
        - name: BACKEND
          value: vllm
        - name: PORT
          value: "8081"
        - name: INPUT_LENGTH
          value: "1024"
        - name: OUTPUT_LENGTH
          value: '2048'
        - name: FILE_PREFIX
          value: benchmark
        - name: PROMPT_DATASET_FILE
          value: ShareGPT_V3_unfiltered_cleaned_split.json
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              key: token
              name: hf-token
        resources:
          limits:
            cpu: "2"
            memory: 20Gi
          requests:
            cpu: "2"
            memory: 20Gi
@@ -0,0 +1,59 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: benchmark-tool
  name: benchmark-tool
spec:
  replicas: 1
  selector:
    matchLabels:
      app: benchmark-tool
  template:
    metadata:
      labels:
        app: benchmark-tool
    spec:
      containers:
      - image: 'us-docker.pkg.dev/cloud-tpu-images/inference/inference-benchmark@sha256:1c100b0cc949c7df7a2db814ae349c790f034b4b373aaad145e77e815e838438'
        imagePullPolicy: Always
        name: benchmark-tool
        command:
        - bash
        - -c
        - ./latency_throughput_curve.sh
        env:
        - name: IP
          value: 'my-pool-service.default.svc.cluster.local'
        - name: REQUEST_RATES
          value: '40,80,120,160,200'
        - name: BENCHMARK_TIME_SECONDS
          value: '60'
        - name: TOKENIZER
          value: 'meta-llama/Llama-2-7b-hf'
        - name: MODELS
          value: 'meta-llama/Llama-2-7b-hf'
        - name: BACKEND
          value: vllm
        - name: PORT
          value: "8081"
        - name: INPUT_LENGTH
          value: "1024"
        - name: OUTPUT_LENGTH
          value: '2048'
        - name: FILE_PREFIX
          value: benchmark
        - name: PROMPT_DATASET_FILE
          value: ShareGPT_V3_unfiltered_cleaned_split.json
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              key: token
              name: hf-token
        resources:
          limits:
            cpu: "2"
            memory: 20Gi
          requests:
            cpu: "2"
            memory: 20Gi
@@ -0,0 +1,12 @@
apiVersion: v1
kind: Service
metadata:
  name: my-pool-service
spec:
  ports:
  - port: 8081
    protocol: TCP
    targetPort: 8000
  selector:
    app: my-pool
  type: LoadBalancer
@@ -0,0 +1,3 @@
pandas
numpy
matplotlib