Simplify POC installation #8

Merged: 1 commit, Sep 29, 2024
75 changes: 31 additions & 44 deletions examples/poc/README.md
@@ -1,68 +1,55 @@
# Envoy Ext Proc Gateway with LoRA Integration

This project sets up an Envoy gateway to handle gRPC calls with integration of LoRA (Low-Rank Adaptation). The configuration aims to manage gRPC traffic through Envoy's external processing and custom routing based on headers and load balancing rules. The setup includes Kubernetes services and deployments for both the gRPC server and the vllm-lora application.
This project sets up an Envoy gateway with a custom external processing server which implements advanced routing logic tailored for LoRA (Low-Rank Adaptation) adapters. The routing algorithm selects a backend based on the model specified in the request (using the OpenAI API format) and ensures efficient load balancing based on model server metrics.
Review comment (Member): one more extra space between which and implements


![Envoy Ext Proc gateway architecture](./doc/envoy-gateway-bootstrap.png)

## Requirements
- A vLLM-based deployment (using the custom image provided below) with LoRA adapters
- Kubernetes cluster
- Envoy Gateway v1.1 installed on your cluster: https://gateway.envoyproxy.io/v1.1/tasks/quickstart/
- `kubectl` command-line tool
- Go (for local development)

## vLLM
***This PoC uses a modified vLLM fork; the public image of the fork is `ghcr.io/tomatillo-and-multiverse/vllm:demo`.***

The fork is here: https://github.com/kaushikmitr/vllm.

The changes relative to standard vLLM are:
- Active/Registered LoRA adapters are returned as a response header (used for lora-aware routing)
- Queue size is returned as a response header
- Active/Registered LoRA adapters are emitted as metrics (for out-of-band scraping during low traffic periods)
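
For example, the response headers can be inspected by sending a request directly to a vLLM pod. This is only a sketch: the pod name below is an example, and the exact header names are defined by the fork, so check its responses rather than relying on specific names.

```bash
# In one terminal: port-forward a vLLM pod (pod name is an example; use one of yours).
kubectl port-forward pod/vllm-78665f78c4-h4kx4 8000:8000

# In another terminal: send a completion request and print the response headers (-i).
curl -i localhost:8000/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "tweet-summary",
  "prompt": "Hello",
  "max_tokens": 10
}'
```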


## Overview

This project contains the necessary configurations and code to set up and deploy a service using Kubernetes, Envoy, and Go. The service involves routing based on the model specified (using Open AI API format), collecting metrics, and ensuring efficient load balancing.

![Envoy Ext Proc gateway architecture](./envoy-gateway-bootstrap.png)

- A vLLM-based deployment using a custom fork, with LoRA adapters. ***This PoC uses a modified vLLM [fork](https://github.com/kaushikmitr/vllm); the public image of the fork is `ghcr.io/tomatillo-and-multiverse/vllm:demo`.*** A sample deployment is provided under `./manifests/samples/vllm-lora-deployment.yaml`.

## Quickstart

### Steps
1. **Deploy Sample vLLM Application**
NOTE: Create a HuggingFace API token and store it in a secret named `hf-token` with the key `hf_api_token`. This is configured in the `HUGGING_FACE_HUB_TOKEN` and `HF_TOKEN` environment variables in `./manifests/samples/vllm-lora-deployment.yaml`.
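One way to create that secret (a sketch; substitute your actual token value):
```bash
# Secret name and key must match what the deployment manifest expects.
kubectl create secret generic hf-token \
  --from-literal=hf_api_token=<your-huggingface-token>
```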

1. **Apply Kubernetes Manifests**
```bash
cd manifests
kubectl apply -f ext_proc.yaml
kubectl apply -f vllm/vllm-lora-service.yaml
kubectl apply -f vllm/vllm-lora-deployment.yaml
kubectl apply -f ./manifests/samples/vllm-lora-deployment.yaml
kubectl apply -f ./manifests/samples/vllm-lora-service.yaml
```
2. **Install GatewayClass with Ext Proc**
A custom GatewayClass `llm-gateway`, configured with the LLM routing ext proc, will be installed into the `llm-gateway` namespace. It is configured to listen on port 8081 for traffic through ext-proc (in addition to the default 8080); see the `EnvoyProxy` configuration in `installation.yaml`. When you create Gateways, make sure the `llm-gateway` GatewayClass is used.

2. **Update `ext_proc.yaml`**
- Ensure the `ext_proc.yaml` is updated with the pod names and internal IP addresses of the vLLM replicas. This step is crucial for the correct routing of requests based on headers.
NOTE: Ensure the `llm-route-ext-proc` deployment is updated with the pod names and internal IP addresses of the vLLM replicas. This step is crucial for correct routing of requests based on headers. It won't be needed once the ext proc reads the pods dynamically.
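One way to look up the pod names and IPs (a sketch, assuming the vLLM pods carry the `app: vllm` label as in the sample manifests):
```bash
# Print "<pod name>\t<pod IP>" for each vLLM replica.
kubectl get pods -l app=vllm \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.podIP}{"\n"}{end}'
```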

2. **Update and apply `gateway.yaml`**
- Ensure the `gateway.yaml` is updated with the internal IP addresses of the ExtProc service. This step is also crucial for the correct routing of requests based on headers.
```bash
cd manifests
kubectl apply -f gateway.yaml
```
```bash
kubectl apply -f ./manifests/installation.yaml
```
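As an optional sanity check, you can confirm the GatewayClass was created:
```bash
kubectl get gatewayclass llm-gateway
```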
3. **Deploy Gateway**

```bash
kubectl apply -f ./manifests/samples/gateway.yaml
```

### Monitoring and Metrics

- The Go application collects metrics and saves the latest response headers in memory.
- Ensure Envoy is configured to route based on the metrics collected from the `/metric` endpoint of different service pods.

## Contributing
4. **Try it out**
Wait until the gateway is ready.
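One way to check readiness is to wait for the Gateway's `Programmed` condition (a sketch using standard Gateway API conditions):
```bash
kubectl wait --for=condition=Programmed gateway/llm-gateway --timeout=120s
```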
```bash
IP=$(kubectl get gateway/llm-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=8081
```

Review comment: Isn't this wrong? llm-instance-gw is listening on 8080. Am I wrong?

Reply (Contributor, Author): Good question! Actually, in the PoC setup Envoy is configured with an additional 8081 port for ext proc traffic. Updated the README.

Review comment: Oh yes, you're right actually. Sorry for the confusion.

```bash
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
"model": "tweet-summary",
"prompt": "Write as if you were a critic: San Francisco",
"max_tokens": 100,
"temperature": 0
}'
```

1. Fork the repository.
2. Create a new branch.
3. Make your changes.
4. Open a pull request.

## License

This project is licensed under the MIT License.

---
68 changes: 0 additions & 68 deletions examples/poc/manifests/ext-proc.yaml

This file was deleted.

@@ -1,16 +1,18 @@
apiVersion: v1
kind: Namespace
metadata:
name: llm-gateway

---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
name: custom-proxy-config
namespace: envoy-gateway-system
name: llm-route-envoy-config
namespace: llm-gateway
spec:
provider:
type: Kubernetes
kubernetes:
envoyDeployment:
container:
image: envoyproxy/envoy:v1.31-latest
envoyService:
patch:
type: StrategicMerge
@@ -78,7 +80,7 @@ spec:
dns_lookup_family: V4_ONLY
- name: ext_proc_cluster
connect_timeout: 1000s
type: STATIC
type: LOGICAL_DNS
http2_protocol_options: {}
lb_policy: ROUND_ROBIN
load_assignment:
@@ -88,28 +90,66 @@
- endpoint:
address:
socket_address:
address: 34.118.231.147
address: llm-route-ext-proc.llm-gateway.svc.cluster.local
port_value: 9002
---
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
name: inference-gateway
name: llm-gateway
spec:
controllerName: gateway.envoyproxy.io/gatewayclass-controller
parametersRef:
group: gateway.envoyproxy.io
kind: EnvoyProxy
name: custom-proxy-config
namespace: envoy-gateway-system
name: llm-route-envoy-config
namespace: llm-gateway

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-route-ext-proc
namespace: llm-gateway
labels:
app: llm-route-ext-proc
spec:
replicas: 1
selector:
matchLabels:
app: llm-route-ext-proc
template:
metadata:
labels:
app: llm-route-ext-proc
spec:
containers:
- name: llm-route-ext-proc
image: ghcr.io/tomatillo-and-multiverse/ext-proc:demo
args:
#TODO: specify label selector and dynamically update pods
- -pods
- "vllm-78665f78c4-h4kx4,vllm-78665f78c4-hnz84"
- -podIPs
- "10.24.11.6:8000,10.24.5.7:8000"
- -enable-fairness
- "false"
ports:
- containerPort: 9002
- name: curl
image: curlimages/curl
command: ["sleep", "3600"]
---
apiVersion: v1
kind: Service
metadata:
name: inference-gateway
name: llm-route-ext-proc
namespace: llm-gateway
spec:
gatewayClassName: inference-gateway
listeners:
- name: http
protocol: HTTP
port: 8080
selector:
app: llm-route-ext-proc
ports:
- protocol: TCP
port: 9002
targetPort: 9002
type: ClusterIP
12 changes: 12 additions & 0 deletions examples/poc/manifests/samples/gateway.yaml
@@ -0,0 +1,12 @@

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: llm-gateway
spec:
gatewayClassName: llm-gateway
listeners:
- name: http
protocol: HTTP
port: 8080
@@ -4,7 +4,6 @@ metadata:
name: vllm-lora
namespace: default
spec:
clusterIP: None
selector:
app: vllm
ports: