
Commit 14e4b10

Dynamic LoRA load/unload sidecar
1 parent 18bc3a2 commit 14e4b10

File tree

7 files changed: +560 -0 lines changed
examples/dynamic-lora-sidecar/Dockerfile

+16

@@ -0,0 +1,16 @@

FROM python:3.10-slim-buster

WORKDIR /dynamic-lora-reconciler

RUN python3 -m venv /opt/venv

ENV PATH="/opt/venv/bin:$PATH"

RUN pip install --upgrade pip
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY sidecar/sidecar.py .

CMD ["python", "sidecar.py"]

examples/dynamic-lora-sidecar/README

+98
@@ -0,0 +1,98 @@
# Dynamic LoRA Adapter Sidecar for vLLM

This directory contains the script for a sidecar container that dynamically manages LoRA adapters for a vLLM server running in the same Kubernetes pod, reconciling the server against a ConfigMap that lists the desired LoRA adapters.

## Overview

The sidecar continuously monitors a ConfigMap mounted as a YAML configuration file. This file defines the desired state of LoRA adapters, including:

- **Adapter ID:** Unique identifier for the adapter.
- **Source:** Path to the adapter's source files.
- **Base Model:** The base model to which the adapter should be applied.
- **toRemove:** (Optional) Indicates whether the adapter should be unloaded.
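
For example, a single adapter entry in the mounted file follows the same schema as the `configmap.yaml` in this directory:

```yaml
- base-model: meta-llama/Llama-2-7b-hf
  id: sql-lora-v1
  source: yard1/llama-2-7b-sql-lora-test
  toRemove: false
```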

The sidecar uses the vLLM server's API to load or unload adapters based on the configuration. It also periodically reconciles the adapters registered on the vLLM server with the desired state defined in the ConfigMap, ensuring consistency.
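
vLLM exposes runtime load/unload endpoints when `VLLM_ALLOW_RUNTIME_LORA_UPDATING=true` is set on the vLLM container (as in the deployment below). A minimal sketch of the calls the sidecar issues, using illustrative helper names rather than the actual `sidecar.py` API:

```python
import requests

BASE_URL = "http://localhost:8000"  # vLLM server in the same pod

def load_adapter(adapter_id: str, source: str) -> None:
    """Register a LoRA adapter with the running vLLM server."""
    resp = requests.post(
        f"{BASE_URL}/v1/load_lora_adapter",
        json={"lora_name": adapter_id, "lora_path": source},
    )
    resp.raise_for_status()

def unload_adapter(adapter_id: str) -> None:
    """Remove a previously registered LoRA adapter."""
    resp = requests.post(
        f"{BASE_URL}/v1/unload_lora_adapter",
        json={"lora_name": adapter_id},
    )
    resp.raise_for_status()
```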

## Features

- **Dynamic Loading and Unloading:** Load and unload LoRA adapters without restarting the vLLM server.
- **Continuous Reconciliation:** Ensures the vLLM server's state matches the desired configuration.
- **ConfigMap Integration:** Leverages Kubernetes ConfigMaps for easy configuration management.
- **Easy Deployment:** Provides a sample deployment YAML for quick setup.

## Repository Contents

- **`sidecar.py`:** Python script for the sidecar container.
- **`Dockerfile`:** Dockerfile to build the sidecar image.
- **`configmap.yaml`:** Example ConfigMap YAML file.
- **`deployment.yaml`:** Example Kubernetes deployment YAML.

## Usage

1. **Build the Docker image:**

   ```bash
   docker build -t <your-image-name> .
   ```
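
   If the pod pulls the sidecar image from a registry (as the sample deployment does), tag and push it there as well; the registry path below is a placeholder:

   ```bash
   docker tag <your-image-name> <your-registry>/<your-image-name>:latest
   docker push <your-registry>/<your-image-name>:latest
   ```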
2. **Create a ConfigMap:**

   ```bash
   kubectl create configmap name-of-your-configmap --from-file=your-file.yaml
   ```
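
   To change the set of desired adapters later, the ConfigMap can be regenerated in place with a standard kubectl idiom; the sidecar picks up the new desired state on a subsequent reconcile:

   ```bash
   kubectl create configmap name-of-your-configmap --from-file=your-file.yaml \
     --dry-run=client -o yaml | kubectl apply -f -
   ```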
3. **Mount the ConfigMap and configure the sidecar in your pod:**

   ```yaml
   spec:
     shareProcessNamespace: true
     containers:
     - name: inference-server
       image: vllm/vllm-openai:v0.6.3.post1
       resources:
         requests:
           cpu: 5
           memory: 20Gi
           ephemeral-storage: 40Gi
           nvidia.com/gpu: 1
         limits:
           cpu: 5
           memory: 20Gi
           ephemeral-storage: 40Gi
           nvidia.com/gpu: 1
       command: ["/bin/sh", "-c"]
       args:
       - >-
         vllm serve meta-llama/Llama-2-7b-hf
         --host=0.0.0.0
         --port=8000
         --tensor-parallel-size=1
         --swap-space=16
         --gpu-memory-utilization=0.95
         --max-model-len=2048
         --max-num-batched-tokens=4096
         --disable-log-stats
         --enable-lora
         --max-loras=5
       env:
       - name: DEPLOY_SOURCE
         value: UI_NATIVE_MODEL
       - name: MODEL_ID
         value: "Llama2-7B"
       - name: AIP_STORAGE_URI
         value: "gs://vertex-model-garden-public-us/llama2/llama2-7b-hf"
       - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
         value: "true"
       volumeMounts:
       - mountPath: /dev/shm
         name: dshm
     initContainers:
     - name: configmap-reader-1
       image: us-docker.pkg.dev/kunjanp-gke-dev-2/lora-sidecar/sidecar:latest
       restartPolicy: Always
       env:
       - name: DYNAMIC_LORA_ROLLOUT_CONFIG
         value: "/config/configmap.yaml"
       volumeMounts:
       - name: config-volume
         mountPath: /config/configmap.yaml
         subPath: configmap.yaml
     volumes:
     - name: dshm
       emptyDir:
         medium: Memory
     - name: config-volume
       configMap:
         name: dynamic-lora-config
   ```
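
Once the pod is running, you can verify which adapters are registered by querying vLLM's OpenAI-compatible model list; loaded LoRA adapters appear alongside the base model (the pod name below is a placeholder):

```bash
kubectl port-forward pod/<your-pod-name> 8000:8000 &
curl http://localhost:8000/v1/models
```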
examples/dynamic-lora-sidecar/deployment.yaml

+91
@@ -0,0 +1,91 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-server
  template:
    metadata:
      labels:
        app: llama-server
        ai.gke.io/model: LLaMA2_7B
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: model-garden
    spec:
      shareProcessNamespace: true
      containers:
      - name: inference-server
        image: vllm/vllm-openai:v0.6.3.post1
        resources:
          requests:
            cpu: 5
            memory: 20Gi
            ephemeral-storage: 40Gi
            nvidia.com/gpu: 1
          limits:
            cpu: 5
            memory: 20Gi
            ephemeral-storage: 40Gi
            nvidia.com/gpu: 1
        command: ["/bin/sh", "-c"]
        args:
        - >-
          vllm serve meta-llama/Llama-2-7b-hf
          --host=0.0.0.0
          --port=8000
          --tensor-parallel-size=1
          --swap-space=16
          --gpu-memory-utilization=0.95
          --max-model-len=2048
          --max-num-batched-tokens=4096
          --disable-log-stats
          --enable-lora
          --max-loras=5
        env:
        - name: DEPLOY_SOURCE
          value: UI_NATIVE_MODEL
        - name: MODEL_ID
          value: "Llama2-7B"
        - name: AIP_STORAGE_URI
          value: "gs://vertex-model-garden-public-us/llama2/llama2-7b-hf"
        - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
          value: "true"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      initContainers:
      - name: configmap-reader-1
        image: us-docker.pkg.dev/kunjanp-gke-dev-2/lora-sidecar/sidecar:latest
        restartPolicy: Always
        env:
        - name: DYNAMIC_LORA_ROLLOUT_CONFIG
          value: "/config/configmap.yaml"
        volumeMounts:
        - name: config-volume
          mountPath: /config/configmap.yaml
          subPath: configmap.yaml
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: config-volume
        configMap:
          name: dynamic-lora-config
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-nodepool: dynamic-lora

---
apiVersion: v1
kind: Service
metadata:
  name: llama-service
spec:
  selector:
    app: llama-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
examples/dynamic-lora-sidecar/requirements.txt

+4
@@ -0,0 +1,4 @@
aiohttp==3.10.10
pyyaml==6.0.2
requests==2.32.3
watchdog==5.0.3
examples/dynamic-lora-sidecar/configmap.yaml

+23
@@ -0,0 +1,23 @@
deployment:
  host: localhost
  models:
  - base-model: meta-llama/Llama-2-7b-hf
    id: sql-lora-v1
    source: yard1/llama-2-7b-sql-lora-test
    status:
      errors:
      - ''
      operation: load
      timestamp: 2024-10-23 15:43:07 UTC+0000
    toRemove: false
  - base-model: meta-llama/Llama-2-7b-hf
    id: sql-lora-v2
    source: yard1/llama-2-7b-sql-lora-test
    status:
      errors:
      - already unloaded
      operation: unload
      timestamp: 2024-10-23 15:43:07 UTC+0000
    toRemove: true
  name: sql-loras-llama
  port: '8000'
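
A rough sketch of the reconcile pass this schema implies: read the file with PyYAML and route each entry to a load or unload call. `load_adapter` and `unload_adapter` are the hypothetical helpers from the README sketch above, not the actual `sidecar.py` functions:

```python
import os
import yaml

# Path injected via the DYNAMIC_LORA_ROLLOUT_CONFIG env var in the deployment.
CONFIG_PATH = os.environ.get("DYNAMIC_LORA_ROLLOUT_CONFIG", "/config/configmap.yaml")

def reconcile() -> None:
    """One reconcile pass: load desired adapters, unload those marked toRemove."""
    with open(CONFIG_PATH) as f:
        config = yaml.safe_load(f)
    for model in config["deployment"]["models"]:
        if model.get("toRemove"):
            unload_adapter(model["id"])
        else:
            load_adapter(model["id"], model["source"])
```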
