Dynamic lora load/unload sidecar #31

Merged on Nov 18, 2024 (34 commits)
Changes from 21 commits
Commits
14e4b10
Dynamic lora load/unload sidecar
coolkp Oct 23, 2024
bcfee4a
Formatting
coolkp Oct 23, 2024
cb45fe2
Resolve README comments
coolkp Oct 30, 2024
62da988
Address comments on sidecar, store updates in memory, rename base field
coolkp Oct 30, 2024
56cffc2
Address comments in example deployment
coolkp Oct 30, 2024
5cbaeef
Address comments in example deployment
coolkp Oct 30, 2024
5a03f98
base model is optional
coolkp Oct 30, 2024
1af2df4
Check health of server before querying
coolkp Nov 5, 2024
5b51182
Check health of server before querying
coolkp Nov 5, 2024
cc1e686
Docstrings
coolkp Nov 5, 2024
926a71c
Mock health check in tests
coolkp Nov 5, 2024
cb3c9b2
Refactor configmap, switch to watchfiles to detect symbolic link targ…
coolkp Nov 7, 2024
3140610
Refactor configmap, switch to watchfiles to detect symbolic link targ…
coolkp Nov 7, 2024
65cea88
Modify unittests
coolkp Nov 8, 2024
8012ea3
Change example host and port to be explicit
coolkp Nov 8, 2024
ba00b85
Change example sidecar name
coolkp Nov 8, 2024
c8d9c10
Add warning about using subPath
coolkp Nov 8, 2024
828348d
Add screenshots
coolkp Nov 8, 2024
ec40820
Add screenshots
coolkp Nov 8, 2024
1aba325
Add testing results
coolkp Nov 9, 2024
b30051a
Add testing results
coolkp Nov 9, 2024
c5d2527
Add config validation
coolkp Nov 11, 2024
d0d01e1
Add config documentation
coolkp Nov 11, 2024
b4867b6
Add config documentation
coolkp Nov 11, 2024
e60b434
Add config validation
coolkp Nov 11, 2024
bea4068
Add config validation
coolkp Nov 11, 2024
100f636
Make reconciling non blocking
coolkp Nov 11, 2024
c24ff35
Move under tools
coolkp Nov 12, 2024
472b545
Move under tools
coolkp Nov 12, 2024
5354a47
Document usage of sidecar, available by default from 1.29
coolkp Nov 13, 2024
bc2ce32
Document usage of sidecar, available by default from 1.29
coolkp Nov 13, 2024
f82d8b2
Document usage of sidecar, available by default from 1.29
coolkp Nov 13, 2024
e01ec51
Update tools/dynamic-lora-sidecar/README.md
coolkp Nov 16, 2024
28779e7
Update tools/dynamic-lora-sidecar/README.md
coolkp Nov 16, 2024
1 change: 1 addition & 0 deletions examples/dynamic-lora-sidecar/.gitignore
@@ -0,0 +1 @@
sidecar/__pycache__/
16 changes: 16 additions & 0 deletions examples/dynamic-lora-sidecar/Dockerfile
@@ -0,0 +1,16 @@

FROM python:3.10-slim-buster

WORKDIR /dynamic-lora-reconciler

RUN python3 -m venv /opt/venv

ENV PATH="/opt/venv/bin:$PATH"

RUN pip install --upgrade pip
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY sidecar/sidecar.py .

CMD ["python", "sidecar.py"]
52 changes: 52 additions & 0 deletions examples/dynamic-lora-sidecar/README.md
@@ -0,0 +1,52 @@
# Dynamic LoRA Adapter Sidecar for vLLM

This directory contains the script for a sidecar container that dynamically manages LoRA adapters for a vLLM server running in the same Kubernetes pod, together with an example ConfigMap that declares the adapters. The sidecar reconciles the adapters loaded on the server against that ConfigMap.

## Overview

The sidecar continuously monitors a ConfigMap mounted as a YAML configuration file. This file defines the desired state of LoRA adapters, including:

- **Adapter ID:** Unique identifier for the adapter.
- **Source:** Path to the adapter's source files.
- **Base Model:** The base model to which the adapter should be applied.
- **toRemove:** (Optional) Indicates whether the adapter should be unloaded.

The sidecar uses the vLLM server's API to load or unload adapters based on the configuration. It also periodically reconciles the registered adapters on the vLLM server with the desired state defined in the ConfigMap, ensuring consistency.
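
The reconciliation described above can be sketched as a pure planning step followed by API calls. This is a minimal sketch, not the sidecar's actual code: `Adapter`, `plan_reconcile`, `load_request`, and `unload_request` are hypothetical names, and the rule that an adapter marked for removal is never loaded is an assumption. The endpoint paths and body fields follow vLLM's dynamic LoRA endpoints, which are only available when `VLLM_ALLOW_RUNTIME_LORA_UPDATING` is enabled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adapter:
    """Hypothetical in-memory form of one adapter entry from the config."""
    id: str
    source: str


def plan_reconcile(desired, removed, registered_ids):
    """Return (to_load, to_unload) so the server converges on the config.

    desired: adapters that should be loaded.
    removed: adapters that should not be loaded.
    registered_ids: adapter ids the server currently reports.
    """
    registered = set(registered_ids)
    removed_ids = {a.id for a in removed}
    # Assumption: removal wins if an id appears in both lists.
    to_load = [
        a for a in desired
        if a.id not in registered and a.id not in removed_ids
    ]
    # Only unload what the server actually has registered.
    to_unload = [a for a in removed if a.id in registered]
    return to_load, to_unload


def load_request(adapter):
    """Path and JSON body for loading one adapter."""
    return "/v1/load_lora_adapter", {
        "lora_name": adapter.id,
        "lora_path": adapter.source,
    }


def unload_request(adapter):
    """Path and JSON body for unloading one adapter."""
    return "/v1/unload_lora_adapter", {"lora_name": adapter.id}
```

Keeping the plan separate from the HTTP calls makes the decision logic easy to unit test without a running server; the caller would POST each returned body to the server's host and port.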

## Features

- **Dynamic Loading and Unloading:** Load and unload LoRA adapters without restarting the vLLM server.
- **Continuous Reconciliation:** Ensures the vLLM server's state matches the desired configuration.
- **ConfigMap Integration:** Leverages Kubernetes ConfigMaps for easy configuration management.
- **Easy Deployment:** Provides a sample deployment YAML for quick setup.

## Repository Contents

- **`sidecar.py`:** Python script for the sidecar container.
- **`Dockerfile`:** Dockerfile to build the sidecar image.
- **`configmap.yaml`:** Example ConfigMap YAML file.
- **`deployment.yaml`:** Example Kubernetes deployment YAML.

## Usage

1. **Build the Docker image:**
   ```bash
   docker build -t <your-image-name> .
   ```
2. **Create a ConfigMap:**
   ```bash
   kubectl create configmap name-of-your-configmap --from-file=your-file.yaml
   ```
3. **Mount the ConfigMap and configure the sidecar in your pod:**
   ```yaml
   volumeMounts: # DO NOT USE subPath
   - name: config-volume
     mountPath: /config
   ```
   Do not use `subPath`: a ConfigMap mounted through `subPath` does not receive updates, so changes to the ConfigMap would never reach the mounted file.

[deployment]: deployment.yaml
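
The `subPath` warning follows from how a ConfigMap volume is updated: the kubelet writes the new contents into a fresh directory and then atomically repoints a `..data` symlink at it, while the visible file is itself a symlink through `..data`. The sketch below imitates that layout in a temporary directory to show that a reader of the mounted path sees the new contents after the swap. The helper names `publish` and `read_config` and the version directory names are hypothetical; this simulates the mechanism and is not kubelet code.

```python
import os
import tempfile


def publish(volume_dir, version, filename, content):
    """Write a new versioned directory, then swap the ..data symlink to it."""
    version_dir = os.path.join(volume_dir, version)
    os.makedirs(version_dir)
    with open(os.path.join(version_dir, filename), "w") as f:
        f.write(content)

    # Create the new symlink under a temporary name, then rename it over
    # ..data. The rename replaces the old symlink in one atomic step.
    tmp_link = os.path.join(volume_dir, "..data_tmp")
    os.symlink(version, tmp_link)
    os.replace(tmp_link, os.path.join(volume_dir, "..data"))

    # The user-visible file is a symlink through ..data, created once.
    user_link = os.path.join(volume_dir, filename)
    if not os.path.islink(user_link):
        os.symlink(os.path.join("..data", filename), user_link)


def read_config(volume_dir, filename):
    """Read the file the way a consumer of the mounted volume would."""
    with open(os.path.join(volume_dir, filename)) as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as volume:
        publish(volume, "v1", "configmap.yaml", "adapters: [a]\n")
        print(os.path.realpath(os.path.join(volume, "configmap.yaml")))
        publish(volume, "v2", "configmap.yaml", "adapters: [a, b]\n")
        print(os.path.realpath(os.path.join(volume, "configmap.yaml")))
```

Because the file's own directory entry never changes and only the symlink target moves, a watcher has to observe the mounted directory (or resolve the symlink target) rather than wait for a write to the file itself. A `subPath` mount binds a single resolved file into the container, so the swap is never seen.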

## Screenshots & Testing
I tested the sidecar in my cluster using the deployment and ConfigMap in this repo. The screenshots below show the logs from the sidecar and the vLLM server. With the ConfigMap as specified, I verified that the adapters were loaded by querying `/v1/models` and checking the vLLM logs. I then changed the ConfigMap and confirmed the change on the vLLM server. Note: there is a slight lag between a ConfigMap update and the server reflecting it.
![lora-adapter-syncer](screenshots/lora-syncer-sidecar.png)
![config map change](screenshots/configmap-change.png)
![vllm-logs](screenshots/vllm-logs.png)
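
The `/v1/models` check described above can be scripted. The sketch below extracts adapter ids from an already parsed response, assuming the OpenAI-style list shape (`{"object": "list", "data": [{"id": ...}, ...]}`) in which the base model and each loaded adapter appear as separate entries. The function name `registered_adapter_ids` and the sample payload are illustrative, not captured output from the test cluster.

```python
def registered_adapter_ids(models_response, base_model):
    """Return the ids the server lists, excluding the base model itself."""
    return sorted(
        card["id"]
        for card in models_response.get("data", [])
        if card.get("id") and card["id"] != base_model
    )


if __name__ == "__main__":
    # Hypothetical payload in the shape assumed above.
    sample = {
        "object": "list",
        "data": [
            {"id": "meta-llama/Llama-2-7b-hf", "object": "model"},
            {"id": "sql-lora-v1", "object": "model"},
            {"id": "sql-lora-v3", "object": "model"},
        ],
    }
    print(registered_adapter_ids(sample, "meta-llama/Llama-2-7b-hf"))
```

Comparing this list with the ids in the ConfigMap before and after an edit gives a repeatable version of the manual check.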
129 changes: 129 additions & 0 deletions examples/dynamic-lora-sidecar/deployment.yaml
@@ -0,0 +1,129 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-server
  template:
    metadata:
      labels:
        app: llama-server
        ai.gke.io/model: LLaMA2_7B
        ai.gke.io/inference-server: vllm
        examples.ai.gke.io/source: model-garden
    spec:
      shareProcessNamespace: true
      containers:
      - name: inference-server
        image: vllm/vllm-openai:v0.6.3.post1
        resources:
          requests:
            cpu: 5
            memory: 20Gi
            ephemeral-storage: 40Gi
            nvidia.com/gpu: 1
          limits:
            cpu: 5
            memory: 20Gi
            ephemeral-storage: 40Gi
            nvidia.com/gpu: 1
        command: ["vllm", "serve"] # exec form: with "sh -c", only the first args entry would run
        args:
        - meta-llama/Llama-2-7b-hf
        - --host=0.0.0.0
        - --port=8000
        - --tensor-parallel-size=1
        - --swap-space=16
        - --gpu-memory-utilization=0.95
        - --max-model-len=2048
        - --max-num-batched-tokens=4096
        - --disable-log-stats
        - --enable-lora
        - --max-loras=5
        env:
        - name: DEPLOY_SOURCE
          value: UI_NATIVE_MODEL
        - name: MODEL_ID
          value: "Llama2-7B"
        - name: AIP_STORAGE_URI
          value: "gs://vertex-model-garden-public-us/llama2/llama2-7b-hf"
        - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
          value: "true"
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token # The name of your Kubernetes Secret
              key: token # The specific key within the Secret
        - name: DYNAMIC_LORA_ROLLOUT_CONFIG
          value: "/config/configmap.yaml"
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      initContainers:
      - name: lora-adapter-syncer
        tty: true
        stdin: true
        image: us-docker.pkg.dev/kunjanp-gke-dev-2/lora-sidecar/sidecar:latest
        restartPolicy: Always
        imagePullPolicy: Always
        env:
        - name: DYNAMIC_LORA_ROLLOUT_CONFIG
          value: "/config/configmap.yaml"
        volumeMounts: # DO NOT USE subPath
        - name: config-volume
          mountPath: /config
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      - name: config-volume
        configMap:
          name: dynamic-lora-config
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4

---
apiVersion: v1
kind: Service
metadata:
  name: llama-service
spec:
  selector:
    app: llama-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000

---

apiVersion: v1
kind: ConfigMap
metadata:
  name: dynamic-lora-config
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      host: modelServerHost
      name: sql-loras-llama
      port: modelServerPort
      ensureExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: sql-lora-v1
          source: yard1/llama-2-7b-sql-lora-test
        - base-model: meta-llama/Llama-2-7b-hf
          id: sql-lora-v3
          source: yard1/llama-2-7b-sql-lora-test
        - base-model: meta-llama/Llama-2-7b-hf
          id: sql-lora-v4
          source: yard1/llama-2-7b-sql-lora-test
      ensureNotExist:
        models:
        - base-model: meta-llama/Llama-2-7b-hf
          id: sql-lora-v2
          source: yard1/llama-2-7b-sql-lora-test
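
The ConfigMap above nests two adapter lists, `ensureExist` and `ensureNotExist`, under `vLLMLoRAConfig`. A structural check on the parsed YAML might look like the sketch below. It is an illustration only, not the validation this PR adds: the function name `validate_config`, the error wording, and the specific rules (`id` required everywhere, `source` required only under `ensureExist`, `base-model` optional, no id listed twice) are assumptions.

```python
def validate_config(config):
    """Return a list of human-readable problems; an empty list means valid."""
    root = config.get("vLLMLoRAConfig") if isinstance(config, dict) else None
    if not isinstance(root, dict):
        return ["missing top-level 'vLLMLoRAConfig' mapping"]

    errors = []
    seen = {}  # adapter id -> location where it was first declared
    required = {"ensureExist": ("id", "source"), "ensureNotExist": ("id",)}
    for section, keys in required.items():
        models = (root.get(section) or {}).get("models") or []
        for index, model in enumerate(models):
            where = f"{section}.models[{index}]"
            if not isinstance(model, dict):
                errors.append(f"{where}: expected a mapping")
                continue
            for key in keys:
                if not model.get(key):
                    errors.append(f"{where}: missing '{key}'")
            adapter_id = model.get("id")
            if adapter_id in seen:
                errors.append(
                    f"{where}: id '{adapter_id}' already declared at {seen[adapter_id]}"
                )
            elif adapter_id:
                seen[adapter_id] = where
    return errors
```

The function takes the dictionary produced by a YAML parser (for example `yaml.safe_load` from the pinned `pyyaml`), so it can be tested without touching the filesystem.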
4 changes: 4 additions & 0 deletions examples/dynamic-lora-sidecar/requirements.txt
@@ -0,0 +1,4 @@
aiohttp==3.10.10
pyyaml==6.0.2
requests==2.32.3
watchfiles==0.24.0