Prefix Cache Aware Proposal #498

Closed
Tracked by #493
kfswain opened this issue Mar 14, 2025 · 10 comments
Labels: triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

@kfswain (Collaborator) commented Mar 14, 2025

We should have a proposal for how we will handle prefix-cache-aware routing, and have it accepted by the community, before beginning implementation.

@liu-cong (Contributor) commented Mar 20, 2025

I looked into this, and here is a write-up on the problem and a proposed solution. Please feel free to comment!

Background

Prefix caching is a well-known technique in LLM inference that avoids redundant tensor computation for prompts sharing the same prefix, and it is available in many model servers as well as inference frameworks.

vLLM has an automatic prefix caching (APC) feature, which caches prefixes in accelerator HBM and uses an LRU cache eviction strategy.

The vLLM production stack is exploring a prefix-aware router to exploit vLLM's APC feature. The WIP PR implements two strategies: HashTrie-based matching and SimHash-based consistent hashing. The HashTrie solution shows a better cache hit rate.

SGLang has a cache-aware routing strategy that builds a radix tree from request history.

AIBrix uses a distributed prefix cache pool and a customized vLLM that supports loading cache from the pool. At request routing time, its Prefix Router maximizes the prefix cache hit rate on model server HBM. It currently implements a hash-based (similar to vLLM) and a radix-tree-based (similar to SGLang) matching strategy.

KubeAI uses a Consistent Hashing with Bounded Loads (CHWBL) algorithm, which hashes request prefixes up to a configurable length (and therefore loses some accuracy) and uses an "overflow" strategy when a server is overloaded.

Goals

Implement a prefix-aware routing algorithm in the EPP to maximize the prefix cache hit rate on the model servers.

Non-goals

  • Change how the model server manages prefix caches, e.g., adding DRAM cache support or remote cache support.

Design Options

1. Session affinity

Session affinity is based on client attributes such as IP address. It works well for use cases such as multi-turn conversations, where requests from the same client tend to share the same prefixes. This, of course, depends heavily on the nature of the use case.

Pros:

  • Easy to implement/understand

Cons:

  • Limited use case
  • Does not exploit prefix cache between different clients
  • Using the client IP isn't always reliable; the client will likely need to provide "session info" for good affinity

2. Prefix affinity consistent hashing

This goes a step beyond session affinity by using a prefix-aware hash function to route requests with similar prefixes to the same or similar servers. A naive hash function could simply hash the first N characters/tokens of the request, so all requests with the same first N characters/tokens are routed to the same server. The vLLM production stack is exploring this strategy using SimHash, and preliminary experiments showed mixed results. KubeAI uses a simple strategy that only hashes the request prefix up to a configurable prefixCharLength. Its effectiveness is likely highly dependent on the input length distribution.
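As a hedged illustration (not part of the proposal), the naive variant might look like the following Go sketch, where prefixRing, pick, and prefixCharLength are assumed names and only the first prefixCharLength characters of the prompt determine placement on the ring:

import (
    "hash/fnv"
    "sort"
)

// prefixRing is a toy consistent-hash ring keyed by server name.
type prefixRing struct {
    hashes  []uint64          // sorted ring points
    servers map[uint64]string // ring point -> server name
}

func newPrefixRing(servers []string) *prefixRing {
    r := &prefixRing{servers: map[uint64]string{}}
    for _, s := range servers {
        h := hash64(s)
        r.hashes = append(r.hashes, h)
        r.servers[h] = s
    }
    sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
    return r
}

// pick hashes only the first prefixCharLength characters of the prompt, so prompts
// sharing that prefix land on the same server.
func (r *prefixRing) pick(prompt string, prefixCharLength int) string {
    if len(prompt) > prefixCharLength {
        prompt = prompt[:prefixCharLength]
    }
    h := hash64(prompt)
    i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
    if i == len(r.hashes) {
        i = 0 // wrap around the ring
    }
    return r.servers[r.hashes[i]]
}

func hash64(s string) uint64 {
    f := fnv.New64a()
    f.Write([]byte(s))
    return f.Sum64()
}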

Pros:

  • (Compared to session affinity) Is aware of prefix and not limited to per-client affinity
  • Small memory overhead (just need to store the ring of the servers)

Cons:

  • Highly depends on the effectiveness of the prefix aware hash function.
  • Consistent hashing can be challenging to reason about.

3. Approximate prefix cache on the router

This builds on the intuition that if requestA = prefix + XX was routed to server 1, then routing requestB = prefix + YY to the same server will likely hit its prefix cache. Therefore the central router can build an approximate lookup cache of the prefix caches on all the backend servers by mimicking the model server's cache eviction strategy (e.g., LRU).

Pros:

  • Easy to explain (compared to hashing) and likely more effective than the hashing strategy.

Cons:

  • Relies on knowledge of the cache eviction strategy of the model server, and may need careful tuning for different environments (e.g., model servers with different total KV cache space may have different cache eviction characteristics).
  • Complexity in managing cache state (eviction, memory limit).
  • An in-memory cache is preferred for high performance. However, that means the cache needs to be rebuilt on restarts. Moreover, cache hit performance decreases with multiple active EPP replicas.

4. Accurate prefix cache on the router

If the router knows which prefixes are currently cached on each model server replica, it can make the optimal decision. A potential solution is to have the model server (or a sidecar) report the KV cache indexes to the router.

Pros:

  • Best cache hit rate

Cons:

  • Requires adding a sidecar and its integration with the model server
  • May face scalability concerns with a large number of model server replicas and a large number of prefix cache entries

How does prefix cache affinity routing work with LoRA affinity and load-aware routing?

Prefix caching needs to be LoRA-aware, as different adapters don't share the same KV cache. Therefore, the prefix cache affinity algorithm should be applied after LoRA affinity to avoid conflicts. By doing so, requests will naturally be colocated first by LoRA adapter and then by prefix match.

Prefix affinity also needs to be aware of server load, otherwise we will create hot spots. To prevent this, a two-step process can be used (a minimal sketch follows the list):

  1. Only use prefix affinity if the expected gain is high enough. This can be done with a threshold, either on the prefix matching length, or on the matching ratio of the entire request, or a combination of both. If the expected gain is below the threshold (e.g., only 10 tokens match), we ignore prefix affinity and fall back to the current load-based algorithm.
  2. Similar to the queue depth threshold for LoRA affinity, we define one for prefix affinity. This prevents prefix affinity from overloading a server indefinitely.
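A minimal sketch of this two-step gating, with illustrative names and thresholds (usePrefixAffinity, minMatchedChunks, minMatchedRatio, and queueDepthThreshold are assumptions, not the actual EPP code):

// usePrefixAffinity returns true only when the expected prefix-cache gain is large
// enough (step 1) and the candidate server is not already overloaded (step 2).
func usePrefixAffinity(matchedChunks, totalChunks, queueDepth int) bool {
    const (
        minMatchedChunks    = 2   // step 1: ignore tiny prefix matches
        minMatchedRatio     = 0.5 // step 1: or require a large enough match ratio
        queueDepthThreshold = 5   // step 2: don't pile more requests onto a hot server
    )
    if totalChunks == 0 {
        return false
    }
    ratio := float64(matchedChunks) / float64(totalChunks)
    if matchedChunks < minMatchedChunks && ratio < minMatchedRatio {
        return false // expected gain too small; fall back to load-based routing
    }
    return queueDepth < queueDepthThreshold
}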

Proposal

Preferred Approach

Implement an approximate prefix cache lookup on the EPP, similar to SGLang or AIBrix, potentially using a vLLM-like strategy, i.e., prefix block hashes and an LRU cache eviction strategy.

A request is broken down into N chunks of the same number of characters (we don’t necessarily need to tokenize). For each chunk we will calculate a hash based on the content of the chunk + hash of the prefix: hash(chunk i) = hash(chunk i content + hash(chunk i-1)).

When we route a request r1 with N chunks to a server s1, we update the approximate cache lookup table like so:

hash(chunk 1): append s1
hash(chunk 2): append s1
…
hash(chunk N): append s1

This means all these N chunks are cached on server s1.

When the EPP receives a new request r2, we calculate its chunk hashes and look up the table to find the server with the longest prefix match.

Each entry in the table needs a lastUpdate time to allow LRU cache eviction.

Alternative Approach

The key to prefix-aware consistent hashing is finding a good hash algorithm. The vLLM production stack is exploring SimHash, an algorithm for detecting similarity between sets, which is not quite the same as prefix matching. KubeAI uses a simple hash that only considers the request prefix up to a certain limit. I propose an improved algorithm that breaks the input into chunks and gives higher weight to chunks at the beginning, so that two inputs with a longer shared prefix end up with closer hash values. However, this needs careful tuning and benchmarking to validate.

@ahg-g (Contributor) commented Mar 22, 2025

+1 to the preferred approach; this seems like a good first attempt that would provide tangible benefits.

@liu-cong (Contributor)

Implementation Plan

PrefixMatcher

Conceptually, the goal is to find the maximum prefix match of the input string. Instead of matching characters, we divide the string into chunks of size L to improve efficiency; the loss of accuracy is acceptable as long as we pick a reasonable L. Instead of storing the chunk characters, we calculate a hash value to reduce the storage footprint, using a high-performance hash function such as xxhash.

For each chunk we will calculate a hash based on the content of the chunk + the hash of the previous chunk: hash(chunk i) = hash(chunk i content + hash(chunk i-1)). This has the nice property that if hash(request1, chunk i) == hash(request2, chunk i), then we can claim the two requests match prefixes at least up to chunk i.

Caveat: Since different targetModels cannot share the same prefix cache, we will always hash the targetModel into the first chunk.
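A minimal sketch of this chained chunk hashing (illustrative only; hashChunks is an assumed name, BlockHash refers to the type defined later in this plan, and github.com/cespare/xxhash/v2 stands in for whichever high-performance hash is chosen):

import (
    "encoding/binary"

    "github.com/cespare/xxhash/v2"
)

// hashChunks splits the prompt into fixed-size chunks and chains each chunk's hash
// with the hash of the previous chunk: hash(i) = hash(chunk i content + hash(i-1)).
// The targetModel seeds the chain so different models never share cache entries.
func hashChunks(targetModel, prompt string, cacheChunkSize int) []BlockHash {
    var hashes []BlockHash
    prev := xxhash.Sum64String(targetModel)
    for start := 0; start+cacheChunkSize <= len(prompt); start += cacheChunkSize {
        h := xxhash.New()
        h.WriteString(prompt[start : start+cacheChunkSize])
        var prevBytes [8]byte
        binary.LittleEndian.PutUint64(prevBytes[:], prev)
        h.Write(prevBytes[:]) // chain in the previous chunk's hash
        prev = h.Sum64()
        hashes = append(hashes, BlockHash(prev))
    }
    return hashes
}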

Longest Prefix Matching

For an incoming request r with N chunks, we will implement a greedy search strategy: search the cache lookup table for the Nth chunk, then the (N-1)th, and so on. Note that we can follow up with a binary search strategy to improve search efficiency. A sketch of the greedy lookup is shown below.
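A minimal sketch of the greedy lookup against the table defined later in this plan (longestMatch is an assumed name):

// longestMatch walks the request's chunk hashes from the last one backwards and
// returns the servers recorded for the longest matching prefix, plus the match length.
func (t *PrefixCacheLookupTable) longestMatch(hashes []BlockHash) (CachedServers, int) {
    for i := len(hashes) - 1; i >= 0; i-- {
        if servers, ok := t.table[hashes[i]]; ok && len(servers) > 0 {
            return servers, i + 1 // the first i+1 chunks are cached on these servers
        }
    }
    return nil, 0
}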

Update Cache

When we route a request r1 with N chunks to a server s1, we update the approximate cache lookup table like so:

hash(r1, chunk 1): append s1
hash(r1, chunk 2): append s1
…
hash(r1, chunk N): append s1

This means all these N chunks are cached on server s1.

Concretely, in Go, the cache lookup table is defined like so:

// Assumes "time" and k8s.io/apimachinery/pkg/types are imported.
type PrefixCacheLookupTable struct {
    table map[BlockHash]CachedServers
    // Keep track of the total size of the table to avoid it exceeding a limit.
    size int
}

type BlockHash int64

// key: NamespacedName of the pod;
// value: the last time the corresponding chunk was sent to this server.
type CachedServers map[types.NamespacedName]time.Time
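A hedged sketch of how recording a routing decision might update this table, folding in the size cap discussed below (record and maxSize are assumed names; a real implementation would also need locking for concurrent access):

// record marks every chunk hash of the request as cached on the chosen pod, refreshing
// the pod's last-updated time, and stops adding new entries once the size cap is hit.
func (t *PrefixCacheLookupTable) record(hashes []BlockHash, pod types.NamespacedName, maxSize int) {
    now := time.Now()
    for _, h := range hashes {
        servers, ok := t.table[h]
        if !ok {
            servers = CachedServers{}
        }
        if _, cached := servers[pod]; !cached {
            if t.size >= maxSize {
                return // cap reached: stop adding new entries until eviction frees space
            }
            t.size++
        }
        servers[pod] = now // append the pod / refresh its last-updated time
        t.table[h] = servers
    }
}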

Cache Eviction

We spin up a goroutine that runs every cacheEvictionInterval and evicts cache entries whose last accessed time is older than cacheExpiration.

Note: we can also leverage the cache eviction goroutine to keep the size of the cache up to date. A sketch is shown below.
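A minimal sketch of that eviction loop (assumed names; again, a real implementation would guard the table with a mutex):

// startEvictionLoop ticks every cacheEvictionInterval and drops (hash, pod) entries
// that have not been refreshed within cacheExpiration, decrementing the size counter.
func (t *PrefixCacheLookupTable) startEvictionLoop(cacheEvictionInterval, cacheExpiration time.Duration, stop <-chan struct{}) {
    ticker := time.NewTicker(cacheEvictionInterval)
    go func() {
        defer ticker.Stop()
        for {
            select {
            case <-stop:
                return
            case <-ticker.C:
                cutoff := time.Now().Add(-cacheExpiration)
                for h, servers := range t.table {
                    for pod, lastUsed := range servers {
                        if lastUsed.Before(cutoff) {
                            delete(servers, pod)
                            t.size--
                        }
                    }
                    if len(servers) == 0 {
                        delete(t.table, h)
                    }
                }
            }
        }
    }()
}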

Cache Size Cap

We need to ensure the cache size never exceeds a configurable maxCacheSize. If the size exceeds the limit, we simply stop adding new cache entries until old entries are evicted (as in the record sketch above).

Implementation

We will create a prefixCacheMatcher in the scheduling package.

type prefixMatcher struct {
    // cacheChunkSize is the size of each chunk in the cache. Requests shorter than the
    // chunk size will be ignored.
    cacheChunkSize int
    // goroutine interval to scan the cache and evict entries.
    cacheEvictionInterval time.Duration
    // cacheExpiration is the time after which a cache entry is considered expired.
    cacheExpiration time.Duration
    // If the cache size exceeds this limit, the cache will stop recording new entries.
    maxCacheSize int
    // Current (approximate) size of the cache.
    size  int
    table *PrefixCacheLookupTable
}

func NewPrefixCacheMatcher() *prefixMatcher {
   return &prefixMatcher{}
}

// Given a request, and a list of candidate pods, look up the prefix cache table to find pods with the longest prefix match.
func (m *prefixMatcher) Match(request *LLMRequest, pods []backendmetrics.PodMetrics) ([]backendmetrics.PodMetrics, error) {
}

// If a request was routed to a pod, record it in the cache:
// Append the pod to the chunk hashes of the request. If the chunk was already cached on this pod, update its last-updated time.
func (m *prefixMatcher) Record(request *LLMRequest, pod backendmetrics.Pod) {
}

func (m *prefixMatcher) Evict(request *LLMRequest, pods []backendmetrics.PodMetrics) ([]backendmetrics.PodMetrics, error) {
   // Remove cache entries if last updated time is older than a threshold.
}

Routing Algorithm

Prefix cache matching needs to be model-specific, therefore we apply it after LoRA affinity. How does the prefix caching logic interact with the least-queuing and least-KV-cache logic? We define a LoadAwarePrefixCacheFilter that limits the prefix cache matching candidates to servers whose load is under a certain threshold.

In scheduler.go, we define this:

   loadAwarePrefixCacheFilter = &filter{
       name: "load aware prefix cache filter",
       // Apply prefix caching filter to pods under load threshold if possible.
       // If all pods are above the threshold, then we apply prefix caching on all of them.
       filter: toFilterFunc(queueAndKVCacheThresholdPredicate(queueThreshold, kvCacheThreshold)), 
       nextOnSuccessOrFailure: &filter{
           name:   "pods with longest prefix cache match",
           filter: prefixCacheFilterFunc, // new filter func
       },
   }

This new filter will be applied after the LoRA affinity filter and before any other filters. A hedged sketch of the new prefixCacheFilterFunc follows.
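A minimal sketch of what prefixCacheFilterFunc could look like, assuming it mirrors the Match signature shown earlier; matchedChunks is a hypothetical helper that consults the PrefixCacheLookupTable for a given pod:

// prefixCacheFilterFunc keeps only the candidate pods with the longest prefix match.
func prefixCacheFilterFunc(req *LLMRequest, pods []backendmetrics.PodMetrics) ([]backendmetrics.PodMetrics, error) {
    best := 0
    scores := make([]int, len(pods))
    for i, pod := range pods {
        // matchedChunks (hypothetical) returns how many leading chunks of req are cached on pod.
        scores[i] = matchedChunks(req, pod)
        if scores[i] > best {
            best = scores[i]
        }
    }
    if best == 0 {
        return pods, nil // nothing cached anywhere; keep the candidate set unchanged
    }
    var out []backendmetrics.PodMetrics
    for i, pod := range pods {
        if scores[i] == best {
            out = append(out, pod)
        }
    }
    return out, nil
}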

Configuration Knobs

In scheduler.go, we will add

prefixCachingQueueThreshold // default to existing queueThresholdCritical (5)
prefixCachingKVCacheThreshold // default to existing kvCacheThreshold (0.8)

In prefixMatcher.go, we will add

   // cacheChunkSize is the size of each chunk in the cache. Requests shorter than the
   // chunk size will be ignored.
   cacheChunkSize int
   // goroutine interval to scan the cache and evict entries.
   cacheEvictionInterval time.Duration
   // cacheExpiration is the time after which a cache entry is considered expired.
   cacheExpiration time.Duration
   // If the cache size (in MB) exceeds this limit, the cache will stop recording new entries.
   maxCacheSizeMB int

@smarterclayton (Contributor)

I would suggest from a scheduling perspective that we move to a standard filter-and-score approach as soon as possible: prefix matches are scoring; kv-cache can be a filter or a score (hard or soft), or potentially both. I would expect to let people have some flexibility to tweak score weighting in code before making the weights tunable, for sure.
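A hypothetical sketch, not the project's actual scheduler API, of what such a composable filter-and-score arrangement could look like (scorer, weightedScorer, and pickBest are illustrative names):

// A scorer returns a value in [0, 1] per candidate pod; higher is better.
type scorer interface {
    Score(req *LLMRequest, pods []backendmetrics.PodMetrics) []float64
}

type weightedScorer struct {
    scorer scorer
    weight float64
}

// pickBest sums the weighted scores (e.g., prefix match plus kv-cache utilization)
// and returns the index of the highest-scoring pod.
func pickBest(req *LLMRequest, pods []backendmetrics.PodMetrics, scorers []weightedScorer) int {
    totals := make([]float64, len(pods))
    for _, ws := range scorers {
        for i, s := range ws.scorer.Score(req, pods) {
            totals[i] += ws.weight * s
        }
    }
    best := 0
    for i := 1; i < len(totals); i++ {
        if totals[i] > totals[best] {
            best = i
        }
    }
    return best
}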

@smarterclayton (Contributor)

I'll note another pro of the approximate approach: we already need to track in-flight requests and understand the delta of a given EPP's traffic from its peers in order to make better scheduling decisions (often called an assumption), and this is consistent with a general approach of having composable scheduling strategies. We should be able to estimate a score that makes kv-cache awareness more accurate, as well as estimate the cost of the request better.

@smarterclayton (Contributor)

I think this would be appropriate to move to a proposal PR so we can add comments inline and begin iterating.

@liu-cong (Contributor)

> I think this would be appropriate to move to a proposal PR so we can add comments inline and begin iterating.

SG, preparing a PR for feedback

@liu-cong (Contributor)

Drafted the proposal PR: #602

kfswain added the triage/accepted label on Apr 24, 2025
@liu-cong (Contributor)

/close

#602

@k8s-ci-robot (Contributor)

@liu-cong: Closing this issue.

In response to this:

/close

#602

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
