# Prefix Cache Aware Proposal #498
I looked into this and here is a writeup on the problem and proposed solution. Please feel free to comment!

## Background

Prefix caching is a well-known technique in LLM inference to avoid duplicate tensor computation for prompts that share the same prefixes, and it is available in many model servers and inference frameworks:

- vLLM has the automatic prefix cache (APC) feature, which caches in the accelerator HBM and uses an LRU cache eviction strategy.
- The vLLM production stack is exploring a prefix-aware router to exploit vLLM's APC feature. The WIP PR implements two strategies: HashTrie-based matching and SimHash-based consistent hashing. The HashTrie solution shows a better cache hit rate.
- SGLang has a cache-aware routing strategy which builds a radix tree from request history.
- AIBrix uses a distributed prefix cache pool and a customized vLLM that can load cache from the pool. At request routing time, its Prefix Router maximizes prefix cache hits in model server HBM. It currently implements a hash-based matching strategy (similar to vLLM) and a radix-tree-based one (similar to SGLang).
- KubeAI uses a Consistent Hashing with Bounded Loads (CHWBL) algorithm which hashes request prefixes up to a configurable length (and therefore loses some accuracy), and uses an "overflow" strategy when a server is hot.

## Goals

Implement a prefix-aware routing algorithm on the EPP to maximize the cache hit rate on the model servers.

## Non-goals
## Design Options

### 1. Session affinity

Session affinity is based on client attributes such as IP address. It works well for use cases such as multi-turn conversations, where requests from the same client tend to share the same prefixes. This, of course, highly depends on the nature of the use case.

Pros:
Cons:
### 2. Prefix affinity consistent hashing

This goes a step beyond session affinity by using a prefix-aware hash function to route requests with similar prefixes to the same or similar servers. A naive hash function can simply take the hash of the first N characters/tokens of the request, so that all requests with the same first N characters/tokens are routed to the same server (a minimal sketch appears after the pros/cons below). The vLLM production stack is exploring this strategy using SimHash, and preliminary experiments showed mixed results. KubeAI uses a simple strategy that only hashes the request prefix up to a configurable length.

Pros:
Cons:
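A minimal sketch of the naive variant described above (my own illustration, not from any of the cited routers); the server names and the use of FNV hashing are assumptions:

```go
// Naive prefix-affinity routing: hash the first n characters of the prompt,
// so requests sharing those characters land on the same server.
package main

import (
	"fmt"
	"hash/fnv"
)

// routeByPrefix hashes the first n bytes of the prompt and maps the result
// onto one of the available servers.
func routeByPrefix(prompt string, n int, servers []string) string {
	if len(prompt) < n {
		n = len(prompt)
	}
	h := fnv.New64a()
	h.Write([]byte(prompt[:n]))
	return servers[h.Sum64()%uint64(len(servers))]
}

func main() {
	servers := []string{"pod-a", "pod-b", "pod-c"}
	// Both prompts share the same first 16 characters, so they route together.
	fmt.Println(routeByPrefix("You are a helpful assistant. Summarize...", 16, servers))
	fmt.Println(routeByPrefix("You are a helpful assistant. Translate...", 16, servers))
}
```

Any request whose first N characters match an earlier one lands on the same pod, which is what makes this approach both cheap and coarse.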
### 3. Approximate prefix cache on the router

This builds on the intuition that if the router remembers which server each request was routed to, then a new request sharing a prefix with an earlier one will most likely hit the prefix cache on that same server.

Pros:
Cons:
### 4. Accurate prefix cache on the router

If the router knows exactly which prefixes are currently cached on each model server replica, it can make the optimal decision. A potential solution is to have the model server (or a sidecar) report its KV cache indexes to the router.

Pros:
Cons:
## How does prefix cache affinity routing work with LoRA affinity and load-aware routing?

Prefix caching needs to be LoRA aware, as different adapters don't share the same KV cache. Therefore, the prefix cache affinity algorithm should be applied after LoRA affinity, to avoid conflicts. By doing so, requests will naturally be colocated first by LoRA adapter and then by prefix match. Prefix affinity also needs to be aware of server load, otherwise we will create hot spots. To prevent this, a two-step process can be used (one possible shape is sketched below):
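One possible shape of the two-step process (my own sketch; the load metric, threshold, and type names are assumptions, not the proposal's API):

```go
// Two-step selection: first drop pods that are too hot, then prefer the
// remaining pod with the longest prefix match. Assumes pods is non-empty.
package main

import "fmt"

type podMetrics struct {
	Name          string
	QueueLength   int // in-flight / waiting requests
	PrefixMatched int // number of prefix chunks already cached on this pod
}

func pickPod(pods []podMetrics, loadThreshold int) podMetrics {
	best, found := podMetrics{}, false
	fallback := pods[0]
	for _, p := range pods {
		if p.QueueLength < fallback.QueueLength {
			fallback = p
		}
		if p.QueueLength > loadThreshold {
			continue // step 1: skip hot pods
		}
		if !found || p.PrefixMatched > best.PrefixMatched {
			best, found = p, true // step 2: prefer the longest prefix match
		}
	}
	if !found {
		return fallback // every pod is hot: fall back to the least loaded one
	}
	return best
}

func main() {
	pods := []podMetrics{{"pod-a", 2, 5}, {"pod-b", 9, 8}, {"pod-c", 3, 1}}
	fmt.Println(pickPod(pods, 5).Name) // pod-a: pod-b matches more but is too hot
}
```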
## Proposal

### Preferred Approach

Implement an approximate prefix cache lookup on the EPP, similar to SGLang or AIBrix, potentially using a vLLM-like strategy, i.e., prefix block hashes and an LRU cache eviction strategy. A request is broken down into N chunks with the same number of characters (we don't necessarily need to tokenize). For each chunk we calculate a hash based on the content of the chunk + the hash of the prefix. When we route a request to a server, we record each of its chunk hashes against that server in a lookup table.
This means all these N chunks are cached on that server. When the EPP receives a new request, it looks up the request's chunk hashes in the table to find the server with the longest matching prefix. Each entry in the table needs a recency marker (such as a last-accessed timestamp) so that stale entries can be evicted.

### Alternative Approach

The key to prefix-aware consistent hashing is to find a good hash algorithm. The vLLM production stack is exploring SimHash, which is an algorithm for detecting similarities between sets and is therefore not an exact fit for prefix matching. KubeAI uses a simple hash that only considers the request prefix up to a certain limit. I propose an improved algorithm which breaks the input into chunks and gives higher weight to chunks at the beginning (a rough sketch follows). The result of this algorithm is that two inputs with more prefix matching should end up with closer hash results. However, this needs careful tuning and benchmarking to validate.
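A rough sketch of the prefix-weighted idea (my own interpretation of the paragraph above, not a vetted algorithm; FNV is used purely for illustration):

```go
// Earlier chunks contribute larger, geometrically decreasing weights, so
// inputs sharing more leading chunks end up with numerically closer scores.
package main

import (
	"fmt"
	"hash/fnv"
)

func chunkHash(chunk string) float64 {
	h := fnv.New32a()
	h.Write([]byte(chunk))
	return float64(h.Sum32()) / float64(^uint32(0)) // normalize to [0, 1]
}

// prefixWeightedHash splits the input into chunks of chunkSize characters and
// sums their hashes with weights 1, 1/2, 1/4, ... so leading chunks dominate.
func prefixWeightedHash(input string, chunkSize int) float64 {
	score, weight := 0.0, 1.0
	for i := 0; i < len(input); i += chunkSize {
		end := i + chunkSize
		if end > len(input) {
			end = len(input)
		}
		score += weight * chunkHash(input[i:end])
		weight /= 2
	}
	return score
}

func main() {
	a := prefixWeightedHash("common prefix common prefix tail A", 8)
	b := prefixWeightedHash("common prefix common prefix tail B", 8)
	c := prefixWeightedHash("a completely different prompt here", 8)
	// a and b differ by at most 1/16 (only their last chunk differs); c is unrelated.
	fmt.Printf("a=%.4f b=%.4f c=%.4f\n", a, b, c)
}
```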
+1 to the preferred approach, this seems like a good first attempt that would provide tangible benefits.
## Implementation Plan

### PrefixMatcher

Conceptually the goal is to find the maximum prefix match of the input string. Instead of matching characters, we divide the string into chunks of size L to improve efficiency. The loss of accuracy is not important as long as we pick a reasonable L. Instead of storing the chunk characters, we calculate a hash value to reduce the storage footprint, using a high-performance hash function such as xxhash. For each chunk we calculate a hash based on the **content of the chunk + hash of the prefix**.

**Caveat:** Since different targetModels cannot share the same prefix cache, we will always hash the targetModel into the first chunk.

#### Longest Prefix Matching

For an incoming request, we compute its chunk hashes and look them up in the cache table to find the server with the longest matching prefix (one possible sketch of the hashing and matching follows).
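A hedged sketch of the chained chunk hashing and longest-prefix lookup described above. The helper names are hypothetical, and the standard library's FNV hash stands in for xxhash to keep the example dependency-free:

```go
// Chained chunk hashing: hash_i covers the targetModel plus everything up to
// and including chunk i, so two prompts only share chunk hashes while their
// prefixes (and models) agree.
package main

import (
	"fmt"
	"hash/fnv"
)

// chunkHashes splits the prompt into chunks of chunkLen characters and returns
// the running hash after each chunk, seeded with the target model.
func chunkHashes(model, prompt string, chunkLen int) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(model)) // first chunk hash depends on the target model
	var hashes []uint64
	for i := 0; i < len(prompt); i += chunkLen {
		end := i + chunkLen
		if end > len(prompt) {
			end = len(prompt)
		}
		h.Write([]byte(prompt[i:end]))
		hashes = append(hashes, h.Sum64()) // running hash chains in the prefix
	}
	return hashes
}

// longestPrefixMatch counts how many leading chunks of the request are
// recorded as cached for a given server.
func longestPrefixMatch(hashes []uint64, cached map[uint64]bool) int {
	n := 0
	for _, hv := range hashes {
		if !cached[hv] {
			break
		}
		n++
	}
	return n
}

func main() {
	cached := map[uint64]bool{}
	for _, hv := range chunkHashes("llama-3", "You are a helpful assistant. Hi!", 16) {
		cached[hv] = true
	}
	req := chunkHashes("llama-3", "You are a helpful assistant. Plan my day.", 16)
	fmt.Println("matched chunks:", longestPrefixMatch(req, cached))
}
```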
#### Update Cache

When we route a request to a server, we add an entry for each of its chunk hashes pointing to that server. This means all these N chunks are cached on that server. Concretely in Go, the cache lookup table is defined like so:
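A plausible shape for this table (hypothetical names, not the actual code), together with the periodic eviction loop described in the Cache Eviction section below:

```go
// Hypothetical reconstruction of the lookup table and its eviction pass.
package prefix

import (
	"sync"
	"time"
)

// BlockHash is the hash of one prompt chunk (chained with its prefix).
type BlockHash uint64

// ServerID identifies a model server pod.
type ServerID string

// cacheEntry records which servers are believed to hold a chunk and when the
// chunk was last touched, so stale entries can be expired.
type cacheEntry struct {
	servers    map[ServerID]struct{}
	lastAccess time.Time
}

// lookupTable maps chunk hashes to the servers that (approximately) cache them.
type lookupTable struct {
	mu      sync.RWMutex
	entries map[BlockHash]*cacheEntry
}

// startEviction runs a periodic pass that drops entries not touched within
// ttl. The same pass could also recompute the tracked cache size to enforce
// the configurable size cap.
func (t *lookupTable) startEviction(interval, ttl time.Duration, stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				return
			case now := <-ticker.C:
				t.mu.Lock()
				for h, e := range t.entries {
					if now.Sub(e.lastAccess) > ttl {
						delete(t.entries, h) // stale: the model server has likely evicted it too
					}
				}
				t.mu.Unlock()
			}
		}
	}()
}
```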
#### Cache Eviction

We spin up a goroutine which runs at a configurable interval and evicts entries that have not been accessed recently (see the startEviction sketch above).

Note: we can also leverage the cache eviction goroutine to update the size of the cache.

#### Cache Size Cap

We need to ensure the cache size never exceeds a configurable limit.

#### Implementation

We will create a prefixCacheMatcher in the EPP (a hypothetical shape is sketched below).
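A hypothetical shape for the prefixCacheMatcher itself; the interface name and method signatures are my guesses, not the actual EPP code. Server identities are plain strings here to keep the sketch independent of the table types above:

```go
// The two operations the scheduler would need from a prefix cache matcher.
package prefix

// PrefixCacheMatcher finds servers with cached prefixes and records routing
// decisions so future requests with shared prefixes can follow them.
type PrefixCacheMatcher interface {
	// MatchedChunks returns, per server, the number of leading prompt chunks
	// believed to be cached there.
	MatchedChunks(targetModel, prompt string) map[string]int

	// RecordRouting marks all chunks of the prompt as cached on the chosen
	// server, so later requests sharing the prefix can follow it.
	RecordRouting(targetModel, prompt, server string)
}
```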
### Routing Algorithm

Prefix cache matching needs to be model specific, therefore we apply prefix cache matching after LoRA affinity. How does the prefix caching logic interact with the existing filters?
This new filter will be applied after the LoRA affinity filter, and before any other filters.

### Configuration Knobs
The chunk size L, the cache size cap, and the eviction interval/TTL should all be configurable.
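Purely as an illustration, flags covering the tunables mentioned in this plan might look like the following; all flag names and defaults are assumptions, not the actual EPP flags:

```go
// Hypothetical prefix-cache configuration knobs exposed as command-line flags.
package main

import (
	"flag"
	"time"
)

var (
	prefixChunkSize  = flag.Int("prefix-cache-chunk-size", 64, "characters per prompt chunk hashed for prefix matching (L)")
	prefixMaxEntries = flag.Int("prefix-cache-max-entries", 500000, "cap on the number of chunk hashes kept in the lookup table")
	prefixCacheTTL   = flag.Duration("prefix-cache-ttl", 10*time.Minute, "drop chunk entries not accessed within this duration")
	prefixEvictEvery = flag.Duration("prefix-cache-eviction-interval", 30*time.Second, "how often the eviction goroutine runs")
)

func main() {
	flag.Parse()
	_ = *prefixChunkSize // the matcher and eviction loop would consume these values
}
```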
I would suggest from a scheduling perspective that we move to a standard filter and score approach as soon as possible - prefix matches are scoring, kvcache can be a filter or a score (hard or soft) or potentially both. I would expect to let people have some flexibility to tweak score weighting in code before moving to make them tunable, for sure.
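A tiny illustration of the filter-and-score idea in this comment (my own sketch, not the scheduler's actual interfaces); the scores and weights are made up:

```go
// Each pod gets per-dimension scores in [0, 1] that are combined with
// code-level weights, which could later be made tunable.
package main

import "fmt"

type podScores struct {
	Name        string
	PrefixScore float64 // fraction of prompt chunks already cached on the pod
	KVCacheFree float64 // fraction of KV cache still available (soft score)
}

func totalScore(p podScores, prefixWeight, kvWeight float64) float64 {
	return prefixWeight*p.PrefixScore + kvWeight*p.KVCacheFree
}

func main() {
	pods := []podScores{{"pod-a", 0.8, 0.2}, {"pod-b", 0.3, 0.9}}
	for _, p := range pods {
		fmt.Printf("%s: %.2f\n", p.Name, totalScore(p, 0.7, 0.3))
	}
}
```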
I'll note another pro of the approximate approach is that we already have a need to track in-flight requests and understand the delta of a given EPP's traffic from its peers in order to make better scheduling decisions (often called an assumption) - this is consistent with a general approach of having composable scheduling strategies. We should be able to estimate a score that makes kv-cache awareness more accurate as well as estimating the cost of the request better.
I think this would be appropriate to move to a proposal PR so we can add comments inline and begin iterating.
SG, preparing a PR for feedback
Drafted the proposal PR: #602
/close |
@liu-cong: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
We should have a proposal for how we will handle prefix-cache aware routing and have it accepted by the community in order to begin implementation.