Hi,
My name is Nir Rozenbaum, and I am part of a team at IBM Research working on routing of inference workloads.
We believe the Gateway API inference extension could be a strong fit for our needs, and we would like to discuss some key requirements based on our customer use cases.
Some of the requirements we would like to raise include:
Serving Request Priority (SLA-based Routing): In our use case, request prioritization is determined by customer SLAs, ensuring that higher-paying customers receive priority over lower-tier or internal users. While the current design allows prioritization based on model criticality, we need a mechanism to prioritize requests dynamically based on SLA tiers. For example, inference requests from IBMers using IBM Cloud for internal purposes (free tier) should be deprioritized in favor of paying customers. Similarly, customers on a premium plan should experience lower wait times than those on a standard plan.
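To make the intent concrete, here is a minimal Go sketch of the kind of tier-aware ordering we have in mind; the x-sla-tier header name, the tier names, and the in-memory queue are our own illustrative assumptions, not part of the existing API:

```go
// Minimal sketch (assumptions only): order pending requests by an SLA tier
// taken from a request header. Header name, tier names, and the in-memory
// queue are illustrative, not the extension's actual API.
package main

import (
	"fmt"
	"sort"
)

// tierPriority maps a hypothetical "x-sla-tier" header value to a priority;
// higher values are dispatched first.
var tierPriority = map[string]int{
	"premium":  3,
	"standard": 2,
	"free":     1, // e.g. internal / free-tier traffic
}

type request struct {
	id   string
	tier string // value of the assumed x-sla-tier header
}

// orderBySLA reorders pending requests so that higher tiers are served first,
// preserving arrival order within a tier.
func orderBySLA(pending []request) {
	sort.SliceStable(pending, func(i, j int) bool {
		return tierPriority[pending[i].tier] > tierPriority[pending[j].tier]
	})
}

func main() {
	pending := []request{{"r1", "free"}, {"r2", "premium"}, {"r3", "standard"}}
	orderBySLA(pending)
	fmt.Println(pending) // premium first, free last
}
```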
Session Affinity & Cache-Aware Routing: We need the ability to route requests to a specific vLLM pod based on session headers. This ensures that inference requests within the same session are consistently directed to the same vLLM instance, improving efficiency and caching performance.
Maintaining Existing Routing Logic: These capabilities should be additive, complementing the current model-based and LoRA model-aware routing mechanisms rather than replacing them.
Additionally, we are looking into addressing scalability and fault tolerance challenges in the current reference implementation.
We would love to explore how best to collaborate on these requirements and contribute to the project.
Our team has extensive experience contributing to multiple CNCF projects, working with Go, and implementing real-world routing of inference workloads for enterprise customers.
What would be the best way to engage in discussions and contribute to this effort?
Looking forward to your thoughts.
1. Identifying whether per-request dynamic prioritization could be passed along from a higher proxy, vs. allowing an extension / composable strategy to manage in-flight requests based on external logic.
2. Prefix caching should be on our roadmap. I'd slightly prefer a scoring function rather than true affinity, since the cost of reusing the prefix cache has to be weighed against current load; but affinity may have other value. (A rough scoring sketch follows below.)
3. We need to more clearly establish how the model name can be used to identify consumers, document our intended scale dimensions guiding the current API, and discuss how we want to keep resource-aware processing distinct from business logic.

I suggest we continue the discussion in more specific proposals with the folks exploring 1 and 2.
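To illustrate the scoring idea in point 2, here is a rough Go sketch; the weights and the inputs (estimated prefix-cache overlap, load fraction) are illustrative assumptions rather than anything in the current implementation:

```go
// Rough sketch of scoring instead of hard affinity: reward expected
// prefix-cache reuse, penalize current load, and pick the best-scoring pod.
// Weights and inputs are arbitrary illustrations.
package main

import "fmt"

type podStats struct {
	name         string
	cacheOverlap float64 // estimated fraction of the prompt already in this pod's prefix cache (0..1)
	loadFraction float64 // e.g. current queue depth relative to capacity (0..1)
}

// score trades cache reuse against load; the 1.0 and 0.8 weights are arbitrary.
func score(p podStats) float64 {
	return 1.0*p.cacheOverlap - 0.8*p.loadFraction
}

// pickPod returns the name of the highest-scoring candidate.
func pickPod(pods []podStats) string {
	best := pods[0]
	for _, p := range pods[1:] {
		if score(p) > score(best) {
			best = p
		}
	}
	return best.name
}

func main() {
	pods := []podStats{
		{"vllm-0", 0.9, 1.00}, // warm cache, but saturated
		{"vllm-1", 0.2, 0.05}, // cold cache, mostly idle
	}
	fmt.Println(pickPod(pods)) // the idle pod wins despite the colder cache
}
```

Hard session affinity would then just be the special case where the cache term dominates the load term.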