Hi,
My name is Nir Rozenbaum, and I am part of a team at IBM Research working on routing of inference workloads.
We believe the Gateway API inference extension could be a strong fit for our needs, and we would like to discuss some key requirements based on our customer use cases.
Some of the requirements we would like to raise include:
Serving Request Priority (SLA-based Routing): In our use case, request prioritization is determined by customer SLAs, ensuring that higher-paying customers receive priority over lower-tier or internal users. While the current design allows prioritization based on model criticality, we need a mechanism to prioritize requests dynamically based on SLA tiers. For example, inference requests from IBMers using IBM Cloud for internal purposes (free tier) should be deprioritized in favor of paying customers. Similarly, customers on a premium plan should experience lower wait times than those on a standard plan.
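To make the intent concrete, here is a minimal Go sketch of the kind of tier-aware ordering we have in mind; the x-sla-tier header name, the tier names, and the in-memory queue are our own illustrative assumptions, not part of the existing API:

```go
// Minimal sketch (assumptions only): order pending requests by an SLA tier
// taken from a request header. Header name, tier names, and the in-memory
// queue are illustrative, not the extension's actual API.
package main

import (
	"fmt"
	"sort"
)

// tierPriority maps a hypothetical "x-sla-tier" header value to a priority;
// higher values are dispatched first.
var tierPriority = map[string]int{
	"premium":  3,
	"standard": 2,
	"free":     1, // e.g. internal / free-tier traffic
}

type request struct {
	id   string
	tier string // value of the assumed x-sla-tier header
}

// orderBySLA reorders pending requests so that higher tiers are served first,
// preserving arrival order within a tier.
func orderBySLA(pending []request) {
	sort.SliceStable(pending, func(i, j int) bool {
		return tierPriority[pending[i].tier] > tierPriority[pending[j].tier]
	})
}

func main() {
	pending := []request{{"r1", "free"}, {"r2", "premium"}, {"r3", "standard"}}
	orderBySLA(pending)
	fmt.Println(pending) // premium first, free last
}
```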
Session Affinity & Cache-Aware Routing: We need the ability to route requests to a specific vLLM pod based on session headers. This ensures that inference requests within the same session are consistently directed to the same vLLM instance, improving efficiency and caching performance.
Maintaining Existing Routing Logic: These capabilities should be additive, complementing the current model-based and LoRA model-aware routing mechanisms rather than replacing them.
Additionally, we are looking into addressing scalability and fault tolerance challenges in the current reference implementation.
We would love to explore how best to collaborate on these requirements and contribute to the project.
Our team has extensive experience contributing to multiple CNCF projects, working with Go, and implementing real-world routing of inference workloads for enterprise customers.
What would be the best way to engage in discussions and contribute to this effort?
Looking forward to your thoughts.
1. Identifying whether per-request dynamic prioritization could be passed along from a higher proxy, vs. allowing an extension / composable strategy to manage in-flight requests based on external logic.
2. Prefix caching should be on our roadmap. I'd slightly prefer a scoring function rather than true affinity, since the cost of reusing the prefix cache has to be weighed against current load; but affinity may have other value. (A rough scoring sketch follows below.)
3. We need to more clearly establish how the model name can be used to identify consumers, document our intended scale dimensions guiding the current API, and discuss how we want to keep resource-aware processing distinct from business logic.

I suggest we continue the discussion in more specific proposals with the folks exploring 1 and 2.
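To illustrate the scoring idea in point 2, here is a rough Go sketch; the weights and the inputs (estimated prefix-cache overlap, load fraction) are illustrative assumptions rather than anything in the current implementation:

```go
// Rough sketch of scoring instead of hard affinity: reward expected
// prefix-cache reuse, penalize current load, and pick the best-scoring pod.
// Weights and inputs are arbitrary illustrations.
package main

import "fmt"

type podStats struct {
	name         string
	cacheOverlap float64 // estimated fraction of the prompt already in this pod's prefix cache (0..1)
	loadFraction float64 // e.g. current queue depth relative to capacity (0..1)
}

// score trades cache reuse against load; the 1.0 and 0.8 weights are arbitrary.
func score(p podStats) float64 {
	return 1.0*p.cacheOverlap - 0.8*p.loadFraction
}

// pickPod returns the name of the highest-scoring candidate.
func pickPod(pods []podStats) string {
	best := pods[0]
	for _, p := range pods[1:] {
		if score(p) > score(best) {
			best = p
		}
	}
	return best.name
}

func main() {
	pods := []podStats{
		{"vllm-0", 0.9, 1.00}, // warm cache, but saturated
		{"vllm-1", 0.2, 0.05}, // cold cache, mostly idle
	}
	fmt.Println(pickPod(pods)) // the idle pod wins despite the colder cache
}
```

Hard session affinity would then just be the special case where the cache term dominates the load term.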