
routing inference workloads - requirements discussion #276


Closed
nirrozenbaum opened this issue Feb 3, 2025 · 2 comments

Comments

@nirrozenbaum
Contributor

Hi,
My name is Nir Rozenbaum, and I am part of a team at IBM Research working on routing of inference workloads.
We believe the Gateway API inference extension could be a strong fit for our needs, and we would like to discuss some key requirements based on our customer use cases.

Some of the requirements we would like to raise include:

  • Serving Request Priority (SLA-based Routing): In our use case, request prioritization is determined by customer SLAs, ensuring that higher-paying customers receive priority over lower-tier or internal users. While the current design allows prioritization based on model criticality, we need a mechanism to prioritize requests dynamically based on SLA tiers. For example, inference requests from IBMers using IBM Cloud for internal purposes (free tier) should be deprioritized in favor of paying customers. Similarly, customers on a premium plan should experience lower wait times than those on a standard plan.

  • Session Affinity & Cache-Aware Routing: We need the ability to route requests to a specific vLLM pod based on session headers. This ensures that inference requests within the same session are consistently directed to the same vLLM instance, improving efficiency and caching performance. A rough sketch combining this with the SLA-tier idea above appears after this list.

  • Maintaining Existing Routing Logic: These capabilities should be additive, complementing the current model-based and LoRA model-aware routing mechanisms rather than replacing them.

  • Additionally, we are looking into addressing scalability and fault tolerance challenges in the current reference implementation.
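To make the SLA-tier and session-affinity requirements concrete, here is a minimal sketch (in Go, matching the reference implementation) of how an endpoint picker might derive a request priority and a target pod from request headers. The header names (`x-sla-tier`, `x-session-id`), the `Pod` type, and the tier table are ours for illustration only; they are not part of the Gateway API inference extension.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net/http"
)

// Pod is a stand-in for a vLLM endpoint tracked by the endpoint picker.
type Pod struct {
	Name     string
	QueueLen int // current queue depth reported by the model server
}

// tierPriority maps a hypothetical SLA-tier header value to a scheduling
// priority; higher values would be served first when capacity is tight.
var tierPriority = map[string]int{
	"premium":  100,
	"standard": 50,
	"internal": 0, // e.g. free-tier / internal usage
}

// requestPriority derives a priority from an assumed "x-sla-tier" header.
func requestPriority(req *http.Request) int {
	if p, ok := tierPriority[req.Header.Get("x-sla-tier")]; ok {
		return p
	}
	return 0 // unknown tiers get the lowest priority
}

// selectPod keeps requests from the same session on the same pod (so KV /
// prefix cache can be reused) and otherwise falls back to the least-loaded pod.
func selectPod(req *http.Request, pods []Pod) Pod {
	if sid := req.Header.Get("x-session-id"); sid != "" {
		h := fnv.New32a()
		h.Write([]byte(sid))
		return pods[int(h.Sum32())%len(pods)]
	}
	best := pods[0]
	for _, p := range pods[1:] {
		if p.QueueLen < best.QueueLen {
			best = p
		}
	}
	return best
}

func main() {
	pods := []Pod{{Name: "vllm-0", QueueLen: 3}, {Name: "vllm-1", QueueLen: 1}}
	req, _ := http.NewRequest("POST", "/v1/chat/completions", nil)
	req.Header.Set("x-sla-tier", "premium")
	req.Header.Set("x-session-id", "session-abc")
	fmt.Println(requestPriority(req), selectPod(req, pods).Name)
}
```

This is only meant to show the shape of the two requirements, not a proposed implementation; in practice the priority would feed a queueing/admission decision rather than a simple log line, and session affinity would need to tolerate pod churn.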

We would love to explore how best to collaborate on these requirements and contribute to the project.
Our team has extensive experience contributing to multiple CNCF projects, working in Go, and implementing real-world routing of inference workloads for enterprise customers.

What would be the best way to engage in discussions and contribute to this effort?
Looking forward to your thoughts.

@kfswain
Collaborator

kfswain commented Feb 3, 2025

Heya Nir! I think we met a teammate of yours, Joe, last week! We had him create a discussion on our repo: #264

We are trying to solve a lot of the same problems as you!

nirrozenbaum changed the title from "routing inference workloads - requirements dicussion" to "routing inference workloads - requirements discussion" on Feb 4, 2025
@smarterclayton
Contributor

As per the discussion today:

  1. Identifying whether dynamic prioritization per request could be passed along from a higher proxy, versus allowing an extension / composable strategy to manage requests in flight based on external logic.
  2. Prefix caching should be on our roadmap - I'd slightly prefer a scoring function over true affinity, since the gain from reusing the prefix cache has to be weighed against the pod's current load (a rough sketch of that trade-off is below). But affinity may have other value.
  3. We need to more clearly establish how model name can be used to identify consumers, document the intended scale dimensions guiding the current API, and discuss how we want to keep resource-aware processing distinct from business logic.
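
To illustrate the scoring idea in point 2, here is a minimal sketch. The field names, the linear score, and the weights are all assumptions made for illustration, not an existing API or a proposed design.

```go
package main

import "fmt"

// PodStats is an illustrative snapshot of what a scorer might know about a pod.
type PodStats struct {
	Name             string
	PrefixCacheHit   float64 // estimated fraction of the prompt already cached (0..1)
	QueueUtilization float64 // current load: 0 = idle, 1 = saturated
}

// score rewards expected prefix-cache reuse but discounts it by current load,
// so a cold-but-idle pod can beat a warm-but-saturated one.
func score(p PodStats, cacheWeight, loadWeight float64) float64 {
	return cacheWeight*p.PrefixCacheHit - loadWeight*p.QueueUtilization
}

// pickBest returns the highest-scoring pod; the weights are arbitrary here and
// would need tuning against real traffic.
func pickBest(pods []PodStats) PodStats {
	const cacheWeight, loadWeight = 1.0, 1.5
	best, bestScore := pods[0], score(pods[0], cacheWeight, loadWeight)
	for _, p := range pods[1:] {
		if s := score(p, cacheWeight, loadWeight); s > bestScore {
			best, bestScore = p, s
		}
	}
	return best
}

func main() {
	pods := []PodStats{
		{Name: "vllm-0", PrefixCacheHit: 0.9, QueueUtilization: 0.95}, // warm cache, nearly saturated
		{Name: "vllm-1", PrefixCacheHit: 0.1, QueueUtilization: 0.10}, // cold cache, mostly idle
	}
	fmt.Println(pickBest(pods).Name) // prints "vllm-1" with these weights
}
```

The point is only that cache reuse becomes one signal among several rather than a hard routing constraint; with different weights the warm pod wins, which is exactly the tuning question a scoring approach leaves open.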

I suggest we continue the discussion in more specific proposals with the folks exploring items 1 and 2.
