
Introduce EPP-Level Queuing/Fairness for InferenceModels #674


Open
Tracked by #681
LukeAVanDrie opened this issue Apr 10, 2025 · 2 comments

@LukeAVanDrie

Current State

Currently, the Gateway API Inference Extension handles each incoming request by invoking backend selection logic directly at the moment the request arrives. This results in a First-Come-First-Serve (FCFS) dispatch order based on arrival time, with the backend choice depending on the instantaneous pool state (LoRA affinity, KV cache utilization, backend queue lengths, etc.).
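For concreteness, today's direct-dispatch flow is roughly the following minimal sketch. The package, types, and function names here are hypothetical stand-ins for illustration, not the actual EPP code:

```go
package epp // hypothetical package, for illustration only

import "errors"

// Hypothetical stand-ins, not the real EPP types or API.
type Backend struct{ Name string }
type PoolState struct{ Backends []Backend }

var errSaturated = errors.New("no viable backend")

// selectBackend is a stub for the per-request scheduling logic that consults
// the instantaneous pool state (LoRA affinity, KV cache utilization, backend
// queue depth, ...); here it simply picks the first backend.
func selectBackend(model string, pool *PoolState) (*Backend, error) {
	if len(pool.Backends) == 0 {
		return nil, errSaturated
	}
	return &pool.Backends[0], nil
}

// forwardTo is a stub for proxying the request to the chosen backend.
func forwardTo(b *Backend, payload any) error { return nil }

// handleRequest shows the direct-dispatch flow: each request is scheduled the
// moment it arrives, so dispatch order is FCFS and a saturated pool forces an
// immediate drop/failure decision instead of a short wait.
func handleRequest(model string, payload any, pool *PoolState) error {
	backend, err := selectBackend(model, pool)
	if err != nil {
		return err
	}
	return forwardTo(backend, payload)
}
```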

Limitations

This direct dispatch model lacks mechanisms to:

  1. Guarantee Criticality-Based Service Differentiation: Cannot ensure Critical requests are always dispatched before Default or Sheddable ones when InferenceModels are competing for limited pool resources.
  2. Enforce Inter-Model (Dispatch/Fairness) Policies: Lacks a central point to apply dispatch policies (e.g., FCFS, Round Robin, or other fairness definitions) between different InferenceModels of the same criticality.
  3. Optimally Handle Contention/Saturation: Forces an immediate backend choice or request drop decision, potentially leading to suboptimal load distribution or unnecessary request failures when backends are only temporarily busy.

Feature Request: Centrally Queue Requests in the EPP before the Backend Selection Decision

Introduce a Queuing/Fairness Layer before the backend selection step; a minimal sketch follows the list below. This involves:

  • Queuing requests per InferenceModel, grouped by criticality.
  • Implementing an inter-criticality dispatch mechanism ensuring strict priority (Critical > Default > Sheddable).
  • Allowing pluggable inter-model dispatch policies (e.g., FCFS, Round Robin, or other fairness definitions) within a priority band to manage inter-model fairness.
  • Basic queue management (TTLs, queue size limits).
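A minimal sketch of what this layer could look like, continuing the hypothetical package from the sketch above. All of the names here (Criticality, FairnessLayer, DispatchPolicy, etc.) are illustrative assumptions about how the pieces might fit together, not a proposed implementation:

```go
package epp // continuing the same hypothetical package as the sketch above

import (
	"container/list"
	"errors"
	"time"
)

// Criticality bands, dispatched in strict priority order (hypothetical names
// mirroring the InferenceModel criticality levels).
type Criticality int

const (
	Critical Criticality = iota
	Default
	Sheddable
)

// queuedRequest is one pending request awaiting backend selection.
type queuedRequest struct {
	Model    string // InferenceModel name
	Enqueued time.Time
	Payload  any
}

// DispatchPolicy chooses which InferenceModel's queue to serve next within a
// single criticality band (e.g. FCFS on the oldest head, Round Robin, ...).
type DispatchPolicy interface {
	// Next returns a model with a non-empty queue, or ok=false if the band
	// has no pending requests.
	Next(queues map[string]*list.List) (model string, ok bool)
}

// FairnessLayer holds per-InferenceModel queues grouped by criticality band.
type FairnessLayer struct {
	bands    map[Criticality]map[string]*list.List
	policy   DispatchPolicy
	ttl      time.Duration // basic queue management: expire stale requests
	maxDepth int           // basic queue management: per-model queue limit
}

var ErrQueueFull = errors.New("per-model queue limit reached")

// Enqueue places a request on its model's queue within its criticality band.
// (Synchronization is omitted; a real implementation would need locking.)
func (f *FairnessLayer) Enqueue(c Criticality, r *queuedRequest) error {
	if f.bands == nil {
		f.bands = map[Criticality]map[string]*list.List{}
	}
	if f.bands[c] == nil {
		f.bands[c] = map[string]*list.List{}
	}
	q := f.bands[c][r.Model]
	if q == nil {
		q = list.New()
		f.bands[c][r.Model] = q
	}
	if f.maxDepth > 0 && q.Len() >= f.maxDepth {
		return ErrQueueFull
	}
	r.Enqueued = time.Now()
	q.PushBack(r)
	return nil
}

// Dequeue returns the next request to dispatch: strict priority across bands
// (Critical > Default > Sheddable), pluggable policy within a band, and any
// request that has outlived its TTL is dropped.
func (f *FairnessLayer) Dequeue() (*queuedRequest, bool) {
	for _, c := range []Criticality{Critical, Default, Sheddable} {
		queues := f.bands[c]
		for {
			model, ok := f.policy.Next(queues)
			if !ok {
				break // band is empty; fall through to the next band
			}
			q := queues[model]
			if q == nil || q.Len() == 0 {
				break
			}
			r := q.Remove(q.Front()).(*queuedRequest)
			if f.ttl > 0 && time.Since(r.Enqueued) > f.ttl {
				continue // expired; keep draining this band
			}
			return r, true
		}
	}
	return nil, false
}
```

In this sketch, an FCFS policy would implement Next by returning the model whose queue head has the oldest Enqueued timestamp, while a Round Robin policy would rotate through the models with non-empty queues.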

Why Add Another Layer of Queuing?

  • Enforces Criticality: Guarantees priority before resources are committed via backend selection.
  • Enables Fairness Policies: Provides the necessary state and control point for inter-model dispatch logic.
  • Improved Contention Management:
    • Decouples which request is dispatched next from where it goes (see the dispatch-loop sketch after this list).
    • Allows requests to wait for better backend states instead of immediate suboptimal dispatch or dropping.
    • Potentially improves load distribution across the pool by considering the global demand of pending requests when dispatching, reducing the chance of overloading specific backends.
    • Supports more intelligent back pressure signals for upstream components.
    • Shifts Head-of-Line (HoL) Blocking: While HoL blocking can occur in model server queues today, introducing an EPP-level queue shifts this potential blocking point "left". The benefit is that when a request at the head of the EPP queue is dispatched, the system retains the flexibility to choose the best available backend from the entire pool at that moment, rather than having prematurely committed the request to a specific, potentially suboptimal, model server queue where it might block others.
  • Potential Performance Gains: By making more informed dispatch decisions (enabled by waiting and the global view of pending demand) and improving load distribution (enabled by greater backend selection flexibility for HoL requests), this approach may improve tail latency and overall throughput, especially near saturation, compared to locking a request into a specific backend queue early; it avoids that form of scheduling regret. While adding another queuing layer introduces a new source of queuing latency, the goal is for these gains to offset that overhead.
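To make the which-versus-where decoupling concrete, a hypothetical dispatch loop built on the two sketches above (reusing FairnessLayer, selectBackend, and forwardTo) might look like the following; the retry/sleep strategy is an illustrative assumption, not part of the proposal:

```go
package epp // continuing the same hypothetical package as the sketches above

import "time"

// dispatchLoop illustrates the decoupling: Dequeue decides *which* request
// goes next (strict criticality priority plus the intra-band policy), and
// only then does backend selection decide *where* it goes, using the latest
// pool state. If no backend is viable right now, the head request waits (up
// to its TTL) instead of being dropped or committed to a poor backend.
func (f *FairnessLayer) dispatchLoop(pool *PoolState, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		req, ok := f.Dequeue()
		if !ok {
			time.Sleep(5 * time.Millisecond) // all queues empty; poll again
			continue
		}
		for {
			backend, err := selectBackend(req.Model, pool) // freshest pool view
			if err == nil {
				_ = forwardTo(backend, req.Payload) // error handling omitted
				break
			}
			if f.ttl > 0 && time.Since(req.Enqueued) > f.ttl {
				break // waited too long; shed the request
			}
			// Pool saturated: hold the request briefly and retry rather than
			// failing immediately, since a backend may free up shortly.
			time.Sleep(5 * time.Millisecond)
		}
	}
}
```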

@kfswain
Collaborator

kfswain commented Apr 24, 2025

/assign LukeAVanDrie

@kfswain kfswain added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Apr 24, 2025