Consider backend augmentation vs new backend type #725

howardjohn · 2025-03-17T22:10:32Z

The InferencePool defines a new backend type. Ignoring a few things, this essentially defines a Service but with a different load balancing selection.

Gateway API already provides two mechanisms to augment the behavior of a backend: a filter on the backendRef, or a Policy attachment.

I would argue one of these methods may be more appropriate than defining a new backend type.

A major problem with using the "new backend type" pattern for something like this, IMO, is the lack of composability.

To make things simpler, let me change from discussing InferencePool to a strawman backendRef: a RoundRobinBackend type, which controls how to load balance over a set of pods. For example:

kind: RoundRobinBackend
spec:
  targetPort: 9000
  podSelector:
    app: bar

Now a user has a use case to add TLS to the backend as well. One approach they could take is to build a new backend type: TLSOriginationBackend. But this has a problem:

If the policy has a podSelector as the target, we cannot compose them at all. A user can choose to use only TLS or round robin LB, when they could perfectly reasonably want to use both.
Maybe I make my policies more complex, so RoundRobinBackend can reference Pod|arbitrary backend. This quickly gets quite obnoxious. Not only could I end up with RoundRobinBackend<TLSOriginationBackend<....<Pods>>, but implementations also need to know about all of the types.
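To make the nesting in the second option concrete, here is a minimal sketch of how it might look. Both kinds and every field name (`backendRef`, `podSelector`, `targetPort`) are hypothetical and only extend the strawman above; no such API exists.

```yaml
# Hypothetical: a TLS wrapper that points at another backend instead of Pods.
kind: TLSOriginationBackend
metadata:
  name: bar-tls
spec:
  backendRef:              # hypothetical field: wraps another backend type
    kind: RoundRobinBackend
    name: bar-rr
---
# Hypothetical: the strawman backend from above, now one layer down.
kind: RoundRobinBackend
metadata:
  name: bar-rr
spec:
  targetPort: 9000
  podSelector:
    app: bar
```

A route would reference `bar-tls`, and the implementation would have to understand both kinds, plus the order in which they are allowed to nest, before it could resolve a single Pod.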

This is not hypothetical either: Envoy Gateway has an AIServiceBackend type, and for the implementation of InferencePool they are considering AIServiceBackend pointing to an InferencePool. So you have AIServiceBackend<InferencePool<Pod>> (ref) (note: I am not involved in Envoy Gateway, so merely a bystander reading the issue).

Another problem with this approach falls on implementations. Because we have made InferencePool essentially "Service lite + some other stuff", each controller needs to become a "Service controller lite". Typically, this job is delegated to the EndpointSlice controller in Kubernetes, and all existing Gateway API controllers read EndpointSlice to determine the endpoints to include in Service references.

This leaves a few options:

A controller must implement its own InferencePool<->Pod selection. This sounds simple but, speaking as an implementation that has done this for other resource types, it is incredibly challenging to do correctly.
A controller can create a Service object behind the scenes. This feels very hacky and not like the proper long-term solution.

I would propose that InferencePool should instead augment an existing backend type (Service). This allows better composition, simplifies controllers, and is a more standard pattern in the ecosystem. Additionally, users will probably have a long tail of feature requests for functionality in InferencePool that already exists in Service (like named target ports, publishNotReadyAddresses, etc.), which can be avoided.
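As a rough sketch of what augmentation could look like, the manifests below keep a plain Service as the backend and attach two policies to it. The Service fields and `BackendTLSPolicy` are existing APIs (the `apiVersion` shown for `BackendTLSPolicy` is an assumption and depends on the installed Gateway API release). `RoundRobinPolicy`, its `example.com` group, and all object names are hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: bar
spec:
  selector:
    app: bar
  publishNotReadyAddresses: true   # existing Service feature, no new API needed
  ports:
  - name: http
    port: 9000
    targetPort: http-app           # named container port on the Pods
---
# Existing Gateway API policy: TLS origination attaches to the Service.
apiVersion: gateway.networking.k8s.io/v1alpha3   # assumption: varies by release
kind: BackendTLSPolicy
metadata:
  name: bar-tls
spec:
  targetRefs:
  - group: ""
    kind: Service
    name: bar
  validation:
    wellKnownCACertificates: System
    hostname: bar.example.com
---
# Hypothetical policy: load balancing attaches to the same Service.
apiVersion: example.com/v1alpha1
kind: RoundRobinPolicy
metadata:
  name: bar-lb
spec:
  targetRefs:
  - group: ""
    kind: Service
    name: bar
```

Both policies target the same object, so a user can apply either one or both without any backend type needing to know about the other.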

cc @LiorLieberman @louiscryan @robscott @danehans

LiorLieberman · 2025-03-19T21:16:00Z

Thanks! This makes a lot of sense. I also know you had a few ways in mind to achieve that. One is policy attachment; were there other ways?

Mind listing a few ideas to kickstart the discussion?

1 reply

howardjohn Apr 22, 2025
Author

Policy attachment is the way I had in mind. It could also be a backend filter, but I don't think that's a good idea.

kfswain · 2025-04-22T15:55:00Z

Maintainer

@howardjohn can you respond to the above comment?

robscott · 2025-05-15T19:37:28Z

Thanks for raising this point @howardjohn! I know we've discussed this a lot, but I failed to actually respond on GitHub. The decision to use a custom backend type here instead of Service was inspired by several reasons:

The Kubernetes Service API is already wildly overloaded. There's a concerted effort in SIG-Network to move away from that, and it's near impossible at this point to add any new fields to this API. Fundamentally, that means that starting a new project based on top of Service would mean that a core part of our API surface was largely unchangeable (we have lots of experience with the pain of this in GW API).
We want to ensure that all traffic goes through our custom routing. If we were using a Service, users would assume that sending a request to that Service would result in these advanced routing capabilities, when in reality that traffic would have to go through a Gateway to get the additional capabilities. This could result in many users thinking they were using our project, seeing no change in performance, and moving on. I realize that meshes can do some magical redirection here to intercept a request to a Service and send it through these routing extensions, but I don't think we want to make this project dependent on mesh. (It should absolutely work with meshes, I just don't want it to be a prerequisite for using the project altogether).
Due to the sheer number of capabilities of the Service API, using it would open up far more room for confusion. What if a user sets trafficDistribution, internalTrafficPolicy, sessionAffinityConfig, or other similarly confusing fields that would almost certainly be ignored by an EPP? It seems like we'd have to document that the only field honored by EPP on Service would be selector, which raises the question of why we'd use such a large API for a single field.

It is worth digging into alternatives though, because there are distinct advantages to each of them.

1) Policy Attachment
With this approach, we'd create a new InferencePolicy that could augment a Service. This would include configuration pointing to the EPP to use, as well as whether the extension should fail open or closed. Presumably this would expand in the future to support other Inference-specific concepts.

Advantages:

Simpler for implementers. They don't need to either recreate EndpointSlice controller or create a Service to mirror the InferencePool.
Simpler for integrations like Argo that currently hardcode "Service" as the only kind of supported backend.

Disadvantages:

See points above around limitations of using Service
Policy attachment is confusing and meant to be something we use as a last resort - it's rather difficult to discover + unclear if your Gateway implementation of choice is actually honoring the policy
Requires more resources in total, even for the simplest case ([Gateway -> Route -> Service <- InferencePolicy] instead of [Gateway -> Route -> InferencePool])
Unclear what models attach to - do they attach to a Service? An InferencePolicy? Neither feels right
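For comparison, the two resource chains might look roughly like this. `HTTPRoute` is the existing Gateway API resource. The `InferencePool` group shown is an assumption about the project's API group at the time. `InferencePolicy`, its `example.com` group, its fields (`endpointPickerRef`, `failureMode`), and all object names are hypothetical.

```yaml
# Chain A: Gateway -> Route -> InferencePool
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route-pool
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io   # assumption: group may differ by version
      kind: InferencePool
      name: my-pool
---
# Chain B: Gateway -> Route -> Service <- InferencePolicy
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route-service
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - backendRefs:
    - name: my-model-servers   # kind defaults to Service
      port: 8000
---
# Hypothetical policy carrying the EPP reference and failure behavior.
apiVersion: example.com/v1alpha1
kind: InferencePolicy
metadata:
  name: my-model-servers
spec:
  targetRefs:
  - group: ""
    kind: Service
    name: my-model-servers
  endpointPickerRef:           # hypothetical field
    name: my-epp
  failureMode: FailClose       # hypothetical field: fail open or closed
```

Chain B needs one more object, and nothing on the route itself shows that a policy is changing how the Service is load balanced.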

2) Backend Filter
With this approach, we'd create a new filter on HTTPRoute that could call out to an extension. Given the presence in the core Gateway API, it would likely need to be quite generic and could not contain Inference-specific configuration.

Advantages:

Simpler for implementers. They don't need to either recreate EndpointSlice controller or create a Service to mirror the InferencePool.
Simpler for integrations like Argo that currently hardcode "Service" as the only kind of supported backend.

Disadvantages:

See points above around limitations of using Service
In the relatively common case where multiple Route rules point to the same backend, it will be difficult/annoying to remember to configure the same extension for each. If this is missed, the EPP will lack visibility into some of the requests reaching the backends, and will perform suboptimally.
It's unlikely that many implementations would be able to support extension callouts per route rule, especially as we expand beyond Envoy-based implementations
Unclear what models attach to - do they attach to a Service? That doesn't seem right
Unclear where Inference-specific config lives that applies to an entire pool - would we need an InferencePolicy anyway?
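The multiple-rules concern can be illustrated with a sketch like the following. The `ExtensionRef` filter type on a backendRef is an existing Gateway API mechanism. The `EndpointPickerFilter` kind, its `example.com` group, the paths, and the object names are hypothetical.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/chat
    backendRefs:
    - name: my-model-servers
      port: 8000
      filters:
      - type: ExtensionRef
        extensionRef:
          group: example.com
          kind: EndpointPickerFilter   # hypothetical filter kind
          name: my-epp
  - matches:
    - path:
        type: PathPrefix
        value: /v1/completions
    backendRefs:
    - name: my-model-servers
      port: 8000
      # The filter was forgotten here, so requests matching this rule
      # reach the same Pods without going through the endpoint picker.
```

Both rules send traffic to the same Service, but only the first goes through the extension, which is the visibility gap described in the list above.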

Additional Thoughts on Implementation Complexity
There's already work going on upstream to pull the EndpointSlice controller out into a library to support different use cases. I'd rather focus on continuing the work to make the EndpointSlice controller more flexible than artificially limit our APIs due to some implementation complexity that may be temporary.

2 replies

howardjohn May 15, 2025
Author

Thanks Rob. Great context to have. I don't disagree with anything you mentioned but do still feel augmentation would be preferable; likely we just weigh the various factors differently.

My primary concern, which isn't really addressed by your comments, is the composition problem. Effectively, by making ourselves a custom backend we are busting up an entire ecosystem (currently nascent) of extensions and policies by declaring this as the one extension that fills the backend slot. It's a bad precedent IMO.

Also, I feel the concerns about Service overloading, while valid, are a bit of a classic "deprecation without viable replacement" situation. I know this is a controversial topic here so I won't say much more than that I disagree 🙂.

> EndpointSlice controller into a library to support different use cases.

While this is great, FWIW, on the 3-4 projects I've worked on where this would be relevant, I don't think it would be particularly consumable, just due to the various ways the controllers are architected.

Implementor simplicity is not the primary driver here anyways.

> (It should absolutely work with meshes, I just don't want it to be a prerequisite for using the project altogether).

FWIW, this means they also need to implement IP allocation and DNS on top of endpoints. Istio does already do this, just pointing it out.

robscott May 15, 2025

> My primary concern, which isn't really addressed by your comments, is the composition problem. Effectively, by making ourselves a custom backend we are busting up an entire ecosystem (currently nascent) of extensions and policies by declaring this as the one extension that fills the backend slot. It's a bad precedent IMO.

I'm not sure I follow this. The extensions that follow Gateway's policy attachment model should be able to apply equally to an InferencePool, Service, or ServiceImport, depending on the goal. Part of the goal of that model was to enable modularity with the idea that no single backend type would be sufficient for all the possible use cases.

With that said, I think you're suggesting that this should fall under the category of "policy" instead of being a custom backend type, otherwise we'd risk all future extensions of GW API just defining custom backend types. I agree that that's a risk if we don't provide sufficient guardrails/guidelines for this in the GW API ecosystem. So here's a theoretical starting point:

Should be a custom backend:

Describes a set of backends with meaningfully different characteristics than other backends
Acts as an attachment point for other extensions or config that are specific to this use case
Is sufficiently unique that the majority of configuration defined by other existing backend types would be irrelevant and ignored
Attributes of existing backend types are counterproductive for the use case (ie ClusterIP routing would bypass desired custom routing logic)

Should be a policy:

Augments a backend type with generic behavior that is broadly applicable (retry config, timeouts, etc)
Insufficiently portable to be configured natively on the backend type
Benefits from having separate RBAC permissions from backend configuration (ie security or auth config)
Benefits from being reused for many distinct backend groups

IMO, InferencePool meets all the criteria for being a custom backend, and also at least the first 2 points in favor of being a policy. So although one could make a compelling argument in either direction, I believe that a custom backend is still a better fit even if we just consider the above criteria in isolation.

> While this is great, FWIW, on the 3-4 projects I've worked on where this would be relevant, I don't think it would be particularly consumable, just due to the various ways the controllers are architected.

Agree that it would be difficult to integrate directly into Gateway or other related controllers. With that said, we could consider running a custom EPS controller for this purpose and/or extending the scope of the existing EPS controller to support this use case. The overarching idea here is that there's already movement on other fronts, like multi-network and ClusterIP Gateways, to make it easier to run custom EndpointSlice controllers and/or include them in tree. I'd rather plan on that long term than forever be limited by this and keep coming back to Service because it's the only API that comes with EndpointSlices today.

louiscryan · 2025-05-22T22:55:15Z

I think these criteria are a decent starting point but I think in the context of the specific solution (picked backend) some refinement is needed.

Some observations:

InferencePool is undifferentiated from Service in its definition of the member endpoint set.
The 'endpoint picker' mechanism is not a mechanism to define the member endpoints of a service but a utility to choose among them in an out-of-band manner, i.e., it is an LB/router function. Its functional interface is completely generic and equally applicable to any high-latency, high-cost service where affinity classes to subsets of endpoints can yield substantial resource savings. Nothing in that interface is LLM-specific other than the branding.
While there is functional overlap between the inline LB-controlling features of Service and a hypothetical Service-attached LB policy like the one @howardjohn suggested, the fallback routing behavior if EPP fails would need to depend on those mechanisms for the naive case, and implementations should respect things like PreferLocal etc.
While Service allows for several other facilities, some of which are likely redundant, it's not clear that they all are. You could easily see multiple ports being useful, and the allocation of ClusterIP, while on by default, can be disabled. In other words, is the 'noise' of the Service API sufficient to justify the creation of an entirely new overlapping type?
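For reference, a Service with ClusterIP allocation disabled and more than one port looks like the sketch below. These are existing core Service fields; the names, labels, and port numbers are illustrative assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-model-servers
spec:
  clusterIP: None          # headless: no virtual IP is allocated
  selector:
    app: my-model-server
  ports:
  - name: http             # inference traffic
    port: 8000
    targetPort: 8000
  - name: metrics          # a second port on the same endpoints
    port: 9090
    targetPort: 9090
```

With no ClusterIP there is no virtual IP for clients to send traffic to, while EndpointSlices for the selected Pods are still produced for a gateway to consume.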

Taking your criteria above in this context:

Custom Backend:

"Meaningfully different characteristics" - Not really, since it inherits the behavior of a Service in the fail-open case and is not meaningfully different from any other high-cost service leveraging an EPP-style LB.
I have not seen examples of policies that attach that are different from other Service types. So far only EPP has been motivated by this use case, and that's a generic capability in reality. Other attached policies I've seen in use cases are pretty generic too: authz, logging, ... This is certainly the case in Solo's AI Gateway product.
"is sufficiently unique" - See (1) above
"the Service API noise issue" - While I don't think this is a hard security issue (for which an Authz policy is needed anyway), I do agree that the Service API has some pork, some of which is actually necessary in this case (fall-back LB). I'm not sure the separation is worth the marginal UX improvement. I do agree that this argument has been made for Service 100 times and its pork is a result of this, but I don't think making a generic subset domain-specific, as is being done here, solves the issue or sets a good precedent.

Policy Decoration:

"Augments with a generic behavior". For starters, I think the behavior in this case is generic. TLS is pretty generic too, but we're proposing BackendTLSPolicy, largely because it requires a substantial amount of parameterization and to some extent because not every solution supporting the Service API can or should implement it. I actually think it's fine for attached policies to be non-generic, since this is the path most vendors will be forced into to offer differentiated behaviors on top of Service in an incremental manner.
"Insufficiently portable to be supported natively...". I don't think this holds water. What does 'native' mean in this context vs a requirement on an implementer? Implementers are absolutely capable of supporting a reference attached policy type, but core K8S could not.
"Benefits from having separate RBAC permissions" seems moot in this case since it really only affects LB and not access to the endpoints in any way. There are probably other cases where this is true, so I agree it's a criterion, just not a required/exclusionary one when this is not a requirement.
"Reuse" - There is almost certainly some re-use benefit here. The EPP service is very likely to be a shared resource with a pretty fixed set of behaviors. This is what I see in llm-d for example. Again, not a huge concern either way given that typical users will run small N counts of inference systems and large SaaS providers will be more than willing to bear the cost of duplication for squeezing out marginal improvement.

In short, I don't think the API is currently justifying its weight, and an EPP-LB policy type attached to Service would work and be generically useful.

There are alternatives where some of the criteria hold more weight, but it's not clear they are desirable outcomes. Specifically, if EPP were not an LB but defined the backend endpoint set entirely virtually. This would imply no fail-open capability, however.

There are a bunch of other non-API-related considerations that should weigh in favor of using Service and policy attachment. Chief among these is all the visualization and traffic analysis ecosystem that has built up around it.

I completely understand the desire to have some LLM branding in the APIs so people know it's 'for' that use case. I'd be fine with calling the EPP binding InferenceLBPolicy or somesuch, even though it's totally generic behavior.
