
This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

EPP Multi-tenancy #724


Closed
sriumcp opened this issue Apr 22, 2025 · 6 comments
Comments

@sriumcp

sriumcp commented Apr 22, 2025

I have a question re: guidance for implementers.

Is the intent behind the current inference model and inference pool design the following?

  1. There is namespace isolation between base models: specifically, each base model gets deployed in its own k8s namespace.
  2. There is an InferencePool that targets a given base model. So, exactly one inference pool (and one base model) per k8s namespace.
  3. There can be multiple LoRA adapters for a given base model. All LoRA adapters must be loaded onto all pods for the given base model.

I'm not sure about upcoming enhancements to the CRDs, but I am trying to understand whether the above is the manner in which the current CRDs are intended to be used.

Thanks in advance for your clarifications!

@kfswain
Collaborator

kfswain commented Apr 22, 2025

The InferencePool is a grouping of compute, typically model servers that all share the same base model, yes, and it is namespace-scoped, also yes. But to clarify: it is intended that you can have multiple InferencePools in the same namespace, even if the pools serve the same base model, as long as their selectors do not match overlapping model servers (pods). So point 2 is not correct (unless there is a bug I'm unaware of).

Point 3 is currently correct.
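A sketch of that pattern, assuming the v1alpha2 InferencePool API (the pool names, namespace, labels, and port here are illustrative, and field names may differ across API versions): two pools in one namespace serving the same base model, kept valid by disjoint label selectors.

```yaml
# Two InferencePools in the same namespace for the same base model.
# Their selectors must not match overlapping pods.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool-a
  namespace: team-inference
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama
    pool: a                 # matches only pods labeled pool=a
  extensionRef:
    name: llama-pool-a-epp
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool-b
  namespace: team-inference
spec:
  targetPortNumber: 8000
  selector:
    app: vllm-llama
    pool: b                 # matches only pods labeled pool=b
  extensionRef:
    name: llama-pool-b-epp
```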

@sriumcp
Author

sriumcp commented Apr 22, 2025

Follow-up question.

Re: the EPP, is the intent to possibly allow a single EPP deployment that can be referenced by multiple InferencePools (which may be created in different namespaces)?

@kfswain
Collaborator

kfswain commented Apr 22, 2025

This has come up quite a bit; I think the jury is still out. Personally, I'm concerned that multi-tenancy could turn out to be an anti-pattern, as it creates a single point of failure and amplifies any scale issues that may occur.

For context: we intend to support more inference-routing-specific features, such as Prefix Aware Routing, which will require quite a bit of memory on the EPP. Additionally, we expect to have callouts for things like RAG or tokenization of the input (just as examples). This will require quite a bit more computational and memory overhead, so I think a multi-tenant EPP would hit scale limits faster.

@kfswain
Collaborator

kfswain commented Apr 22, 2025

I added it to our agenda for our weekly Thursday meeting, as this has come up often recently. If you have time to join and have opinions, we would love to hear them there.

meeting info here: https://github.com/kubernetes-sigs/gateway-api-inference-extension?tab=readme-ov-file#contributing

@sriumcp
Author

sriumcp commented Apr 22, 2025

It seems to me that InferencePools inside the same namespace should have the option of referring to the same EPP.

This enables isolation across namespaces while also allowing reuse of the EPP within a single namespace.
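A hypothetical sketch of what that could look like, assuming the v1alpha2 API (all names here are illustrative, and whether a shared EPP is actually supported is exactly the open question in this thread): two pools in one namespace whose `extensionRef` points at the same EPP.

```yaml
# Two InferencePools in one namespace, both referencing a single EPP
# (hypothetical shared-EPP pattern; support for this is the open question).
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: pool-a
  namespace: team-inference
spec:
  targetPortNumber: 8000
  selector:
    pool: a
  extensionRef:
    name: shared-epp        # same EPP as pool-b
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: pool-b
  namespace: team-inference
spec:
  targetPortNumber: 8000
  selector:
    pool: b
  extensionRef:
    name: shared-epp        # same EPP as pool-a
```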

@kfswain
Collaborator

kfswain commented Apr 24, 2025

We discussed this some in the OSS meeting today; when the recording is available, I can link it. Do you have a use case for reusing the EPP within a namespace? Is it simpler ops?

@kfswain kfswain changed the title Design clarification question for implementers EPP Multi-tenancy Apr 24, 2025
@kubernetes-sigs kubernetes-sigs locked and limited conversation to collaborators Apr 24, 2025
@kfswain kfswain converted this issue into discussion #736 Apr 24, 2025

