[v0.1 API Review] Cleaning up optional fields/clearer wording (#185)
* feedback updates
* updating the actual API obj as well
* Generating new manifests/ Additional Criticality wording
* More feedback updates
* generating new manifests
* feedback updates
* generated code
* build fix
* test fixes
* Renaming the 'base' Criticality band to 'Standard'
* detailing how to handle unset criticality
```diff
@@ -83,6 +89,7 @@ type InferenceModelSpec struct {
     // TargetModels allow multiple versions of a model for traffic splitting.
     //
     // +optional
     // +kubebuilder:validation:MaxItems=10
+    // +kubebuilder:validation:XValidation:message="Weights should be set for all models, or none of the models.",rule="self.all(model, has(model.weight)) || self.all(model, !has(model.weight))"
```
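The new XValidation line enforces that weights are either set on every target model or on none of them. Purely as an illustration of that invariant (not code from this PR), the same check in plain Go might look like the sketch below; `TargetModel` is a trimmed stand-in for the API type and the model names are made up:

```go
package main

import "fmt"

// TargetModel is a trimmed stand-in for the API type; Weight is a pointer
// because the field is optional, which is what lets the CEL rule use has().
type TargetModel struct {
	Name   string
	Weight *int
}

// weightsAllOrNone reports whether every target model sets a weight, or none
// do — the same invariant the XValidation rule expresses in CEL.
func weightsAllOrNone(models []TargetModel) bool {
	withWeight := 0
	for _, m := range models {
		if m.Weight != nil {
			withWeight++
		}
	}
	return withWeight == 0 || withWeight == len(models)
}

func main() {
	w := 50
	// One weighted model and one unweighted model violates the rule.
	mixed := []TargetModel{{Name: "llama-base"}, {Name: "llama-lora", Weight: &w}}
	fmt.Println(weightsAllOrNone(mixed)) // false
}
```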
The same file also gains the Criticality wording (around line 131):

```diff
@@ ... @@
+    // Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional(use a pointer), and set no default.
+    // This allows us to union this with a oneOf field in the future should we wish to adjust/extend this behavior.
```
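Since any reference to Criticality must stay an optional pointer with no default, every consumer has to decide what an unset value means. Below is a minimal sketch of that decision point, assuming for illustration that nil is treated like the "Standard" band mentioned in the commit list; the band names themselves are assumptions here, since this diff only shows the comment for the most important band:

```go
package main

import "fmt"

// Criticality mirrors the bounded enum from the diff. The concrete band
// names below are assumptions for this sketch (Standard comes from the
// "Renaming the 'base' Criticality band to 'Standard'" commit).
type Criticality string

const (
	Critical  Criticality = "Critical"
	Standard  Criticality = "Standard"
	Sheddable Criticality = "Sheddable"
)

// effectiveCriticality resolves the optional field: nil is interpreted as
// Standard in this sketch — the API itself deliberately sets no default,
// so this choice belongs to the consumer, not to the API.
func effectiveCriticality(c *Criticality) Criticality {
	if c == nil {
		return Standard
	}
	return *c
}

func main() {
	var unset *Criticality
	crit := Critical
	fmt.Println(effectiveCriticality(unset)) // Standard
	fmt.Println(effectiveCriticality(&crit)) // Critical
}
```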
docs/proposals/002-api-proposal/proposal.md (+21 −19)
```diff
@@ -78,6 +78,7 @@ The API design is based on these axioms:
 - This solution should be composable with other Gateway solutions and flexible to fit customer needs
 - The MVP will heavily assume requests are done using the OpenAI spec, but open to extension in the future
 - The Gateway should route in a way that does not generate a queue of requests at the model server level
+- Model serving differs from web-serving in critical ways. One of these is the existence of multiple models for the same service, which can materially impact behavior, depending on the model served. As opposed to a web-service that has mechanisms to render implementation changes invisible to an end user
 
 The [PoC](https://youtu.be/NUBZg_uqqXk?si=v681EeYdGUGEVqQQ&t=1458) was focused on lower-level scheduling. And the API follows that similar logic, which lead to the proposal of the **InferencePool**.
 
```
```diff
@@ -126,27 +127,21 @@ type InferencePool struct {
 
 type InferencePoolSpec struct {
     // ModelServerSelector uses label selection to watch model server pods
-    // that should be included in the InferencePool. ModelServers should not
-    // be with any other Service or InferencePool, that behavior is not supported
...
 // InferenceModel represents a set of Models/Adapters that are multiplexed onto one
-// or more Inferencepools. This resource is managed by the "Inference Workload Owner"
+// or more InferencePools. This resource is managed by the "Inference Workload Owner"
 // persona. The Inference Workload Owner persona is: a team that trains, verifies, and
 // leverages a large language model from a model frontend, drives the lifecycle
 // and rollout of new versions of those models, and defines the specific
 // performance and latency goals for the model. These workloads are
-// expected to operate within an InferencePool sharing compute capacity with other
-// InferenceModels, defined by the Inference Platform Admin. We allow a user who
-// has multiple InferenceModels across multiple pools (with the same config) to
-// specify the configuration exactly once, and deploy to many pools
-// simultaneously. Enabling a simpler config and single source of truth
-// for a given user. InferenceModel ModelNames are unique for a given InferencePool,
+// expected to coexist within an InferencePool: sharing compute capacity with other
+// InferenceModels, with sharing limitations defined by the Inference Platform Admin.
 type InferenceModel struct {
     metav1.ObjectMeta
     metav1.TypeMeta
```
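The ModelServerSelector comment above only says "label selection", so the concrete field type is not visible in this hunk. As a rough sketch, selecting a pool's model server pods could look like the following, assuming a standard metav1.LabelSelector and purely illustrative labels:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// InferencePoolSpec is a trimmed stand-in for the API type in the diff; the
// LabelSelector type is an assumption based on the "label selection" wording.
type InferencePoolSpec struct {
	ModelServerSelector metav1.LabelSelector
}

func main() {
	// Select the pods that serve this pool; the label key/value are made up.
	spec := InferencePoolSpec{
		ModelServerSelector: metav1.LabelSelector{
			MatchLabels: map[string]string{"app": "vllm-llama2-7b"},
		},
	}
	fmt.Println(spec.ModelServerSelector.MatchLabels)
}
```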
```diff
@@ -155,28 +150,35 @@ type InferenceModel struct {
 }
 
 type InferenceModelSpec struct {
-    // The name of the model as the users set in the "model" parameter in the requests.
-    // The name should be unique among the workloads that reference the same backend pool.
-    // This is the parameter that will be used to match the request with. In the future, we may
-    // allow to match on other request parameters. The other approach to support matching on
-    // on other request parameters is to use a different ModelName per HTTPFilter.
-    // Names can be reserved without implementing an actual model in the pool.
+    // The name of the model as it will be set in the "model" parameter for an incoming request.
+    // ModelNames are expected to be unique for a specific InferencePool
+    // (names can be reused for a different pool in the same cluster).
+    // The modelName with the oldest creation timestamp is retained, and the incoming
+    // InferenceModel is sets the Ready status to false with a corresponding reason.
+    // In the rare case of a race condition, one Model will be selected randomly to be considered valid, and the other rejected.
+    // Names can be reserved without an underlying model configured in the pool.
     // This can be done by specifying a target model and setting the weight to zero,
     // an error will be returned specifying that no valid target model is found.
     ModelName string
     // Optional
     // Defines how important it is to serve the model compared to other models referencing the same pool.
+    // Criticality impacts how traffic is handled in resource constrained situations. It handles this by
+    // queuing or rejecting requests of lower criticality. InferenceModels of an equivalent Criticality will
+    // fairly share resources over throughput of tokens. In the future, the metric used to calculate fairness,
+    // and the proportionality of fairness will be configurable.
     Criticality *Criticality
     // Optional.
-    // Allow multiple versions of a model for traffic splitting.
-    // If not specified, the target model name is defaulted to the ModelName parameter.
+    // Allow multiple versions of a model for traffic splitting.
+    // If not specified, the target model name is defaulted to the ModelName parameter.
     // ModelName is often in reference to a LoRA adapter.
     TargetModels []TargetModel
     // Reference to the InferencePool that the model registers to. It must exist in the same namespace.
     PoolReference *LocalObjectReference
 }
 
 // Defines how important it is to serve the model compared to other models.
+// Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field should ALWAYS be optional(use a pointer), and set no default.
+// This allows us to union this with a oneOf field in the future should we wish to adjust/extend this behavior.
 type Criticality string
 const (
     // Most important. Requests to this band will be shed last.
```
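Putting the pieces of this hunk together, a spec that splits traffic between two target models, opts into a criticality band, and points at a pool might look roughly like the sketch below; the types are trimmed stand-ins for the ones in the diff, the band name follows the earlier sketch's assumption, and the model, adapter, and pool names are invented:

```go
package main

import "fmt"

// Trimmed stand-ins for the API types shown in the diff above.
type Criticality string

type TargetModel struct {
	Name   string
	Weight *int
}

type LocalObjectReference struct {
	Name string
}

type InferenceModelSpec struct {
	ModelName     string
	Criticality   *Criticality
	TargetModels  []TargetModel
	PoolReference *LocalObjectReference
}

func main() {
	critical := Criticality("Critical")
	stable, canary := 90, 10

	// One client-facing ModelName, split 90/10 between a stable and a canary
	// adapter; both weights are set, satisfying the all-or-none rule.
	spec := InferenceModelSpec{
		ModelName:   "sql-assistant",
		Criticality: &critical,
		TargetModels: []TargetModel{
			{Name: "sql-assistant-v1", Weight: &stable},
			{Name: "sql-assistant-v2-canary", Weight: &canary},
		},
		PoolReference: &LocalObjectReference{Name: "llama-pool"},
	}
	fmt.Printf("%s -> %d target models\n", spec.ModelName, len(spec.TargetModels))
}
```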
```diff
@@ -200,7 +202,7 @@ type TargetModel struct {
     Name string
     // Weight is used to determine the percentage of traffic that should be
     // sent to this target model when multiple versions of the model are specified.
-    Weight int
+    Weight *int
 }
 
 // LocalObjectReference identifies an API object within the namespace of the
```
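With Weight now a *int, consumers can no longer read the zero value as "unset". The sketch below shows one way of resolving optional weights into a traffic split; the even split when no weights are set is an assumption for illustration, not something this PR specifies:

```go
package main

import "fmt"

// TargetModel mirrors the post-change field shape: Weight is optional.
type TargetModel struct {
	Name   string
	Weight *int
}

// trafficSplit returns each target model's share of traffic as a fraction.
// Per the validation rule earlier in this PR, weights are either set on all
// models or on none; the even split for the "none" case is an assumption.
func trafficSplit(models []TargetModel) map[string]float64 {
	split := make(map[string]float64, len(models))
	total := 0
	for _, m := range models {
		if m.Weight == nil {
			// No weights anywhere: divide traffic evenly across the models.
			for _, n := range models {
				split[n.Name] = 1.0 / float64(len(models))
			}
			return split
		}
		total += *m.Weight
	}
	if total == 0 {
		// All weights zero: the name is reserved with no valid target.
		return split
	}
	for _, m := range models {
		split[m.Name] = float64(*m.Weight) / float64(total)
	}
	return split
}

func main() {
	stable, canary := 90, 10
	weighted := []TargetModel{
		{Name: "stable", Weight: &stable},
		{Name: "canary", Weight: &canary},
	}
	fmt.Println(trafficSplit(weighted)) // map[canary:0.1 stable:0.9]
}
```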