@@ -97,6 +97,7 @@ SIG Architecture for cross-cutting KEPs).
- [Ephemeral vs. persistent ResourceClaims lifecycle](#ephemeral-vs-persistent-resourceclaims-lifecycle)
- [Coordinating resource allocation through the scheduler](#coordinating-resource-allocation-through-the-scheduler)
- [Resource allocation and usage flow](#resource-allocation-and-usage-flow)
+ - [Scheduled pods with unallocated or unreserved claims](#scheduled-pods-with-unallocated-or-unreserved-claims)
- [API](#api)
  - [resource.k8s.io](#resourcek8sio)
  - [core](#core)
@@ -1118,6 +1119,49 @@ If a Pod references multiple claims managed by the same driver, then the driver
can combine updating `podSchedulingContext.claims[*].unsuitableNodes` for all
of them, after considering all claims.

+ ### Scheduled pods with unallocated or unreserved claims
+
+ There are several scenarios where a Pod might be scheduled (= `pod.spec.nodeName`
+ set) while the claims that it depends on are not allocated or not reserved for
+ it:
+
+ * A user might manually create a pod with `pod.spec.nodeName` already set.
+ * Some special clusters might use their own scheduler and schedule pods
+   without using kube-scheduler.
+ * The feature might have been disabled in kube-scheduler while scheduling
+   a pod with claims.
+
+ The kubelet refuses to run such pods and reports the situation through
+ an event (see below). It's an error scenario that is best avoided.
1136
+
+ Users should avoid this situation by not scheduling pods manually. If they
+ need a pod to land on one specific node, they can use a node selector that
+ matches only that node and then let kube-scheduler do the normal scheduling,
+ as in the sketch below.
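+
+ A minimal sketch of that approach, in Go against the core/v1 types (the node
+ name `worker-1`, the pod, and the claim name are made up for illustration):
+
+ ```go
+ package example
+
+ import (
+ 	corev1 "k8s.io/api/core/v1"
+ 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+ )
+
+ // examplePod pins the pod to one node without bypassing kube-scheduler:
+ // spec.nodeName stays empty, so the scheduler still allocates and reserves
+ // the claims before binding.
+ func examplePod() *corev1.Pod {
+ 	return &corev1.Pod{
+ 		ObjectMeta: metav1.ObjectMeta{Name: "my-pod", Namespace: "default"},
+ 		Spec: corev1.PodSpec{
+ 			// Instead of setting Spec.NodeName = "worker-1" directly ...
+ 			NodeSelector: map[string]string{
+ 				corev1.LabelHostname: "worker-1",
+ 			},
+ 			ResourceClaims: []corev1.PodResourceClaim{{
+ 				Name: "my-claim",
+ 				// The claim source referencing an existing ResourceClaim
+ 				// or a template is omitted here for brevity.
+ 			}},
+ 			Containers: []corev1.Container{{
+ 				Name:  "app",
+ 				Image: "registry.example/app:v1",
+ 				Resources: corev1.ResourceRequirements{
+ 					Claims: []corev1.ResourceClaim{{Name: "my-claim"}},
+ 				},
+ 			}},
+ 		},
+ 	}
+ }
+ ```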
+
+ Custom schedulers should emulate the behavior of kube-scheduler and ensure
+ that claims are allocated and reserved before setting `pod.spec.nodeName`;
+ the check sketched below is one way to gate that bind step.
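+
+ A possible pre-bind check, sketched against the `resource.k8s.io/v1alpha2`
+ types (`claimsReadyForPod` is an illustrative name, not an existing API):
+
+ ```go
+ package example
+
+ import (
+ 	corev1 "k8s.io/api/core/v1"
+ 	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
+ )
+
+ // claimsReadyForPod reports whether every claim of the pod is allocated and
+ // reserved for it, i.e. whether it is safe to set pod.spec.nodeName.
+ func claimsReadyForPod(pod *corev1.Pod, claims []*resourcev1alpha2.ResourceClaim) bool {
+ 	for _, claim := range claims {
+ 		if claim.Status.Allocation == nil {
+ 			return false // not allocated yet
+ 		}
+ 		reserved := false
+ 		for _, ref := range claim.Status.ReservedFor {
+ 			if ref.UID == pod.UID {
+ 				reserved = true
+ 				break
+ 			}
+ 		}
+ 		if !reserved {
+ 			return false // allocated, but not reserved for this pod
+ 		}
+ 	}
+ 	return true
+ }
+ ```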
+
+ The last scenario might occur during a downgrade or because of an
+ administrator's mistake. Administrators can fix this by deleting such pods or
+ by ensuring that their claims become usable. The latter is work that can be
+ automated in kube-controller-manager (see the sketch after this list):
+
+ - If `pod.spec.nodeName` is set, kube-controller-manager can be sure that
+   kube-scheduler is not doing anything for the pod.
+ - If such a pod has unallocated claims, kube-controller-manager can
+   create a `PodSchedulingContext` with only the `spec.selectedNode` field set
+   to the name of the node chosen for the pod. There is no need to list
+   suitable nodes because that choice is permanent, so resource drivers don't
+   need to check for unsuitable nodes. All they can do is (re)try allocating
+   the claim until that succeeds.
+ - If such a pod has allocated claims that are not reserved for it yet,
+   then kube-controller-manager can (re)try to reserve the claim until
+   that succeeds.
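+
+ A rough sketch of that remediation, against the `resource.k8s.io/v1alpha2`
+ client (`ensureClaimsUsable` is an illustrative name; the real
+ kube-controller-manager code will differ in details such as object ownership
+ and conflict handling):
+
+ ```go
+ package example
+
+ import (
+ 	"context"
+
+ 	corev1 "k8s.io/api/core/v1"
+ 	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
+ 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+ 	"k8s.io/client-go/kubernetes"
+ )
+
+ // ensureClaimsUsable triggers allocation and reservation for the claims of a
+ // pod that already has spec.nodeName set, so kube-scheduler won't act on it.
+ func ensureClaimsUsable(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod, claims []*resourcev1alpha2.ResourceClaim) error {
+ 	for _, claim := range claims {
+ 		if claim.Status.Allocation == nil {
+ 			// Trigger allocation: only spec.selectedNode is set, because
+ 			// the node choice is permanent and drivers only need to
+ 			// (re)try allocating for that one node.
+ 			psc := &resourcev1alpha2.PodSchedulingContext{
+ 				ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
+ 				Spec: resourcev1alpha2.PodSchedulingContextSpec{
+ 					SelectedNode: pod.Spec.NodeName,
+ 				},
+ 			}
+ 			if _, err := c.ResourceV1alpha2().PodSchedulingContexts(pod.Namespace).Create(ctx, psc, metav1.CreateOptions{}); err != nil {
+ 				return err // retried on the next sync
+ 			}
+ 			continue
+ 		}
+ 		if !claimReservedFor(claim, pod) {
+ 			// Allocated, but not reserved: add the pod to reservedFor.
+ 			claim.Status.ReservedFor = append(claim.Status.ReservedFor,
+ 				resourcev1alpha2.ResourceClaimConsumerReference{
+ 					Resource: "pods",
+ 					Name:     pod.Name,
+ 					UID:      pod.UID,
+ 				})
+ 			if _, err := c.ResourceV1alpha2().ResourceClaims(pod.Namespace).UpdateStatus(ctx, claim, metav1.UpdateOptions{}); err != nil {
+ 				return err // conflicts are retried with a fresh object
+ 			}
+ 		}
+ 	}
+ 	return nil
+ }
+
+ func claimReservedFor(claim *resourcev1alpha2.ResourceClaim, pod *corev1.Pod) bool {
+ 	for _, ref := range claim.Status.ReservedFor {
+ 		if ref.UID == pod.UID {
+ 			return true
+ 		}
+ 	}
+ 	return false
+ }
+ ```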
+
+ Once all of those steps are complete, kubelet will notice that the claims are
+ ready and run the pod. Until then it will keep checking periodically, just as
+ it does for other reasons that prevent a pod from running.
+
### API

The PodSpec gets extended. To minimize the changes in core/v1, all new types
@@ -1749,6 +1793,12 @@ In addition to updating `claim.status.reservedFor`, kube-controller-manager also
ResourceClaims that are owned by a completed pod to ensure that they
get deallocated as soon as possible once they are not needed anymore.

+ Finally, kube-controller-manager tries to make pods runnable that were
+ [scheduled to a node
+ prematurely](#scheduled-pods-with-unallocated-or-unreserved-claims) by
+ triggering allocation and by reserving claims once it is certain that
+ kube-scheduler is not going to handle that pod.
+
### kube-scheduler

The scheduler plugin for ResourceClaims ("claim plugin" in this section)