
Commit 384510b

Merge pull request #4063 from pohly/dra-pre-scheduled-pods
KEP-3063: dra: pre-scheduled pods
2 parents edee7fc + 96c54f5 commit 384510b

File tree

1 file changed: +50 -0 lines changed
  • keps/sig-node/3063-dynamic-resource-allocation

keps/sig-node/3063-dynamic-resource-allocation/README.md

@@ -97,6 +97,7 @@ SIG Architecture for cross-cutting KEPs).
- [Ephemeral vs. persistent ResourceClaims lifecycle](#ephemeral-vs-persistent-resourceclaims-lifecycle)
- [Coordinating resource allocation through the scheduler](#coordinating-resource-allocation-through-the-scheduler)
- [Resource allocation and usage flow](#resource-allocation-and-usage-flow)
+- [Scheduled pods with unallocated or unreserved claims](#scheduled-pods-with-unallocated-or-unreserved-claims)
- [API](#api)
  - [resource.k8s.io](#resourcek8sio)
  - [core](#core)
@@ -1118,6 +1119,49 @@ If a Pod references multiple claims managed by the same driver, then the driver
can combine updating `podSchedulingContext.claims[*].unsuitableNodes` for all
of them, after considering all claims.

+### Scheduled pods with unallocated or unreserved claims
+
+There are several scenarios where a Pod might be scheduled (= `pod.spec.nodeName`
+set) while the claims that it depends on are not allocated or not reserved for
+it:
+
+* A user might manually create a pod with `pod.spec.nodeName` already set.
+* Some special cluster might use its own scheduler and schedule pods without
+  using kube-scheduler.
+* The feature might have been disabled in kube-scheduler while scheduling
+  a pod with claims.
+
+The kubelet refuses to run such pods and reports the situation through
+an event (see below). This is an error scenario that is best avoided.
+
+Users should avoid this situation by not scheduling pods manually. If they need
+a pod to run on one specific node, they can use a node selector that matches only
+the desired node and then let kube-scheduler do the normal scheduling.
+
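For illustration, a minimal sketch of that approach, assuming a node named `worker-1` and a claim template named `gpu-template` (both placeholders; the `resourceClaims` stanza follows the PodSpec extension described in this KEP): the pod is pinned through a node selector on the well-known `kubernetes.io/hostname` label instead of `pod.spec.nodeName`, so kube-scheduler still runs and can allocate and reserve the claim before binding.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  # Pin the pod to one node without bypassing kube-scheduler.
  # kubernetes.io/hostname is a label that the kubelet sets on every node.
  nodeSelector:
    kubernetes.io/hostname: worker-1
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: gpu-template
```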
+Custom schedulers should emulate the behavior of kube-scheduler and ensure that
+claims are allocated and reserved before setting `pod.spec.nodeName`.
+
+The last scenario might occur during a downgrade or because of an
+administrator's mistake. Administrators can fix this by deleting such pods or
+by ensuring that their claims become usable. The latter is work that can be
+automated in kube-controller-manager:
+
+- If `pod.spec.nodeName` is set, kube-controller-manager can be sure that
+  kube-scheduler is not doing anything for the pod.
+- If such a pod has unallocated claims, kube-controller-manager can
+  create a `PodSchedulingContext` with only the `spec.selectedNode` field set
+  to the name of the node chosen for the pod (see the sketch after this list).
+  There is no need to list suitable nodes because that choice is permanent, so
+  resource drivers don't need to check for unsuitable nodes. All they can do is
+  (re)try allocating the claim until that succeeds.
+- If such a pod has allocated claims that are not reserved for it yet,
+  then kube-controller-manager can (re)try to reserve the claim until
+  that succeeds.
+
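As a sketch of the two objects involved, assuming the pod `my-pod` in namespace `default` was bound to node `worker-1` (names, namespace, UID, and the `v1alpha2` API version are illustrative): kube-controller-manager first creates a `PodSchedulingContext` that only sets `spec.selectedNode`, and once the claim is allocated it records the pod in the claim's `status.reservedFor`.

```yaml
# Created by kube-controller-manager for an already-scheduled pod:
# only spec.selectedNode is set, no potentialNodes, because the node
# choice is already final.
apiVersion: resource.k8s.io/v1alpha2
kind: PodSchedulingContext
metadata:
  name: my-pod         # same name and namespace as the pod
  namespace: default
spec:
  selectedNode: worker-1
---
# Once allocation succeeds, the claim is reserved for the pod by
# adding a consumer reference to its status.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: my-pod-gpu
  namespace: default
status:
  reservedFor:
  - resource: pods
    name: my-pod
    uid: 01234567-89ab-cdef-0123-456789abcdef
```

Because `spec.potentialNodes` is left empty, resource drivers do not report unsuitable nodes for this pod; they simply keep retrying allocation for the selected node.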
+Once all of those steps are complete, the kubelet will notice that the claims are
+ready and run the pod. Until then it will keep checking periodically, just as
+it does for other reasons that prevent a pod from running.
+
### API

The PodSpec gets extended. To minimize the changes in core/v1, all new types
@@ -1749,6 +1793,12 @@ In addition to updating `claim.status.reservedFor`, kube-controller-manager also
ResourceClaims that are owned by a completed pod to ensure that they
get deallocated as soon as possible once they are not needed anymore.

+Finally, kube-controller-manager tries to make pods runnable that were
+[scheduled to a node
+prematurely](#scheduled-pods-with-unallocated-or-unreserved-claims) by
+triggering allocation and reserving claims when it is certain that
+kube-scheduler is not going to handle that.
+
### kube-scheduler

The scheduler plugin for ResourceClaims ("claim plugin" in this section)
