Commit 749b49b

Merge pull request #3965 from mimowo/pod-failure-policy-kep-update
Beta update for Kubelet preemption (1.28) "KEP-3329: Retriable and non-retriable Pod failures for Jobs"
2 parents ddf81f6 + 89496a4 commit 749b49b

2 files changed, +16 -3 lines changed

keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md

+15 -2
@@ -1279,7 +1279,7 @@ condition makes it easier to determine if a failed pod should be restarted):
 - DeletionByTaintManager (Pod evicted by kube-controller-manager due to taints)
 - EvictionByEvictionAPI (Pod deleted by Eviction API)
 - DeletionByPodGC (an orphaned Pod deleted by pod GC)
-- TerminationByKubelet (Pod terminated due to graceful node shutdown or node resource pressure).
+- TerminationByKubelet (Pod terminated due to graceful node shutdown, node resource pressure, or Kubelet preemption for critical pods).
 
 The already existing `status.conditions` field in Pod will be used by kubernetes
 components to append a dedicated condition.
@@ -1713,6 +1713,10 @@ Second iteration:
 - Extend the feature documentation to explain transitioning of pending and
 terminating pods into `Failed` phase.
 
+Third iteration (1.28):
+- Add `DisruptionTarget` condition for pods which are preempted by Kubelet to make room for critical pods.
+Also, backport this fix to 1.26 and 1.27 release branches, and update the user-facing documentation to reflect this change.
+
 #### GA
 
 - Address reviews and bug reports from Beta users
@@ -2250,7 +2254,13 @@ No change from existing behavior of the Job controller.
 - Detection: Observe that the pods are not deleted when a node is tainted with `NoExecute`
 - Mitigations: disable `PodDisruptionConditions`
 - Testing: Discovered bugs are covered by unit and integration tests.
-
+- `DisruptionTarget` condition is not added to pods preempted by Kubelet when scheduling a critical pod. As a consequence
+there is no way to handle such pod failures with pod failure policy.
+- Known bug in 1.26.0-5 and 1.27.0-2
+- Bugs: described in [Add DisruptionTarget condition when preempting for critical pod](https://github.com/kubernetes/kubernetes/pull/117586)
+- Detection: Observe failed pods with reason `Preempting`, and message `Preempted in order to admit critical pod`, but without `DisruptionTarget` condition.
+- Mitigations: upgrade to a fixed version (1.26.6+, 1.27.3+ or 1.28+). Alternatively, set higher `backoffLimit` for Jobs.
+- Testing: Discovered bug is covered by an integration test.
 <!--
 For each of them, fill in the following information by copying the below template:
 - [Failure mode brief description]
@@ -2327,6 +2337,9 @@ technics apply):
 - 2023-01-03: PR "Fix clearing of rate-limiter for the queue of checks for cleaning stale pod disruption conditions" ([link](https://github.com/kubernetes/kubernetes/pull/114770))
 - 2023-01-09: PR "Adjust DisruptionTarget condition message to do not include preemptor pod metadata" ([link](https://github.com/kubernetes/kubernetes/pull/114914))
 - 2023-01-13: PR "PodGC should not add DisruptionTarget condition for pods which are in terminal phase" ([link](https://github.com/kubernetes/kubernetes/pull/115056))
+- 2023-03-17: PR "Give terminal phase correctly to all pods that will not be restarted" ([link](https://github.com/kubernetes/kubernetes/pull/115331))
+- 2023-03-18: PR "API-initiated eviction: handle deleteOptions correctly" ([link](https://github.com/kubernetes/kubernetes/pull/116554))
+- 2023-05-23: PR "Add DisruptionTarget condition when preempting for critical pod" ([link](https://github.com/kubernetes/kubernetes/pull/117586))
 
 <!--
 Major milestones in the lifecycle of a KEP should be tracked in this section.

keps/sig-apps/3329-retriable-and-non-retriable-failures/kep.yaml

+1 -1
@@ -24,7 +24,7 @@ stage: beta
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.27"
+latest-milestone: "v1.28"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
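
The "Detection" step added to the README (failed pods with reason `Preempting` and message `Preempted in order to admit critical pod`, but without a `DisruptionTarget` condition) could be checked roughly as follows. This is only a sketch and is not part of the commit. It assumes the Pod is available as a plain dictionary, for example parsed from `kubectl get pod -o json`, and that the reason and message appear in `status.reason` and `status.message`. The function names are hypothetical.

```python
from typing import Any, Mapping

# Reason and message quoted in the KEP's "Detection" bullet.
PREEMPT_REASON = "Preempting"
PREEMPT_MESSAGE = "Preempted in order to admit critical pod"


def has_disruption_target(pod: Mapping[str, Any]) -> bool:
    """Return True if the Pod has any condition of type DisruptionTarget."""
    conditions = pod.get("status", {}).get("conditions") or []
    return any(c.get("type") == "DisruptionTarget" for c in conditions)


def looks_affected(pod: Mapping[str, Any]) -> bool:
    """Hypothetical helper: True if the Pod matches the KEP's detection rule.

    Assumption: reason and message are read from status.reason and
    status.message of the Pod object.
    """
    status = pod.get("status", {})
    return (
        status.get("phase") == "Failed"
        and status.get("reason") == PREEMPT_REASON
        and PREEMPT_MESSAGE in (status.get("message") or "")
        and not has_disruption_target(pod)
    )


# Example Pod statuses (made up for illustration).
affected_pod = {
    "status": {
        "phase": "Failed",
        "reason": "Preempting",
        "message": "Preempted in order to admit critical pod",
        "conditions": [{"type": "Ready", "status": "False"}],
    }
}
fixed_pod = {
    "status": {
        "phase": "Failed",
        "reason": "Preempting",
        "message": "Preempted in order to admit critical pod",
        "conditions": [{"type": "DisruptionTarget", "status": "True"}],
    }
}

print(looks_affected(affected_pod), looks_affected(fixed_pod))
```

A pod flagged by `looks_affected` cannot be matched by a pod failure policy rule on `DisruptionTarget`, which is why the README lists upgrading or a higher `backoffLimit` as mitigations.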
