- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- What changes (in invocations, configurations, API use, etc.) is an existing
  cluster required to make on upgrade, in order to make use of the enhancement?
-->

#### Upgrade

Set the `JobPodReplacementPolicy` feature gate to true in the kube-apiserver and kube-controller-manager.

No other components are required.
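For example, on a cluster where you manage the control-plane flags directly, enabling the gate comes down to passing `--feature-gates` to both components (a sketch; where these flags live depends on how your control plane is deployed):

```sh
kube-apiserver --feature-gates=JobPodReplacementPolicy=true ...
kube-controller-manager --feature-gates=JobPodReplacementPolicy=true ...
```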

Jobs that want to replace pods only once they are fully terminal can set `podReplacementPolicy: Failed`.

If a Job is not using a pod failure policy, one can change `podReplacementPolicy` to `TerminatingOrFailed`. This reverts the Job to the existing behavior with the feature off.

If one is using a pod failure policy, `Failed` is the only allowable value, so `podReplacementPolicy` cannot be set to `TerminatingOrFailed`.
In this case, the recommendation is to also disable the `PodFailurePolicy` feature gate.
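In Job spec terms, the two policies look like this (a fragment for illustration):

```yaml
spec:
  # Recreate pods only once they are fully terminal:
  podReplacementPolicy: Failed
  # Or, to keep the pre-feature behavior (not allowed together with podFailurePolicy):
  # podReplacementPolicy: TerminatingOrFailed
```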

#### Downgrade

Set the `JobPodReplacementPolicy` feature gate to false in the kube-apiserver and kube-controller-manager.

After downgrading, you will no longer see any effects of `podReplacementPolicy`.

### Version Skew Strategy

This feature is limited to the control plane.

Note that kube-apiserver can be at the N+1 skew version relative to
kube-controller-manager (see the [version skew policy](https://kubernetes.io/releases/version-skew-policy/#kube-controller-manager-kube-scheduler-and-cloud-controller-manager)).
In that case, the Job controller operates on a version of the Job object that
already supports the new Job API.

<!--
If applicable, how will the component handle version skew with other
components? What are the guarantees? Make sure this is in the test plan.
#### How can a rollout or rollback fail? Can it impact already running workloads?

A rollout or rollback cannot fail, as rolling out this feature only entails turning on the `JobPodReplacementPolicy` feature gate.
Failure rates of Jobs will not increase or decrease with this feature; pods are simply marked as failed later (as we wait for the pods to become fully terminal).

This feature is opt-in for functional changes. We track terminating pods for observability reasons, but we only use this data in the `Failed` case.

If a user has set `podReplacementPolicy: Failed` or has a pod failure policy set, then
rolling back this feature means that terminating pods will be recreated as soon as they are deleted.

If a user rolls out this feature with a pod failure policy or with `podReplacementPolicy` set to `Failed`,
then pods are only recreated once they are fully terminal.
This does not affect failure counts: in both cases, the pods are eventually marked as failed.

If a user rolls out this feature without a pod failure policy or `podReplacementPolicy` set, there is no impact on existing workloads.

<!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
#### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

In beta, we are working on adding an [integration test](https://github.com/kubernetes/kubernetes/pull/119912) for these cases.

For a manual test of upgrade and rollback, we can use 1.28.

The upgrade->downgrade->upgrade path was tested manually using the `alpha`
version in 1.28 with the following steps:

1. Start the cluster with `JobPodReplacementPolicy` enabled:

   Create a KIND cluster with 1.28 and use the `config.yaml` below to turn the feature on:

   ```yaml
   kind: Cluster
   apiVersion: kind.x-k8s.io/v1alpha4
   featureGates:
     "JobPodReplacementPolicy": true
   nodes:
   - role: control-plane
   - role: worker
   ```

   Then, create a Job using `.spec.podReplacementPolicy=Failed`:

   ```sh
   kubectl create -f job.yaml
   ```

   using `job.yaml`:

   ```yaml
   apiVersion: batch/v1
   kind: Job
   metadata:
     name: job-prp
   spec:
     completions: 1
     parallelism: 1
     backoffLimit: 2
     podReplacementPolicy: Failed
     template:
       spec:
         restartPolicy: Never
         containers:
         - name: sleep
           image: gcr.io/k8s-staging-perf-tests/sleep
           args: ["-termination-grace-period", "1m", "60s"]
   ```

   Wait for the pods to be running, then delete a pod:

   ```sh
   kubectl delete pods -l job-name=job-prp
   ```

   With the feature on and `podReplacementPolicy` set to `Failed`, the replacement pod is only created once the deleted pod is fully terminated.
   While the pod is terminating, you can also see the Job status report a terminating pod:

   ```sh
   kubectl get jobs -l job-name=job-prp -o yaml
   ```

   ```yaml
   status:
     terminating: 1
   ```

2. Simulate a downgrade by creating a new KIND cluster with the feature turned off:

   ```yaml
   kind: Cluster
   apiVersion: kind.x-k8s.io/v1alpha4
   featureGates:
     "JobPodReplacementPolicy": false
   nodes:
   - role: control-plane
   - role: worker
   ```

   Then, delete the pods of the Job:

   ```sh
   kubectl delete pods -l job-name=job-prp
   ```

   The Job status should report no terminating pods, and a replacement pod is created before the old pod finishes terminating. With the Job above, you should see a terminating pod and a newly created pod at the same time.

3. Simulate an upgrade by creating a new KIND cluster with the feature turned on:

   ```yaml
   kind: Cluster
   apiVersion: kind.x-k8s.io/v1alpha4
   featureGates:
     "JobPodReplacementPolicy": true
   nodes:
   - role: control-plane
   - role: worker
   ```

   Deleting a pod now creates a replacement pod only once the old pod is fully terminated,
   and the status field again reports the terminating pod.

   This demonstrates that the feature works again for the Job.

<!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
#### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

We did not propose any SLO/SLI for this feature.

#### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
#### Are there any missing metrics that would be useful to have to improve observability of this feature?

In beta, we will add a new metric, `job_pods_creation_total`.

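Once the metric lands, an operator could, for instance, watch the overall pod-creation rate from the Job controller (a sketch; labels on the metric are not specified here):

```promql
sum(rate(job_pods_creation_total[5m]))
```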
### Dependencies
In [Risks and Mitigations](#risks-and-mitigations) we discuss the interaction with [3329-retriable-and-non-retriable-failures](https://github.com/kubernetes/enhancements/blob/master/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md).
#### Will enabling / using this feature result in any new API calls?

In the Job controller, we only update the Job status if any field in `Job.Status` changes. With this feature on, we additionally track `terminating` pods in that status.
It is possible to see an increase in Job status updates if many pods are being terminated.
However, if pods are being terminated, we would also expect other fields (`active`, `failed`, etc.) to be updated, so there should not be a large increase in API calls for patching.

#### Will enabling / using this feature result in introducing new API types?
#### What are other known failure modes?

There are no other known failure modes.

#### What steps should be taken if SLOs are not being met to determine the problem?

One could disable this feature.

Alternatively, to keep the feature on, one could suspend the Jobs that use it.
Setting `suspend: true` in the Job spec halts the execution of that Job.
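As a sketch, suspending comes down to setting `spec.suspend` on the Job (shown here as a spec fragment):

```yaml
spec:
  suspend: true  # the Job controller deletes active pods and creates no new ones until resumed
```

This can be applied in place with, for example, `kubectl patch job job-prp --type=merge -p '{"spec":{"suspend":true}}'`.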