Commit 7fabb16

Graduate KEP-2879 PodsReady to stable (#4237)
* Graduate KEP-2879 PodsReady to stable
* PRR Remarks
* PRR Remarks - batching mechanism
* manual tests
1 parent 5844db9 commit 7fabb16

3 files changed, +72 -35 lines changed

keps/prod-readiness/sig-apps/2879.yaml

+3 -1
@@ -2,4 +2,6 @@ kep-number: 2879
 alpha:
   approver: "@wojtek-t"
 beta:
-  approver: "@wojtek-t"
+  approver: "@wojtek-t"
+stable:
+  approver: "@wojtek-t"

keps/sig-apps/2879-ready-pods-job-status/README.md

+65 -31
@@ -38,17 +38,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
 - [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
-- [ ] e2e Tests for all Beta API Operations (endpoints)
-- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [x] e2e Tests for all Beta API Operations (endpoints)
+- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
-- [ ] (R) Graduation criteria is in place
-- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [x] (R) Graduation criteria is in place
+- [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [x] (R) Production readiness review completed
 - [x] (R) Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
+- [x] "Implementation History" section is up-to-date for milestone
 - [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

 [kubernetes.io]: https://kubernetes.io/
 [kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -125,6 +125,7 @@ pods that have the `Ready` condition.
 - Count of ready pods.
 - Feature gate disablement.
 - Verify passing existing E2E and conformance tests for Job.
+- Added e2e test for the count of ready pods.

 ### Graduation Criteria

@@ -157,14 +158,14 @@ pods that have the `Ready` condition.
 #### GA

 - Every bug report is fixed.
-- Explore setting different batch periods for regular pod updates versus
-finished pod updates, so we can do less pod readiness updates without
-compromising how fast we can declare a job finished.
-- The job controller ignores the feature gate.
+- E2e test for the count of ready pods.
+- Lock the feature-gate and document deprecation of the feature-gate

 #### Deprecation

-N/A
+In GA+2 release:
+- Remove the feature gate definition
+- Job controller ignores the feature gate

 ### Upgrade / Downgrade Strategy

@@ -210,7 +211,16 @@ The Job controller will start populating the field again.

 ###### Are there any tests for feature enablement/disablement?

-Yes, there are tests at unit and [integration] level.
+We have unit tests (see [link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/job_controller_test.go#L236)) for
+the `status.ready` field when the feature is enabled or disabled.
+Similarly, we have integration tests (see [link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/test/integration/job/job_test.go#L1364)
+and [link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/test/integration/job/job_test.go#L1517))
+for the feature being enabled or disabled.
+
+However, due to omission we graduated to Beta without feature gate
+transition (enablement or disablement) tests. With graduation to stable it's too
+late to add these tests so we're sticking with just manual tests
+(see [here](#were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested)).

 ### Rollout, Upgrade and Rollback Planning

@@ -221,20 +231,20 @@ The field is only informative, it doesn't affect running workloads.
 ###### What specific metrics should inform a rollback?

 - An increase in `job_sync_duration_seconds`.
-- A reduction in `job_sync_num`.
+- A reduction in `job_syncs_total`.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-A manual test will be performed, as follows:
+A manual test on Beta was performed, as follows:

-1. Create a cluster in 1.23.
-1. Upgrade to 1.24.
-1. Create long running Job A, ensure that the ready field is populated.
-1. Downgrade to 1.23.
-1. Verify that ready field in Job A is not lost, but also not updated.
-1. Create long running Job B, ensure that ready field is not populated.
-1. Upgrade to 1.24.
-1. Verify that Job A and B ready field is tracked again.
+1. Create a cluster in 1.28 with the `JobReadyPods` disabled (`=false`).
+2. Simulate upgrade by modifying control-plane manifests to enable `JobReadyPods`.
+3. Create long running Job A, ensure that the ready field is populated.
+4. Simulate downgrade by modifying control-plane manifests to disable `JobReadyPods`.
+5. Verify that ready field in Job A is cleaned up shortly after the startup of the Job controller completes.
+6. Create long running Job B, ensure that ready field is not populated.
+7. Simulate upgrade by modifying control-plane manifests to enable `JobReadyPods`.
+8. Verify that Job A and B ready field is tracked again shortly after the startup of the Job controller completes.

 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

@@ -259,7 +269,7 @@ the controller doesn't create new Pods or tracks finishing Pods.
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

 - [x] Metrics
-- Metric name: `job_sync_duration_seconds`, `job_sync_total`.
+- Metric name: `job_sync_duration_seconds`, `job_syncs_total`.
 - Components exposing the metric: `kube-controller-manager`

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -279,9 +289,18 @@ No.

 - API: PUT Job/status

-Estimated throughput: at most one API call for each Job Pod reaching Ready
-condition.
-
+Estimated throughput: at most one additional API call for each Job Pod reaching
+Ready condition per second. The reason is that the update of the `.status.ready`
+field triggers another reconciliation of the Job controller.
+
+In order to control the number of reconciliations, the Job controller
+batches and deduplicates reconciliation requests within each second.
+
+The mechanism is based on reconciliation delaying queue, where the requests
+are added using the `AddAfter` function. If there is another reconciliation
+request planned within a second, the one triggered by `.status.ready` update
+is skipped.
+
 Originating component: job-controller

 ###### Will enabling / using this feature result in introducing new API types?
@@ -306,6 +325,10 @@ No.

 No.

+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No.
+
 ### Troubleshooting

 ###### How does this feature react if the API server and/or etcd is unavailable?
@@ -314,8 +337,7 @@ No change from existing behavior of the Job controller.

 ###### What are other known failure modes?

-- When the cluster has apiservers with skewed versions, the `Job.status.ready`
-might remain zero.
+No.

 ###### What steps should be taken if SLOs are not being met to determine the problem?

@@ -332,6 +354,7 @@ No change from existing behavior of the Job controller.

 - 2021-08-19: Proposed KEP starting in alpha status, including full PRR questionnaire.
 - 2022-01-05: Proposed graduation to beta.
+- 2022-03-20: Merged [PR#107476](https://github.com/kubernetes/kubernetes/pull/107476) with beta implementation

 ## Drawbacks

@@ -346,6 +369,17 @@ Pod created.
 to accept connections. On the other hand, the `Ready` condition is
 configurable through a readiness probe. If the Pod doesn't have a readiness
 probe configured, the `Ready` condition is equivalent to the `Running` phase.
-
-In other words, `Job.status.active` provides as the same behavior as
+
+In other words, `Job.status.ready` provides as the same behavior as
 `Job.status.running` with the advantage of it being configurable.
+
+- We considered exploring different batch periods for regular pod updates versus
+finished pod updates, so we can do less pod readiness updates without
+compromising how fast we can declare a job finished.
+
+However, the feature has been on for a long time already the there were no
+bugs or requests raised around the choice of batch period. Moreover, the
+introduced batch period was considered an important element of the Job
+controller, and is now not guarded by the feature gate since the
+[PR#118615](https://github.com/kubernetes/kubernetes/pull/118615) which is
+already released in 1.28.

keps/sig-apps/2879-ready-pods-job-status/kep.yaml

+4 -3
@@ -15,13 +15,14 @@ approvers:
 see-also:
 replaces:

-stage: beta
+stage: stable

-latest-milestone: "v1.24"
+latest-milestone: "v1.29"

 milestone:
   alpha: "v1.23"
   beta: "v1.24"
+  stable: "v1.29"

 feature-gates:
   - name: JobReadyPods
@@ -32,4 +33,4 @@ disable-supported: true

 metrics:
   - job_sync_duration_seconds
-  - job_sync_total
+  - job_syncs_total
