|
12 | 12 | - [Story 2](#story-2)
|
13 | 13 | - [Story 3](#story-3)
|
14 | 14 | - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
|
| 15 | + - [Performance benchmark](#performance-benchmark) |
15 | 16 | - [Risks and Mitigations](#risks-and-mitigations)
|
16 | 17 | - [The Job object too big](#the-job-object-too-big)
|
17 | 18 | - [Exponential backoff delay issue](#expotential-backoff-delay-issue)
|
@@ -226,6 +227,53 @@ spec:
|
226 | 227 |
|
227 | 228 | ### Notes/Constraints/Caveats (Optional)
|
228 | 229 |
|
| 230 | +#### Performance benchmark |
| 231 | +
|
| 232 | +We assess the performance of the Beta implementation in comparison to |
| 233 | +the index jobs with regular `backoffLimit` using the two integration tests |
| 234 | +(`BenchmarkLargeIndexedJob` and `BenchmarkLargeFailureHandling`) |
| 235 | +in the [PR #121393](https://github.com/kubernetes/kubernetes/pull/121393). |
| 236 | + |
| 237 | +In the `BenchmarkLargeIndexedJob` test, the measured part creates N pods |
| 238 | +and marks them as `Succeeded`, waiting for the Job status to be updated accordingly. |
| 239 | +This is a sanity test for `backoffLimitPerIndex`, demonstrating that the |
| 240 | +new code paths don't have a significant performance impact. |
| 241 | + |
| 242 | +Here are the results (lines re-ordered from the smallest to the largest N): |
| 243 | +```sh |
| 244 | +go test -benchmem -run="^$" -timeout=80m -bench "^BenchmarkLargeIndexedJob" k8s.io/kubernetes/test/integration/job | grep "^Benchmark" |
| 245 | +BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=10-48 1 3034342185 ns/op 14391160 B/op 164352 allocs/op |
| 246 | +BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=100-48 1 3050613253 ns/op 111100464 B/op 1324757 allocs/op |
| 247 | +BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=1000-48 1 19382609963 ns/op 1133953568 B/op 13079710 allocs/op |
| 248 | +BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=10_000-48 1 222696805443 ns/op 11610639800 B/op 131946944 allocs/op |
| 249 | +BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=10-48 1 3025650312 ns/op 14757368 B/op 166282 allocs/op |
| 250 | +BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=100-48 1 3045479158 ns/op 114324072 B/op 1345524 allocs/op |
| 251 | +BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=1000-48 1 19384632203 ns/op 1161105080 B/op 13216319 allocs/op |
| 252 | +BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=10_000-48 1 223635439324 ns/op 11911685592 B/op 133325939 allocs/op |
| 253 | +``` |
| 254 | + |
| 255 | +In the `BenchmarkLargeFailureHandling` test, the measured part marks |
| 256 | +N running pods as `Failed` and waits for the Job status to be updated accordingly. |
| 257 | +To make the test comparable between regular indexed Jobs and Jobs with |
| 258 | +`backoffLimitPerIndex`, we set the maximum backoff delay due to pod failures to 10ms. |
| 259 | +Here are the results (lines re-ordered from the smallest to the largest N): |
| 260 | + |
| 261 | +```sh |
| 262 | +go test -benchmem -run="^$" -timeout=80m -bench "^BenchmarkLargeFailureHandling" k8s.io/kubernetes/test/integration/job | grep "^Benchmark" |
| 263 | +BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=10-48 1 2021272442 ns/op 13813736 B/op 165760 allocs/op |
| 264 | +BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=100-48 1 3036166978 ns/op 109866704 B/op 1310651 allocs/op |
| 265 | +BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=1000-48 1 21049273834 ns/op 1074301144 B/op 12832549 allocs/op |
| 266 | +BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=10_000-48 1 202327947010 ns/op 10926201704 B/op 131423197 allocs/op |
| 267 | +BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=10-48 1 3016501067 ns/op 14676224 B/op 175301 allocs/op |
| 268 | +BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=100-48 1 3038839798 ns/op 112090728 B/op 1323948 allocs/op |
| 269 | +BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=1000-48 1 21057643253 ns/op 1096364096 B/op 13008669 allocs/op |
| 270 | +BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=10_000-48 1 202373728278 ns/op 11185209520 B/op 132578325 allocs/op |
| 271 | +``` |
| 272 | + |
| 273 | +The above results show that Jobs using `.spec.backoffLimitPerIndex` are |
| 274 | +slower by about 1% compared to regular indexed Jobs. In practice, the difference |
| 275 | +is expected to be masked by the exponential backoff delay due to pod failures. |
| 276 | + |
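As a quick sanity check on this conclusion, the relative overhead at the largest size (N=10 000) can be derived directly from the `ns/op` figures reported above. The short Python sketch below is illustrative only (it is not part of the test suite) and simply recomputes the percentage difference:

```python
# Relative overhead of backoffLimitPerIndex, computed from the
# size=10_000 ns/op values reported by the two benchmarks above.
results_ns = {
    # benchmark name: (regular indexed job, job with backoffLimitPerIndex)
    "BenchmarkLargeIndexedJob": (222696805443, 223635439324),
    "BenchmarkLargeFailureHandling": (202327947010, 202373728278),
}

for name, (regular, per_index) in results_ns.items():
    overhead_pct = (per_index - regular) / regular * 100
    print(f"{name}: +{overhead_pct:.2f}% with backoffLimitPerIndex")
```

At N=10 000 the wall-clock overhead comes out well below 1%; the roughly 1% figure is more visible in the allocation counts (for example, 133325939 vs 131946944 allocs/op in `BenchmarkLargeIndexedJob` at N=10 000).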
229 | 277 | <!--
|
230 | 278 | What are the caveats to the proposal?
|
231 | 279 | What are some important details that didn't come across above?
|
@@ -636,18 +684,11 @@ https://storage.googleapis.com/k8s-triage/index.html
|
636 | 684 | We expect no non-infra related flakes in the last month as a GA graduation criteria.
|
637 | 685 | -->
|
638 | 686 |
|
639 |
| -The following scenarios will be covered with e2e tests: |
640 |
| -- handling of the `.spec.backoffLimitPerIndex` when the `FailIndex` is used - |
641 |
| - the Job's index is marked as failed, |
642 |
| -- handling of the `.spec.backoffLimitPerIndex` when the number of failures for |
643 |
| - an index exceeds the backoff - the index is marked as failed, but the Job |
644 |
| - continues its execution until all indexes are finished. |
645 |
| -- handling of the `.spec.backoffLimitPerIndex` when `.spec.maxFailedIndexes` |
646 |
| - is set and exceeded - the Job is marked as Failed and its execution is |
647 |
| - terminated. |
648 |
| - |
649 |
| -More integration tests might be added to ensure good code coverage based on the |
650 |
| -actual implementation. |
| 687 | +The following scenarios are covered with e2e tests for Beta: |
| 688 | +- [sig-apps#gce](https://testgrid.k8s.io/sig-apps#gce): |
| 689 | + - Job should execute all indexes despite some failing when using backoffLimitPerIndex |
| 690 | + - Job should terminate job execution when the number of failed indexes exceeds maxFailedIndexes |
| 691 | + - Job should mark indexes as failed when the FailIndex action is matched in podFailurePolicy |
651 | 692 |
|
652 | 693 | ### Graduation Criteria
|
653 | 694 |
|
|