Skip to content

Commit b9ae0aa

Browse files
authored
Merge pull request #4321 from mimowo/backoff-limit-per-index-update-kep-beta
Update KEP3850 "BackoffLimitPerIndex for Indexed Jobs"
2 parents eb7f203 + 7cebcdf commit b9ae0aa

File tree

1 file changed

+53
-12
lines changed
  • keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs

1 file changed

+53
-12
lines changed

keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md

+53-12
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@
1212
- [Story 2](#story-2)
1313
- [Story 3](#story-3)
1414
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
15+
- [Performance benchmark](#performance-benchmark)
1516
- [Risks and Mitigations](#risks-and-mitigations)
1617
- [The Job object too big](#the-job-object-too-big)
1718
- [Expotential backoff delay issue](#expotential-backoff-delay-issue)
@@ -226,6 +227,53 @@ spec:
226227
227228
### Notes/Constraints/Caveats (Optional)
228229
230+
#### Performance benchmark
231+
232+
We assess the performance of the Beta implementation in comparison to
233+
the index jobs with regular `backoffLimit` using the two integration tests
234+
(`BenchmarkLargeIndexedJob` and `BenchmarkLargeFailureHandling`)
235+
in the [PR #121393](https://github.com/kubernetes/kubernetes/pull/121393).
236+
237+
In the `BenchmarkLargeIndexedJob` test, the measured part creates N pods
238+
and marks them as `Succeeded`, awaiting for the Job status to be updated accordingly.
239+
This is a sanity test for the `backoffLimitPerIndex`, to demonstrate that the
240+
new branches of code don't have significant performance impact.
241+
242+
Here are the results (lines re-ordered from smallest to the largest N):
243+
```sh
244+
go test -benchmem -run="^$" -timeout=80m -bench "^BenchmarkLargeIndexedJob" k8s.io/kubernetes/test/integration/job | grep "^Benchmark"
245+
BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=10-48 1 3034342185 ns/op 14391160 B/op 164352 allocs/op
246+
BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=100-48 1 3050613253 ns/op 111100464 B/op 1324757 allocs/op
247+
BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=1000-48 1 19382609963 ns/op 1133953568 B/op 13079710 allocs/op
248+
BenchmarkLargeIndexedJob/regular_indexed_job_without_failures;_size=10_000-48 1 222696805443 ns/op 11610639800 B/op 131946944 allocs/op
249+
BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=10-48 1 3025650312 ns/op 14757368 B/op 166282 allocs/op
250+
BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=100-48 1 3045479158 ns/op 114324072 B/op 1345524 allocs/op
251+
BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=1000-48 1 19384632203 ns/op 1161105080 B/op 13216319 allocs/op
252+
BenchmarkLargeIndexedJob/job_with_backoffLimitPerIndex_without_failures;_size=10_000-48 1 223635439324 ns/op 11911685592 B/op 133325939 allocs/op
253+
```
254+
255+
In the `BenchmarkLargeFailureHandling` test, the measured part of the test marks
256+
N running pods as `Failed` and awaits for the job status to be updated accordingly.
257+
In order to make the test comparable for regular indexed jobs and with
258+
backoffLimitPerIndex we set the max backoff delay due to pod failures as 10ms.
259+
Here are the results (lines re-ordered from smallest to the largest N):
260+
261+
```sh
262+
go test -benchmem -run="^$" -timeout=80m -bench "^BenchmarkLargeFailureHandling" k8s.io/kubernetes/test/integration/job | grep "^Benchmark"
263+
BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=10-48 1 2021272442 ns/op 13813736 B/op 165760 allocs/op
264+
BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=100-48 1 3036166978 ns/op 109866704 B/op 1310651 allocs/op
265+
BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=1000-48 1 21049273834 ns/op 1074301144 B/op 12832549 allocs/op
266+
BenchmarkLargeFailureHandling/regular_indexed_job_with_failures;_size=10_000-48 1 202327947010 ns/op 10926201704 B/op 131423197 allocs/op
267+
BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=10-48 1 3016501067 ns/op 14676224 B/op 175301 allocs/op
268+
BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=100-48 1 3038839798 ns/op 112090728 B/op 1323948 allocs/op
269+
BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=1000-48 1 21057643253 ns/op 1096364096 B/op 13008669 allocs/op
270+
BenchmarkLargeFailureHandling/job_with_backoffLimitPerIndex_with_failures;_size=10_000-48 1 202373728278 ns/op 11185209520 B/op 132578325 allocs/op
271+
```
272+
273+
The above results show that the jobs using `.spec.backoffLimitPerIndex` are be
274+
slower for about 1% compared to regular indexed jobs. In practice the difference
275+
is expected to be covered by the expotential backoff delay due to pod failures.
276+
229277
<!--
230278
What are the caveats to the proposal?
231279
What are some important details that didn't come across above?
@@ -636,18 +684,11 @@ https://storage.googleapis.com/k8s-triage/index.html
636684
We expect no non-infra related flakes in the last month as a GA graduation criteria.
637685
-->
638686

639-
The following scenarios will be covered with e2e tests:
640-
- handling of the `.spec.backoffLimitPerIndex` when the `FailIndex` is used -
641-
the Job's index is marked as failed,
642-
- handling of the `.spec.backoffLimitPerIndex` when the number of failures for
643-
an index exceeds the backoff - the index is marked as failed, but the Job
644-
continues its execution until all indexes are finished.
645-
- handling of the `.spec.backoffLimitPerIndex` when `.spec.maxFailedIndexes`
646-
is set and exceeded - the Job is marked as Failed and its execution is
647-
terminated.
648-
649-
More integration tests might be added to ensure good code coverage based on the
650-
actual implementation.
687+
The following scenario is covered with e2e tests for Beta:
688+
- [sig-apps#gce](https://testgrid.k8s.io/sig-apps#gce):
689+
- Job should execute all indexes despite some failing when using backoffLimitPerIndex
690+
- Job should terminate job execution when the number of failed indexes exceeds maxFailedIndexes
691+
- Job should mark indexes as failed when the FailIndex action is matched in podFailurePolicy
651692

652693
### Graduation Criteria
653694

0 commit comments

Comments
 (0)