Commit b475709

feat(generic-worker): fine-tune auto-abortion of tasks
1 parent cf4c8f2 commit b475709

22 files changed

+338
-164
lines changed

changelog/issue-7769.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+audience: users
+level: patch
+reference: issue 7769
+---
+Generic Worker: resource monitor will print out its usage summary after aborting the task.

changelog/issue-7770.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+audience: worker-deployers
+level: minor
+reference: issue 7770
+---
+Generic Worker: adds resource monitoring auto-abortion settings to fine-tune how your worker aborts running task processes.
+
+* `absoluteHighMemoryThreshold`: The minimum amount of available memory (in bytes) required before considering task abortion. If available memory drops below this value, it may trigger an abort. Default: `524288000` (500MiB).
+* `relativeHighMemoryThreshold`: The percentage of total system memory usage that, if exceeded, contributes to the decision to abort the task. Default: `90`.
+* `allowedHighMemoryDurationSecs`: The maximum duration (in seconds) that high memory usage conditions can persist before the task is aborted. Default: `5`.
+
+Generic Worker will auto-abort a task if the total system memory used percentage is greater than `relativeHighMemoryThreshold` _AND_ the available memory is less than `absoluteHighMemoryThreshold` for longer than `allowedHighMemoryDurationSecs`, unless `disableOOMProtection` is enabled.

generated/references.json

Lines changed: 3 additions & 3 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

tools/d2g/genericworker/generated_types.go

Lines changed: 10 additions & 8 deletions

ui/docs/reference/workers/generic-worker/usage.mdx

Lines changed: 33 additions & 2 deletions
@@ -113,6 +113,25 @@ and reports back results to the queue.
 ** OPTIONAL ** properties
 =========================
 
+absoluteHighMemoryThreshold     Number of bytes the resource monitor uses to
+                                determine when to abort a task due to high
+                                memory usage. This is an absolute number of bytes
+                                of available memory below which the task is aborted.
+                                For example, if the value is 524288000, then the worker will
+                                abort a task if the memory available is 500MiB or less
+                                for longer than allowedHighMemoryDurationSecs seconds.
+                                Can be used in conjunction with relativeHighMemoryThreshold.
+                                Does nothing if disableOOMProtection is set to true.
+                                [default: 524288000] (500MiB)
+allowedHighMemoryDurationSecs   The number of seconds the resource monitor will
+                                allow the system memory usage to be above the high
+                                memory thresholds (see absoluteHighMemoryThreshold
+                                and relativeHighMemoryThreshold) before aborting
+                                the task. If the memory usage is above the high
+                                memory thresholds for longer than this time, the
+                                worker will abort the task. Does nothing if
+                                disableOOMProtection is set to true.
+                                [default: 5]
 availabilityZone                The EC2 availability zone of the worker.
 cachesDir                       The directory where task caches should be stored on
                                 the worker. The directory will be created if it does
@@ -160,8 +179,9 @@ and reports back results to the queue.
                                 [default: false]
 disableOOMProtection            If true, the worker will continue to monitor system
                                 memory usage, but will not abort tasks when the
-                                system memory usage is at 90% or higher for five
-                                consecutive measurements at 0.5s intervals.
+                                system memory usage hits the absoluteHighMemoryThreshold
+                                AND relativeHighMemoryThreshold for longer than
+                                allowedHighMemoryDurationSecs seconds.
                                 [default: false]
 downloadsDir                    The directory to cache downloaded files for
                                 populating preloaded caches and readonly mounts. The
@@ -258,6 +278,17 @@ and reports back results to the queue.
 publicIP                        The IP address for VNC access. Also used by chain of
                                 trust when present.
 region                          The EC2 region of the worker. Used by chain of trust.
+relativeHighMemoryThreshold     A percent used by the resource monitor to determine
+                                when to abort a task due to high memory usage.
+                                This is a relative value, meaning that it is
+                                relative to the total memory available on the
+                                worker. For example, if the value is 90, then
+                                the worker will abort a task if the memory
+                                usage is at 90% or higher for longer than
+                                allowedHighMemoryDurationSecs seconds. Can be used
+                                in conjunction with absoluteHighMemoryThreshold.
+                                Does nothing if disableOOMProtection is set to true.
+                                [default: 90]
 requiredDiskSpaceMegabytes      The garbage collector will ensure at least this
                                 number of megabytes of disk space are available
                                 when each task starts. If it cannot free enough

workers/generic-worker/README.md

Lines changed: 33 additions & 2 deletions
@@ -114,6 +114,25 @@ and reports back results to the queue.
 ** OPTIONAL ** properties
 =========================
 
+absoluteHighMemoryThreshold     Number of bytes the resource monitor uses to
+                                determine when to abort a task due to high
+                                memory usage. This is an absolute number of bytes
+                                of available memory below which the task is aborted.
+                                For example, if the value is 524288000, then the worker will
+                                abort a task if the memory available is 500MiB or less
+                                for longer than allowedHighMemoryDurationSecs seconds.
+                                Can be used in conjunction with relativeHighMemoryThreshold.
+                                Does nothing if disableOOMProtection is set to true.
+                                [default: 524288000] (500MiB)
+allowedHighMemoryDurationSecs   The number of seconds the resource monitor will
+                                allow the system memory usage to be above the high
+                                memory thresholds (see absoluteHighMemoryThreshold
+                                and relativeHighMemoryThreshold) before aborting
+                                the task. If the memory usage is above the high
+                                memory thresholds for longer than this time, the
+                                worker will abort the task. Does nothing if
+                                disableOOMProtection is set to true.
+                                [default: 5]
 availabilityZone                The EC2 availability zone of the worker.
 cachesDir                       The directory where task caches should be stored on
                                 the worker. The directory will be created if it does
@@ -161,8 +180,9 @@ and reports back results to the queue.
                                 [default: false]
 disableOOMProtection            If true, the worker will continue to monitor system
                                 memory usage, but will not abort tasks when the
-                                system memory usage is at 90% or higher for five
-                                consecutive measurements at 0.5s intervals.
+                                system memory usage hits the absoluteHighMemoryThreshold
+                                AND relativeHighMemoryThreshold for longer than
+                                allowedHighMemoryDurationSecs seconds.
                                 [default: false]
 downloadsDir                    The directory to cache downloaded files for
                                 populating preloaded caches and readonly mounts. The
@@ -259,6 +279,17 @@ and reports back results to the queue.
 publicIP                        The IP address for VNC access. Also used by chain of
                                 trust when present.
 region                          The EC2 region of the worker. Used by chain of trust.
+relativeHighMemoryThreshold     A percent used by the resource monitor to determine
+                                when to abort a task due to high memory usage.
+                                This is a relative value, meaning that it is
+                                relative to the total memory available on the
+                                worker. For example, if the value is 90, then
+                                the worker will abort a task if the memory
+                                usage is at 90% or higher for longer than
+                                allowedHighMemoryDurationSecs seconds. Can be used
+                                in conjunction with absoluteHighMemoryThreshold.
+                                Does nothing if disableOOMProtection is set to true.
+                                [default: 90]
 requiredDiskSpaceMegabytes      The garbage collector will ensure at least this
                                 number of megabytes of disk space are available
                                 when each task starts. If it cannot free enough

workers/generic-worker/generated_insecure_darwin.go

Lines changed: 10 additions & 8 deletions
