This repository was archived by the owner on Jan 16, 2025. It is now read-only.

Commit b2dc794

npalm, github-actions[bot], and forest-pr[bot] authored
feat: add spot termination watcher (beta) (#3789)
This PR adds a lambda function to watch termination events.

- Log instance information for termination warnings.
- Optionally create a metric with dimensions for the environment and instance type.

This PR is limited to checking the termination warning. Later we can extend it to also act on terminations.

## Testing

Spot termination can be tested by initiating a termination event via the Spot Request overview (or CLI).

## Todo

- [x] Write docs
- [x] Add to multi runner
- [ ] Describe next steps in an issue.

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: forest-pr|bot <forest-pr[bot]@users.noreply.github.com>
1 parent 9a9031e commit b2dc794

65 files changed, with 1551 additions and 35 deletions

Diff for: .github/workflows/terraform.yml

+23-2
Original file line number | Diff line number | Diff line change
@@ -30,6 +30,7 @@ jobs:
3030
touch lambdas/functions/control-plane/runners.zip
3131
touch lambdas/functions/gh-agent-syncer/runner-binaries-syncer.zip
3232
touch lambdas/functions/ami-housekeeper/ami-housekeeper.zip
33+
touch lambdas/functions/termination-watcher/termination-watcher.zip
3334
- name: terraform init
3435
run: terraform init -get -backend=false -input=false
3536
- if: contains(matrix.terraform, '1.5.')
@@ -69,7 +70,18 @@ jobs:
6970
matrix:
7071
terraform: [1.5.6, "latest"]
7172
module:
72-
["ami-housekeeper", "download-lambda", "multi-runner", "runner-binaries-syncer", "runners", "setup-iam-permissions", "ssm", "webhook"]
73+
[
74+
"ami-housekeeper",
75+
"download-lambda",
76+
"lambda",
77+
"multi-runner",
78+
"runner-binaries-syncer",
79+
"runners",
80+
"setup-iam-permissions",
81+
"ssm",
82+
"termination-watcher",
83+
"webhook",
84+
]
7385
defaults:
7486
run:
7587
working-directory: modules/${{ matrix.module }}
@@ -118,7 +130,16 @@ jobs:
118130
matrix:
119131
terraform: [1.5.6, "latest"]
120132
example:
121-
["default", "ubuntu", "prebuilt", "arm64", "ephemeral", "windows", "multi-runner"]
133+
[
134+
"default",
135+
"ubuntu",
136+
"prebuilt",
137+
"arm64",
138+
"ephemeral",
139+
"termination-watcher",
140+
"windows",
141+
"multi-runner",
142+
]
122143
defaults:
123144
run:
124145
working-directory: examples/${{ matrix.example }}

Diff for: .github/workflows/update-docs.yml

+22
@@ -16,10 +16,32 @@ jobs:
1616
name: Auto update terraform docs
1717
runs-on: ubuntu-latest
1818
steps:
19+
- uses: philips-software/app-token-action@9f5d57062c9f2beaffafaa9a34f66f824ead63a9 # v2.0.0
20+
id: app
21+
with:
22+
app_id: ${{ vars.FOREST_PR_BOT_APP_ID }}
23+
app_base64_private_key: ${{ secrets.FOREST_PR_BOT_APP_KEY_BASE64 }}
24+
auth_type: installation
25+
org: philips-labs
26+
1927
- name: Checkout with GITHUB Action token
2028
uses: actions/checkout@9bb56186c3b09b4f86b1c65136769dd318469633 # ratchet:actions/checkout@v4
29+
with:
30+
token: ${{ steps.app.outputs.token }}
2131

32+
# use an app to ensure CI is triggered
2233
- name: Generate TF docs
34+
if: github.repository_owner == 'philips-labs'
35+
uses: terraform-docs/gh-actions@f6d59f89a280fa0a3febf55ef68f146784b20ba0 # ratchet:terraform-docs/[email protected]
36+
with:
37+
find-dir: .
38+
git-commit-message: "docs: auto update terraform docs"
39+
git-push: ${{ github.ref != 'refs/heads/main' || github.repository_owner != 'philips-labs' }}
40+
git-push-user-name: forest-pr|bot
41+
git-push-user-email: "forest-pr[bot]@users.noreply.github.com"
42+
43+
- name: Generate TF docs (forks)
44+
if: github.repository_owner != 'philips-labs'
2345
uses: terraform-docs/gh-actions@f6d59f89a280fa0a3febf55ef68f146784b20ba0 # ratchet:terraform-docs/[email protected]
2446
with:
2547
find-dir: .

Diff for: README.md

+4
@@ -98,6 +98,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
9898
| Name | Source | Version |
9999
|------|--------|---------|
100100
| <a name="module_ami_housekeeper"></a> [ami\_housekeeper](#module\_ami\_housekeeper) | ./modules/ami-housekeeper | n/a |
101+
| <a name="module_instance_termination_watcher"></a> [instance\_termination\_watcher](#module\_instance\_termination\_watcher) | ./modules/termination-watcher | n/a |
101102
| <a name="module_runner_binaries"></a> [runner\_binaries](#module\_runner\_binaries) | ./modules/runner-binaries-syncer | n/a |
102103
| <a name="module_runners"></a> [runners](#module\_runners) | ./modules/runners | n/a |
103104
| <a name="module_ssm"></a> [ssm](#module\_ssm) | ./modules/ssm | n/a |
@@ -163,6 +164,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
163164
| <a name="input_instance_max_spot_price"></a> [instance\_max\_spot\_price](#input\_instance\_max\_spot\_price) | Max price for spot instances per hour. This variable will be passed to the create fleet as max spot price for the fleet. | `string` | `null` | no |
164165
| <a name="input_instance_profile_path"></a> [instance\_profile\_path](#input\_instance\_profile\_path) | The path that will be added to the instance\_profile, if not set the environment name will be used. | `string` | `null` | no |
165166
| <a name="input_instance_target_capacity_type"></a> [instance\_target\_capacity\_type](#input\_instance\_target\_capacity\_type) | Default lifecycle used for runner instances, can be either `spot` or `on-demand`. | `string` | `"spot"` | no |
167+
| <a name="input_instance_termination_watcher"></a> [instance\_termination\_watcher](#input\_instance\_termination\_watcher) | Configuration for the instance termination watcher. This feature is in beta; changes will not trigger a major release as long as it is in beta.<br><br>`enable`: Enable or disable the spot termination watcher.<br>`enable_metric`: Enable or disable the metrics for the spot termination watcher.<br>`memory_size`: Memory size limit in MB of the lambda.<br>`s3_key`: S3 key for syncer lambda function. Required if using S3 bucket to specify lambdas.<br>`s3_object_version`: S3 object version for syncer lambda function. Useful if S3 versioning is enabled on source bucket.<br>`timeout`: Time out of the lambda in seconds.<br>`zip`: File location of the lambda zip file. | <pre>object({<br> enable = optional(bool, false)<br> enable_metric = optional(object({<br> spot_warning = optional(bool, false)<br> }))<br> memory_size = optional(number, null)<br> s3_key = optional(string, null)<br> s3_object_version = optional(string, null)<br> timeout = optional(number, null)<br> zip = optional(string, null)<br> })</pre> | `{}` | no |
166168
| <a name="input_instance_types"></a> [instance\_types](#input\_instance\_types) | List of instance types for the action runner. Defaults are based on runner\_os (al2023 for linux and Windows Server Core for win). | `list(string)` | <pre>[<br> "m5.large",<br> "c5.large"<br>]</pre> | no |
167169
| <a name="input_job_queue_retention_in_seconds"></a> [job\_queue\_retention\_in\_seconds](#input\_job\_queue\_retention\_in\_seconds) | The number of seconds the job is held in the queue before it is purged. | `number` | `86400` | no |
168170
| <a name="input_key_name"></a> [key\_name](#input\_key\_name) | Key pair name | `string` | `null` | no |
@@ -177,6 +179,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
177179
| <a name="input_log_level"></a> [log\_level](#input\_log\_level) | Logging level for lambda logging. Valid values are 'silly', 'trace', 'debug', 'info', 'warn', 'error', 'fatal'. | `string` | `"info"` | no |
178180
| <a name="input_logging_kms_key_id"></a> [logging\_kms\_key\_id](#input\_logging\_kms\_key\_id) | Specifies the kms key id to encrypt the logs with. | `string` | `null` | no |
179181
| <a name="input_logging_retention_in_days"></a> [logging\_retention\_in\_days](#input\_logging\_retention\_in\_days) | Specifies the number of days you want to retain log events for the lambda log group. Possible values are: 0, 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, and 3653. | `number` | `180` | no |
182+
| <a name="input_metrics_namespace"></a> [metrics\_namespace](#input\_metrics\_namespace) | The namespace for the metrics created by the module. Metrics will only be created if explicitly enabled. | `string` | `"GitHub Runners"` | no |
180183
| <a name="input_minimum_running_time_in_minutes"></a> [minimum\_running\_time\_in\_minutes](#input\_minimum\_running\_time\_in\_minutes) | The time an ec2 action runner should be running at minimum before terminated, if not busy. | `number` | `null` | no |
181184
| <a name="input_pool_config"></a> [pool\_config](#input\_pool\_config) | The configuration for updating the pool. The `pool_size` to adjust to by the events triggered by the `schedule_expression`. For example you can configure a cron expression for weekdays to adjust the pool to 10 and another expression for the weekend to adjust the pool to 1. | <pre>list(object({<br> schedule_expression = string<br> size = number<br> }))</pre> | `[]` | no |
182185
| <a name="input_pool_lambda_memory_size"></a> [pool\_lambda\_memory\_size](#input\_pool\_lambda\_memory\_size) | Memory size limit for scale-up lambda. | `number` | `512` | no |
@@ -248,6 +251,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
248251
| Name | Description |
249252
|------|-------------|
250253
| <a name="output_binaries_syncer"></a> [binaries\_syncer](#output\_binaries\_syncer) | n/a |
254+
| <a name="output_instance_termination_watcher"></a> [instance\_termination\_watcher](#output\_instance\_termination\_watcher) | n/a |
251255
| <a name="output_queues"></a> [queues](#output\_queues) | SQS queues. |
252256
| <a name="output_runners"></a> [runners](#output\_runners) | n/a |
253257
| <a name="output_ssm_parameters"></a> [ssm\_parameters](#output\_ssm\_parameters) | n/a |

Diff for: docs/configuration.md

+37
@@ -175,6 +175,11 @@ This tracing config generates timelines for following events:
175175

176176
This feature has been disabled by default.
177177

178+
### Multiple runner modules in your AWS account
179+
180+
The watcher will act on all spot termination notifications and log the ones relevant to the runner module. Therefore we suggest deploying the watcher only once. You can either deploy the watcher by enabling it in one of your deployments or deploy it as a stand-alone module.
181+
182+
178183
## Debugging
179184

180185
In case the setup does not work as intended, trace the events through this sequence:
@@ -187,6 +192,38 @@ In case the setup does not work as intended, trace the events through this seque
187192

188193
## Experimental features
189194

195+
### Termination watcher
196+
197+
This feature is in an early stage and therefore disabled by default.
198+
199+
The termination watcher currently watches for spot termination notifications. When deployed as part of one of the main modules (root or multi-runner), the module by default only takes events into account for instances tagged with `ghr:environment`. The module can also be deployed stand-alone; in that case the tag filter needs to be tuned.
200+
201+
- Logs: The module will log all termination notifications. For each warning it will look up instance details and log the environment, the instance type, and the time the instance has been running, as well as some other details.
202+
- Metrics: Metrics are disabled by default to avoid costs. Once enabled, a metric will be created for each warning with at least dimensions for the environment and instance type. The metric namespace can be configured via the variables. The metric name used is `SpotInterruptionWarning`.
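
Reviewer note: to make the metric description concrete, here is a minimal TypeScript sketch of what one data point for a spot warning could look like. The object shape follows the CloudWatch `PutMetricData` request format; the function name `spotWarningMetric` and the dimension names `Environment` and `InstanceType` are assumptions for illustration and are not taken from this commit. The actual lambda may emit the metric through a different mechanism.

```typescript
// Hypothetical types mirroring the CloudWatch PutMetricData request shape.
interface Dimension {
  Name: string;
  Value: string;
}

interface MetricDatum {
  MetricName: string;
  Dimensions: Dimension[];
  Value: number;
  Unit: "Count";
}

interface PutMetricDataInput {
  Namespace: string;
  MetricData: MetricDatum[];
}

// Builds one data point per warning. The default namespace matches the
// documented default of the `metrics_namespace` variable.
// The dimension names are assumptions, not confirmed by the commit.
function spotWarningMetric(
  environment: string,
  instanceType: string,
  namespace: string = "GitHub Runners",
): PutMetricDataInput {
  return {
    Namespace: namespace,
    MetricData: [
      {
        MetricName: "SpotInterruptionWarning",
        Dimensions: [
          { Name: "Environment", Value: environment },
          { Name: "InstanceType", Value: instanceType },
        ],
        Value: 1, // one warning received
        Unit: "Count",
      },
    ],
  };
}
```

Counting one unit per warning keeps the metric easy to sum per environment or instance type when building a dashboard or alarm.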
203+
204+
#### Log example
205+
206+
Below is an example of the log messages created.
207+
208+
```
209+
{
210+
"level": "INFO",
211+
"message": "Received spot notification warning:",
212+
"environment": "default",
213+
"instanceId": "i-0039b8826b3dcea55",
214+
"instanceType": "c5.large",
215+
"instanceLaunchTime": "2024-03-15T08:10:34.000Z",
216+
"instanceRunningTimeInSeconds": 68,
217+
"tags": [
218+
{
219+
"Key": "ghr:environment",
220+
"Value": "default"
221+
}
222+
... all tags ...
223+
]
224+
}
225+
```
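
Reviewer note: the fields in the log example can be derived from the instance details and the time the warning is received. The following TypeScript sketch shows one way to do that, assuming the tag filter is a simple check for the `ghr:environment` tag key. The names `InstanceInfo`, `SpotWarningLog`, and `buildSpotWarningLog` are hypothetical and do not come from the lambda in this commit.

```typescript
// Hypothetical shapes; the field names mirror the log example in the docs.
interface Tag {
  Key: string;
  Value: string;
}

interface InstanceInfo {
  instanceId: string;
  instanceType: string;
  launchTime: Date;
  tags: Tag[];
}

interface SpotWarningLog {
  level: "INFO";
  message: string;
  environment: string;
  instanceId: string;
  instanceType: string;
  instanceLaunchTime: string;
  instanceRunningTimeInSeconds: number;
  tags: Tag[];
}

// Returns a log record for instances that carry the filter tag,
// or undefined for instances that do not belong to a runner environment.
function buildSpotWarningLog(
  instance: InstanceInfo,
  now: Date,
  filterTagKey: string = "ghr:environment",
): SpotWarningLog | undefined {
  const environment = instance.tags.find((tag) => tag.Key === filterTagKey)?.Value;
  if (environment === undefined) {
    return undefined;
  }
  const runningMs = now.getTime() - instance.launchTime.getTime();
  return {
    level: "INFO",
    message: "Received spot notification warning:",
    environment,
    instanceId: instance.instanceId,
    instanceType: instance.instanceType,
    instanceLaunchTime: instance.launchTime.toISOString(),
    instanceRunningTimeInSeconds: Math.floor(runningMs / 1000),
    tags: instance.tags,
  };
}
```

Passing the current time in as a parameter, rather than reading the clock inside the function, keeps the running-time calculation deterministic and easy to test.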
226+
190227
### Queue to publish workflow job events
191228

192229
This queue is an experimental feature to allow you to receive a copy of the `workflow_job` events sent by the GitHub App. This can be used to calculate a matrix or monitor the system.

Diff for: docs/examples/termination-watcher.md

+1
@@ -0,0 +1 @@
1+
--8<-- "examples/termination-watcher/README.md"

Diff for: docs/index.md

+5
@@ -64,6 +64,11 @@ The control plane (scale up lambda) will store the runner registration configura
6464

6565
The AMI cleaner is a lambda that will clean up AMIs that are older than a configurable number of days. This is useful when using the AMI builder to create AMIs. The cleaner will also check which AMIs are used in the latest version of the launch template. You can also provide SSM config paths pointing to AMI IDs; the cleaner will not delete these AMIs. The AMI cleaner is opt-in; it will not be created by default.
6666

67+
### Instance Termination Watcher
68+
69+
> This feature is in beta; changes will not trigger a major release as long as it is in beta.
70+
71+
The Instance Termination Watcher creates logs and optional metrics for the termination of instances. Currently only spot termination warnings are watched. See [configuration](configuration/) for more details.
6772

6873
### Security
6974

Diff for: examples/default/README.md

+1
@@ -62,6 +62,7 @@ terraform output -raw webhook_secret
6262

6363
| Name | Description | Type | Default | Required |
6464
|------|-------------|------|---------|:--------:|
65+
| <a name="input_aws_region"></a> [aws\_region](#input\_aws\_region) | AWS region. | `string` | `"eu-west-1"` | no |
6566
| <a name="input_environment"></a> [environment](#input\_environment) | Environment name, used as prefix. | `string` | `null` | no |
6667
| <a name="input_github_app"></a> [github\_app](#input\_github\_app) | GitHub for API usages. | <pre>object({<br> id = string<br> key_base64 = string<br> })</pre> | n/a | yes |
6768

Diff for: examples/default/main.tf

+9-2
@@ -1,6 +1,6 @@
11
locals {
22
environment = var.environment != null ? var.environment : "default"
3-
aws_region = "eu-west-1"
3+
aws_region = var.aws_region
44
}
55

66
resource "random_id" "random" {
@@ -79,7 +79,7 @@ module "runners" {
7979

8080
# override delay of events in seconds
8181
delay_webhook_event = 5
82-
runners_maximum_count = 1
82+
runners_maximum_count = 2
8383

8484
# set up a fifo queue to remain order
8585
enable_fifo_build_queue = true
@@ -109,6 +109,13 @@ module "runners" {
109109
]
110110
}
111111

112+
instance_termination_watcher = {
113+
enable = true
114+
enable_metric = {
115+
spot_warning = true
116+
}
117+
}
118+
112119
}
113120

114121
module "webhook_github_app" {

Diff for: examples/default/variables.tf

+7
@@ -13,3 +13,10 @@ variable "environment" {
1313
type = string
1414
default = null
1515
}
16+
17+
variable "aws_region" {
18+
description = "AWS region."
19+
20+
type = string
21+
default = "eu-west-1"
22+
}

Diff for: examples/lambdas-download/main.tf

+8
@@ -12,6 +12,14 @@ module "lambdas" {
1212
{
1313
name = "runner-binaries-syncer"
1414
tag = var.module_version
15+
},
16+
{
17+
name = "ami-housekeeper"
18+
tag = var.module_version
19+
},
20+
{
21+
name = "termination-watcher"
22+
tag = var.module_version
1523
}
1624
]
1725
}

Diff for: examples/multi-runner/.terraform.lock.hcl

+16-16
Some generated files are not rendered by default.

Diff for: examples/multi-runner/README.md

+1
@@ -80,6 +80,7 @@ terraform output -raw webhook_secret
8080

8181
| Name | Description | Type | Default | Required |
8282
|------|-------------|------|---------|:--------:|
83+
| <a name="input_aws_region"></a> [aws\_region](#input\_aws\_region) | AWS region to deploy to | `string` | `"eu-west-1"` | no |
8384
| <a name="input_environment"></a> [environment](#input\_environment) | Environment name, used as prefix | `string` | `null` | no |
8485
| <a name="input_github_app"></a> [github\_app](#input\_github\_app) | GitHub for API usages. | <pre>object({<br> id = string<br> key_base64 = string<br> })</pre> | n/a | yes |
8586

Diff for: examples/multi-runner/main.tf

+14-1
@@ -1,6 +1,6 @@
11
locals {
22
environment = var.environment != null ? var.environment : "multi-runner"
3-
aws_region = "eu-west-1"
3+
aws_region = var.aws_region
44

55
# Load runner configurations from Yaml files
66
multi_runner_config_files = {
@@ -94,6 +94,19 @@ module "runners" {
9494

9595
# Enable debug logging for the lambda functions
9696
# log_level = "debug"
97+
98+
# Enable spot termination watcher
99+
# spot_instance_termination_watcher = {
100+
# enable = true
101+
# }
102+
103+
# Enable to track the spot instance termination warning
104+
# instance_termination_watcher = {
105+
# enable = true
106+
# enable_metric = {
107+
# spot_warning = true
108+
# }
109+
# }
97110
}
98111

99112
module "webhook_github_app" {

Diff for: examples/multi-runner/variables.tf

+7
@@ -13,3 +13,10 @@ variable "environment" {
1313
type = string
1414
default = null
1515
}
16+
17+
variable "aws_region" {
18+
description = "AWS region to deploy to"
19+
20+
type = string
21+
default = "eu-west-1"
22+
}

Diff for: examples/termination-watcher/.terraform.lock.hcl

+25
Some generated files are not rendered by default.
