This PR adds a lambda function to watch termination events.
- Log instance information for termination warnings
- Optionally create a metric with dimensions for the environment and instance type.
This PR is limited to checking the termination warning. Later we can extend it to also act on terminations.
## Testing
Spot termination can be tested by initiating a termination event via the Spot Request overview (or CLI).
## Todo
- [x] Write docs
- [x] Add to multi runner
- [ ] Describe next steps in an issue.
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: forest-pr|bot <forest-pr[bot]@users.noreply.github.com>
@@ -163,6 +164,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
| <a name="input_instance_max_spot_price"></a> [instance\_max\_spot\_price](#input\_instance\_max\_spot\_price) | Max price for spot instances per hour. This variable will be passed to the create fleet as max spot price for the fleet. | `string` | `null` | no |
| <a name="input_instance_profile_path"></a> [instance\_profile\_path](#input\_instance\_profile\_path) | The path that will be added to the instance\_profile, if not set the environment name will be used. | `string` | `null` | no |
| <a name="input_instance_target_capacity_type"></a> [instance\_target\_capacity\_type](#input\_instance\_target\_capacity\_type) | Default lifecycle used for runner instances, can be either `spot` or `on-demand`. | `string` | `"spot"` | no |
| <a name="input_instance_termination_watcher"></a> [instance\_termination\_watcher](#input\_instance\_termination\_watcher) | Configuration for the instance termination watcher. This feature is in beta; changes will not trigger a major release as long as it is in beta.<br><br>`enable`: Enable or disable the spot termination watcher.<br>`enable_metric`: Enable or disable the metrics for the spot termination watcher.<br>`memory_size`: Memory size limit in MB of the lambda.<br>`s3_key`: S3 key for syncer lambda function. Required if using S3 bucket to specify lambdas.<br>`s3_object_version`: S3 object version for syncer lambda function. Useful if S3 versioning is enabled on source bucket.<br>`timeout`: Timeout of the lambda in seconds.<br>`zip`: File location of the lambda zip file. | <pre>object({<br> enable = optional(bool, false)<br> enable_metric = optional(object({<br> spot_warning = optional(bool, false)<br> }))<br> memory_size = optional(number, null)<br> s3_key = optional(string, null)<br> s3_object_version = optional(string, null)<br> timeout = optional(number, null)<br> zip = optional(string, null)<br> })</pre> | `{}` | no |
| <a name="input_instance_types"></a> [instance\_types](#input\_instance\_types) | List of instance types for the action runner. Defaults are based on runner\_os (al2023 for linux and Windows Server Core for win). | `list(string)` | <pre>[<br> "m5.large",<br> "c5.large"<br>]</pre> | no |
| <a name="input_job_queue_retention_in_seconds"></a> [job\_queue\_retention\_in\_seconds](#input\_job\_queue\_retention\_in\_seconds) | The number of seconds the job is held in the queue before it is purged. | `number` | `86400` | no |
| <a name="input_key_name"></a> [key\_name](#input\_key\_name) | Key pair name | `string` | `null` | no |
@@ -177,6 +179,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
| <a name="input_log_level"></a> [log\_level](#input\_log\_level) | Logging level for lambda logging. Valid values are 'silly', 'trace', 'debug', 'info', 'warn', 'error', 'fatal'. | `string` | `"info"` | no |
| <a name="input_logging_kms_key_id"></a> [logging\_kms\_key\_id](#input\_logging\_kms\_key\_id) | Specifies the kms key id to encrypt the logs with. | `string` | `null` | no |
| <a name="input_logging_retention_in_days"></a> [logging\_retention\_in\_days](#input\_logging\_retention\_in\_days) | Specifies the number of days you want to retain log events for the lambda log group. Possible values are: 0, 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, and 3653. | `number` | `180` | no |
| <a name="input_metrics_namespace"></a> [metrics\_namespace](#input\_metrics\_namespace) | The namespace for the metrics created by the module. Metrics will only be created if explicitly enabled. | `string` | `"GitHub Runners"` | no |
| <a name="input_minimum_running_time_in_minutes"></a> [minimum\_running\_time\_in\_minutes](#input\_minimum\_running\_time\_in\_minutes) | The minimum time an ec2 action runner should be running before it is terminated, if not busy. | `number` | `null` | no |
| <a name="input_pool_config"></a> [pool\_config](#input\_pool\_config) | The configuration for updating the pool. The `pool_size` to adjust to by the events triggered by the `schedule_expression`. For example you can configure a cron expression for weekdays to adjust the pool to 10 and another expression for the weekend to adjust the pool to 1. | <pre>list(object({<br> schedule_expression = string<br> size = number<br> }))</pre> | `[]` | no |
| <a name="input_pool_lambda_memory_size"></a> [pool\_lambda\_memory\_size](#input\_pool\_lambda\_memory\_size) | Memory size limit for scale-up lambda. | `number` | `512` | no |
@@ -248,6 +251,7 @@ Talk to the forestkeepers in the `runners-channel` on Slack.
docs/configuration.md
@@ -175,6 +175,11 @@ This tracing config generates timelines for following events:
This feature has been disabled by default.

### Multiple runner modules in your AWS account

The watcher will act on all spot termination notifications and log all the ones relevant to the runner module. Therefore, we suggest deploying the watcher only once. You can either deploy the watcher by enabling it in one of your deployments or deploy the watcher as a stand-alone module.

## Debugging

In case the setup does not work as intended, trace the events through this sequence:
@@ -187,6 +192,38 @@ In case the setup does not work as intended, trace the events through this seque
## Experimental features

### Termination watcher

This feature is in an early stage and therefore disabled by default.

The termination watcher currently watches for spot termination notifications. By default, when the module is deployed as part of one of the main modules (root or multi-runner), it only takes events into account for instances tagged with `ghr:environment`. The module can also be deployed stand-alone; in that case the tag filter needs to be tuned.

- Logs: The module will log all termination notifications. For each warning it will look up the instance details and log the environment, the instance type, and the time the instance has been running, as well as some other details.
- Metrics: Metrics are disabled by default to avoid costs. Once enabled, a metric will be created for each warning with at least dimensions for the environment and instance type. The metric namespace can be configured via the variables. The metric name used is `SpotInterruptionWarning`.
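
As a minimal sketch, the watcher and its metric could be switched on with arguments like the ones below, passed to the root module. The attribute names are taken from the `instance_termination_watcher` and `metrics_namespace` inputs documented in the README. It is an assumption that `enable_metric.spot_warning` is the switch for the `SpotInterruptionWarning` metric, and all other module inputs are left out.

```hcl
# Fragment of the arguments for the root module block; other inputs omitted.
instance_termination_watcher = {
  # Deploy the watcher lambda (disabled by default).
  enable = true

  # Assumption: this flag enables the metric created for each spot warning.
  enable_metric = {
    spot_warning = true
  }
}

# Namespace for metrics created by the module; this is the documented default.
metrics_namespace = "GitHub Runners"
```

Since the watcher acts on all spot termination notifications in the account, a fragment like this would be set in only one deployment.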

#### Log example

Below is an example of the log messages created.

```
{
  "level": "INFO",
  "message": "Received spot notification warning:",
  "environment": "default",
  "instanceId": "i-0039b8826b3dcea55",
  "instanceType": "c5.large",
  "instanceLaunchTime": "2024-03-15T08:10:34.000Z",
  "instanceRunningTimeInSeconds": 68,
  "tags": [
    {
      "Key": "ghr:environment",
      "Value": "default"
    }
    ... all tags ...
  ]
}
```

### Queue to publish workflow job events

This queue is an experimental feature to allow you to receive a copy of the `workflow_job` events sent by the GitHub App. This can be used to calculate a matrix or monitor the system.
docs/index.md
@@ -64,6 +64,11 @@ The control plane (scale up lambda) will store the runner registration configura
The AMI cleaner is a lambda that will clean up AMIs that are older than a configurable number of days. This is useful when using the AMI builder to create AMIs. The cleaner will also check which AMIs are used in the latest version of the launch template. You can also provide SSM config paths pointing to AMI IDs; the cleaner will not delete these AMIs. The AMI cleaner is opt-in; it will not be created by default.

### Instance Termination Watcher

> This feature is in beta; changes will not trigger a major release as long as it is in beta.

The Instance Termination Watcher creates logs and optional metrics for instance terminations. Currently only spot termination warnings are watched. See [configuration](configuration/) for more details.