This repository was archived by the owner on Jan 16, 2025. It is now read-only.

Commit 8197432

npalm, axel3rd, and ScottGuymer authored
feat: Add scheduled / pull based scaling for org level runners (#1577)
- Add opt-in runner pool for org level
- Refactor unit tests for runners lambda
- Update VSCode recommendations

Co-authored-by: Alix Lourme <[email protected]>
Co-authored-by: Scott Guymer <[email protected]>
1 parent fbd7241 commit 8197432

34 files changed: +1962 -1154 lines changed

Diff for: .vscode/extensions.json

+3 -2
@@ -5,6 +5,7 @@
     // Extension identifier format: ${publisher}.${name}. Example: vscode.csharp
     "editorconfig.editorconfig",
     "yzhang.markdown-all-in-one",
-    "mauve.terraform"
+    "sonarsource.sonarlint-vscode",
+    "hashicorp.terraform"
   ]
-}
+}

Diff for: .vscode/settings.json

+7
@@ -0,0 +1,7 @@
+{
+  "sonarlint.rules": {
+    "javascript:S4123": {
+      "level": "off"
+    }
+  }
+}

Diff for: README.md

+39 -16
@@ -22,6 +22,7 @@ This [Terraform](https://www.terraform.io/) module creates the required infrastr
 - [Option 2: App](#option-2-app)
 - [Install app](#install-app)
 - [Encryption](#encryption)
+- [Pool](#pool)
 - [Idle runners](#idle-runners)
 - [Ephemeral runners](#ephemeral-runners)
 - [Prebuilt Images](#prebuilt-images)
@@ -87,7 +88,7 @@ To be able to support a number of use-cases the module has quite a lot configura
 - Linux vs Windows. You can configure the OS types linux and win. Linux will be used by default.
 - Re-use vs Ephemeral. By default runners are re-used until detected idle; once idle they will be removed from the pool. To improve security we are introducing ephemeral runners. Those runners are only used for one job. Ephemeral runners only work in combination with the workflow job event. We also suggest using a pre-built AMI to improve the start time of jobs.
 - GitHub cloud vs GitHub enterprise server (GHES). The runner supports GitHub cloud as well as GitHub enterprise server. For GHES we rely on our community to test and support. We have no possibility to test ourselves on GHES.
-- Spot vs on-demand. The runners using either the EC2 spot or on-demand life cycle. Runners will be created via the AWS [CreateFleet API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_CreateFleet.html). The module (scale up lambda) will request an instance via the create fleet API in one of the subnets and matching one of the specified instance types.
+- Spot vs on-demand. The runners use either the EC2 spot or on-demand life cycle. Runners will be created via the AWS [CreateFleet API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_CreateFleet.html). The module (scale up lambda) will request, via the create fleet API, an instance in one of the subnets that matches one of the specified instance types.
 
 
 #### ARM64 support via Graviton/Graviton2 instance-types
@@ -251,6 +252,22 @@ module "runners" {
 
 ```
 
+### Pool
+
+The module supports two options for keeping a pool of runners. One is via a pool, which only supports org-level runners; the second option is [keeping runners idle](#idle-runners).
+
+The pool is introduced in combination with the ephemeral runners and is primarily meant to ensure that, if an event is unexpectedly dropped and no runner was created, the pool can pick up the job. The pool is maintained by a lambda. Each time the lambda is triggered, a check is performed to see whether the number of idle runners managed by the module meets the expected pool size. If not, the pool will be adjusted. Keep in mind that the scale down function is still active and will terminate instances that are detected as idle for too long.
+
+```hcl
+pool_runner_owner = "my-org" # Org to which the runners are added
+pool_config = [{
+  size = 20 # size of the pool
+  schedule_expression = "cron(* * * * ? *)" # cron expression to trigger the adjustment of the pool
+}]
+```
+
+The pool is NOT enabled by default and can be enabled by adding at least one object to the pool config list. The [ephemeral example](./examples/ephemeral/README.md) contains a configuration option (commented out).
+
 ### Idle runners
 
 The module will scale down to zero runners by default. By specifying an `idle_config`, idle runners can be kept active. The scale down lambda checks if any of the cron expressions matches the current time with a margin of 5 seconds. When there is a match, the number of runners specified in the idle config will be kept active. In case multiple cron expressions match, only the first one is taken into account. Below is an idle configuration for keeping runners active from 9 to 5 on working days.
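
The "first match wins" rule from the idle runners paragraph above can be sketched as follows. This is a minimal illustration and not the module's scale-down code: `IdleRule`, `matches`, `idleCount`, and `idleTarget` are hypothetical names, and cron matching is replaced by a plain predicate so the example needs no cron library.

```typescript
// Hypothetical rule shape: `matches` stands in for "the cron expression
// matches the current time", `idleCount` for the runners to keep active.
interface IdleRule {
  matches: (now: Date) => boolean;
  idleCount: number;
}

// Returns the number of idle runners to keep. Only the first matching rule
// counts; with no match the target is zero (scale down to zero by default).
function idleTarget(rules: IdleRule[], now: Date): number {
  const first = rules.find((rule) => rule.matches(now));
  return first ? first.idleCount : 0;
}

// Example rules (assumed values): office hours on working days in UTC,
// followed by a catch-all that is only reached when the first rule misses.
const officeHours: IdleRule = {
  matches: (now) =>
    now.getUTCDay() >= 1 && now.getUTCDay() <= 5 && now.getUTCHours() >= 9 && now.getUTCHours() < 17,
  idleCount: 2,
};
const always: IdleRule = { matches: () => true, idleCount: 1 };
```

Ordering the rules from most specific to most general keeps the outcome predictable, since later rules are never consulted once an earlier one matches.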
@@ -265,20 +282,6 @@ idle_config = [{
 
 _**Note**_: When using Windows runners it's recommended to keep a few runners warmed up due to the minutes-long cold start time.
 
-### Ephemeral runners
-
-Currently a beta feature! You can configure runners to be ephemeral, runners will be used only for one job. The feature should be used in conjunction with listening for the workflow job event. Please consider the following:
-
-- The scale down lambda is still active, and should only remove orphan instances. But there is no strict check in place. So ensure you configure the `minimum_running_time_in_minutes` to a value that is high enough to got your runner booted and connected to avoid it got terminated before executing a job.
-- The messages sent from the webhook lambda to scale-up lambda are by default delayed delayed by SQS, to give available runners to option to start the job before the decision is made to scale more runners. For ephemeral runners there is no need to wait. Set `delay_webhook_event` to `0`.
-- To ensure runners are created in the same order GitHub sends the events we use by default a FIFO queue, this is mainly relevant for repo level runners. For ephemeral runners you can set `fifo_build_queue` to `false`.
-- Error related to scaling should be retried via SQS. You can configure `job_queue_retention_in_seconds` `redrive_build_queue` to tune the behavior. We have no mechanism to avoid events will never processed, which means potential no runner could be created and the job in GitHub can time out in 6 hours.
-
-The example for [ephemeral runners](./examples/ephemeral) is based on the [default example](./examples/default). Have look on the diff to see the major configuration differences.
-
-### Prebuilt Images
-
-This module also allows you to run agents from a prebuilt AMI to gain faster startup times. You can find more information in [the image README.md](/images/README.md)
 
 #### Supported config <!-- omit in toc -->
 
@@ -298,6 +301,22 @@ Cron expressions are parsed by [cron-parser](https://github.com/harrisiirak/cron
 
 For time zones please check [TZ database name column](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) for the supported values.
 
+### Ephemeral runners
+
+Currently a beta feature! You can configure runners to be ephemeral; such runners will be used for only one job. The feature should be used in conjunction with listening for the workflow job event. Please consider the following:
+
+- The scale down lambda is still active, and should only remove orphan instances. But there is no strict check in place. So ensure you configure the `minimum_running_time_in_minutes` to a value that is high enough to get your runner booted and connected, to avoid it being terminated before executing a job.
+- The messages sent from the webhook lambda to the scale-up lambda are by default delayed by SQS, to give available runners the option to start the job before the decision is made to scale more runners. For ephemeral runners there is no need to wait. Set `delay_webhook_event` to `0`.
+- To ensure runners are created in the same order GitHub sends the events, we use a FIFO queue by default; this is mainly relevant for repo level runners. For ephemeral runners you can set `fifo_build_queue` to `false`.
+- Errors related to scaling should be retried via SQS. You can configure `job_queue_retention_in_seconds` and `redrive_build_queue` to tune the behavior. We have no mechanism to guarantee that every event is processed, which means that potentially no runner is created and the job in GitHub can time out in 6 hours.
+
+The example for [ephemeral runners](./examples/ephemeral) is based on the [default example](./examples/default). Have a look at the diff to see the major configuration differences.
+
+### Prebuilt Images
+
+This module also allows you to run agents from a prebuilt AMI to gain faster startup times. You can find more information in [the image README.md](/images/README.md).
+
+
 ## Examples
 
 Examples are located in the [examples](./examples) directory. The following examples are provided:
@@ -326,7 +345,7 @@ The following sub modules are optional and are provided as example or utility:
 
 ### ARM64 configuration for submodules
 
-When using the top-level module configure `runner_architecture = arm64` and ensure the list of `instance_types` matches. When not using the top-level ensure the bot properties are set on the submodules.
+When using the top-level module, configure `runner_architecture = arm64` and ensure the list of `instance_types` matches. When not using the top-level module, ensure both properties are set on the submodules.
 
 ## Debugging
 
@@ -411,6 +430,10 @@ In case the setup does not work as intended follow the trace of events:
 | <a name="input_logging_retention_in_days"></a> [logging\_retention\_in\_days](#input\_logging\_retention\_in\_days) | Specifies the number of days you want to retain log events for the lambda log group. Possible values are: 0, 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, and 3653. | `number` | `180` | no |
 | <a name="input_market_options"></a> [market\_options](#input\_market\_options) | DEPRECATED: Replaced by `instance_target_capacity_type`. | `string` | `null` | no |
 | <a name="input_minimum_running_time_in_minutes"></a> [minimum\_running\_time\_in\_minutes](#input\_minimum\_running\_time\_in\_minutes) | The time an ec2 action runner should be running at minimum before terminated if not busy. | `number` | `null` | no |
+| <a name="input_pool_config"></a> [pool\_config](#input\_pool\_config) | The configuration for updating the pool. The pool is adjusted to `size` by the events triggered by the `schedule_expression`. For example, you can configure a cron expression for weekdays to adjust the pool to 10 and another expression for the weekend to adjust the pool to 1. | <pre>list(object({<br> schedule_expression = string<br> size = number<br> }))</pre> | `[]` | no |
+| <a name="input_pool_lambda_reserved_concurrent_executions"></a> [pool\_lambda\_reserved\_concurrent\_executions](#input\_pool\_lambda\_reserved\_concurrent\_executions) | Amount of reserved concurrent executions for the scale-up lambda function. A value of 0 disables lambda from being triggered and -1 removes any concurrency limitations. | `number` | `1` | no |
+| <a name="input_pool_lambda_timeout"></a> [pool\_lambda\_timeout](#input\_pool\_lambda\_timeout) | Time out for the pool lambda in seconds. | `number` | `60` | no |
+| <a name="input_pool_runner_owner"></a> [pool\_runner\_owner](#input\_pool\_runner\_owner) | The pool will deploy runners to the GitHub org ID, set this value to the org to which you want the runners deployed. Repo level is not supported. | `string` | `null` | no |
 | <a name="input_redrive_build_queue"></a> [redrive\_build\_queue](#input\_redrive\_build\_queue) | Set options to attach (optional) a dead letter queue to the build queue, the queue between the webhook and the scale up lambda. You have the following options. 1. Disable by setting `enabled` to `false`. 2. Enable by setting `enabled` to `true` and `maxReceiveCount` to a number of max retries. | <pre>object({<br> enabled = bool<br> maxReceiveCount = number<br> })</pre> | <pre>{<br> "enabled": false,<br> "maxReceiveCount": null<br>}</pre> | no |
 | <a name="input_repository_white_list"></a> [repository\_white\_list](#input\_repository\_white\_list) | List of repositories allowed to use the github app | `list(string)` | `[]` | no |
 | <a name="input_role_path"></a> [role\_path](#input\_role\_path) | The path that will be added to role path for created roles, if not set the environment name will be used. | `string` | `null` | no |
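
The pool check described in the Pool section of the README diff above (compare the number of idle runners with the configured size and adjust when short) can be sketched like this. It is an illustration under assumptions, not the pool lambda from this commit: `PoolRunner`, `busy`, and `runnersToAdd` are hypothetical names, and the sketch assumes the pool only tops up and leaves removal of surplus runners to the scale-down function.

```typescript
// Hypothetical, simplified view of a runner; the field names are assumptions.
interface PoolRunner {
  instanceId: string;
  busy: boolean;
}

// Number of runners to add so the idle count reaches the configured pool size.
// Never negative: a pool that is over size is left to the scale-down function.
function runnersToAdd(poolSize: number, runners: PoolRunner[]): number {
  const idle = runners.filter((runner) => !runner.busy).length;
  return Math.max(0, poolSize - idle);
}
```

With a `size` of 20 and 5 idle runners, the function returns 15; busy runners do not count towards the pool because they cannot pick up a new job.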

Diff for: examples/ephemeral/main.tf

+10 -3
@@ -57,10 +57,17 @@ module "runners" {
 
   enable_ephemeral_runners = true
 
+  # # Example of simple pool usages
+  # pool_runner_owner = "my-org"
+  # pool_config = [{
+  #   size = 20
+  #   schedule_expression = "cron(* * * * ? *)"
+  # }]
+
   # configure your pre-built AMI
-  enabled_userdata = false
-  ami_filter = { name = ["github-runner-amzn2-x86_64-2021*"] }
-  ami_owners = [data.aws_caller_identity.current.account_id]
+  # enabled_userdata = false
+  # ami_filter = { name = ["github-runner-amzn2-x86_64-2021*"] }
+  # ami_owners = [data.aws_caller_identity.current.account_id]
 
   # Enable logging
   log_level = "debug"

Diff for: main.tf

+5
@@ -159,6 +159,11 @@ module "runners" {
 
   log_type = var.log_type
   log_level = var.log_level
+
+  pool_config = var.pool_config
+  pool_lambda_timeout = var.pool_lambda_timeout
+  pool_runner_owner = var.pool_runner_owner
+  pool_lambda_reserved_concurrent_executions = var.pool_lambda_reserved_concurrent_executions
 }
 
 module "runner_binaries" {

Diff for: modules/runners/README.md

+7 -1
@@ -63,7 +63,9 @@ yarn run dist
 
 ## Modules
 
-No modules.
+| Name | Source | Version |
+|------|--------|---------|
+| <a name="module_pool"></a> [pool](#module\_pool) | ./pool | n/a |
 
 ## Resources
 
@@ -149,6 +151,10 @@ No modules.
 | <a name="input_metadata_options"></a> [metadata\_options](#input\_metadata\_options) | Metadata options for the ec2 runner instances. | `map(any)` | <pre>{<br> "http_endpoint": "enabled",<br> "http_put_response_hop_limit": 1,<br> "http_tokens": "optional"<br>}</pre> | no |
 | <a name="input_minimum_running_time_in_minutes"></a> [minimum\_running\_time\_in\_minutes](#input\_minimum\_running\_time\_in\_minutes) | The time an ec2 action runner should be running at minimum before terminated if non busy. If not set the default is calculated based on the OS. | `number` | `null` | no |
 | <a name="input_overrides"></a> [overrides](#input\_overrides) | This map provides the possibility to override some defaults. The following attributes are supported: `name_sg` overrides the `Name` tag for all security groups created by this module. `name_runner_agent_instance` overrides the `Name` tag for the ec2 instance defined in the auto launch configuration. `name_docker_machine_runners` overrides the `Name` tag spot instances created by the runner agent. | `map(string)` | <pre>{<br> "name_runner": "",<br> "name_sg": ""<br>}</pre> | no |
+| <a name="input_pool_config"></a> [pool\_config](#input\_pool\_config) | The configuration for updating the pool. The pool is adjusted to `size` by the events triggered by the `schedule_expression`. For example, you can configure a cron expression for weekdays to adjust the pool to 10 and another expression for the weekend to adjust the pool to 1. | <pre>list(object({<br> schedule_expression = string<br> size = number<br> }))</pre> | `[]` | no |
+| <a name="input_pool_lambda_reserved_concurrent_executions"></a> [pool\_lambda\_reserved\_concurrent\_executions](#input\_pool\_lambda\_reserved\_concurrent\_executions) | Amount of reserved concurrent executions for the scale-up lambda function. A value of 0 disables lambda from being triggered and -1 removes any concurrency limitations. | `number` | `1` | no |
+| <a name="input_pool_lambda_timeout"></a> [pool\_lambda\_timeout](#input\_pool\_lambda\_timeout) | Time out for the pool lambda in seconds. | `number` | `60` | no |
+| <a name="input_pool_runner_owner"></a> [pool\_runner\_owner](#input\_pool\_runner\_owner) | The pool will deploy runners to the GitHub org ID, set this value to the org to which you want the runners deployed. Repo level is not supported. | `string` | `null` | no |
 | <a name="input_role_path"></a> [role\_path](#input\_role\_path) | The path that will be added to the role; if not set, the environment name will be used. | `string` | `null` | no |
 | <a name="input_role_permissions_boundary"></a> [role\_permissions\_boundary](#input\_role\_permissions\_boundary) | Permissions boundary that will be added to the created role for the lambda. | `string` | `null` | no |
 | <a name="input_runner_additional_security_group_ids"></a> [runner\_additional\_security\_group\_ids](#input\_runner\_additional\_security\_group\_ids) | (optional) List of additional security groups IDs to apply to the runner | `list(string)` | `[]` | no |

Diff for: modules/runners/lambdas/runners/.prettierrc

+7 -1
@@ -3,4 +3,10 @@
   "singleQuote": true,
   "trailingComma": "all",
   "semi": true,
-}
+  "importOrderSeparation": true,
+  "importOrderSortSpecifiers": true,
+  "importOrder": [
+    "<THIRD_PARTY_MODULES>",
+    "^[./]"
+  ]
+}

Diff for: modules/runners/lambdas/runners/package.json

+2
@@ -16,6 +16,7 @@
     "all": "yarn build && yarn format && yarn lint && yarn test"
   },
   "devDependencies": {
+    "@trivago/prettier-plugin-sort-imports": "^3.1.1",
     "@types/aws-lambda": "^8.10.89",
     "@types/express": "^4.17.11",
     "@types/jest": "^27.4.0",
@@ -25,6 +26,7 @@
     "eslint": "^7.32.0",
     "eslint-plugin-prettier": "4.0.0",
     "jest": "27.4.5",
+    "jest-mock": "^27.4.6",
     "jest-mock-extended": "^2.0.1",
     "moment-timezone": "^0.5.34",
     "nock": "^13.2.1",

Diff for: modules/runners/lambdas/runners/src/aws/runners.test.ts

+2 -1
@@ -1,6 +1,7 @@
 import { EC2 } from 'aws-sdk';
-import { listEC2Runners, createRunner, terminateRunner, RunnerInfo, RunnerInputParameters } from './runners';
+
 import ScaleError from './../scale-runners/ScaleError';
+import { RunnerInfo, RunnerInputParameters, createRunner, listEC2Runners, terminateRunner } from './runners';
 
 const mockEC2 = { describeInstances: jest.fn(), createFleet: jest.fn(), terminateInstances: jest.fn() };
 const mockSSM = { putParameter: jest.fn() };

Diff for: modules/runners/lambdas/runners/src/aws/runners.ts

+22 -3
@@ -1,5 +1,6 @@
 import { EC2, SSM } from 'aws-sdk';
-import { logger as rootLogger, LogFields } from '../logger';
+
+import { LogFields, logger as rootLogger } from '../logger';
 import ScaleError from './../scale-runners/ScaleError';
 
 const logger = rootLogger.getChildLogger({ name: 'runners' });
@@ -24,6 +25,7 @@ export interface ListRunnerFilters {
   runnerType?: 'Org' | 'Repo';
   runnerOwner?: string;
   environment?: string;
+  statuses?: string[];
 }
 
 export interface RunnerInputParameters {
@@ -43,11 +45,13 @@ export interface RunnerInputParameters {
 }
 
 export async function listEC2Runners(filters: ListRunnerFilters | undefined = undefined): Promise<RunnerList[]> {
+  const ec2Statuses = filters?.statuses ? filters.statuses : ['running', 'pending'];
   const ec2 = new EC2();
   const ec2Filters = [
     { Name: 'tag:Application', Values: ['github-action-runner'] },
-    { Name: 'instance-state-name', Values: ['running', 'pending'] },
+    { Name: 'instance-state-name', Values: ec2Statuses },
   ];
+
   if (filters) {
     if (filters.environment !== undefined) {
       ec2Filters.push({ Name: 'tag:Environment', Values: [filters.environment] });
@@ -57,7 +61,22 @@ export async function listEC2Runners(filters: ListRunnerFilters | undefined = un
       ec2Filters.push({ Name: `tag:Owner`, Values: [filters.runnerOwner] });
     }
   }
-  const runningInstances = await ec2.describeInstances({ Filters: ec2Filters }).promise();
+
+  const runners: RunnerList[] = [];
+  let nextToken;
+  let hasNext = true;
+  while (hasNext) {
+    const runningInstances: EC2.DescribeInstancesResult = await ec2
+      .describeInstances({ Filters: ec2Filters, NextToken: nextToken })
+      .promise();
+    hasNext = runningInstances.NextToken ? true : false;
+    nextToken = runningInstances.NextToken;
+    runners.push(...getRunnerInfo(runningInstances));
+  }
+  return runners;
+}
+
+function getRunnerInfo(runningInstances: EC2.DescribeInstancesResult) {
   const runners: RunnerList[] = [];
   if (runningInstances.Reservations) {
     for (const r of runningInstances.Reservations) {
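
The `NextToken` loop added to `listEC2Runners` above follows a common pagination pattern: keep requesting pages until the response no longer carries a token. The sketch below isolates that pattern with an in-memory stand-in instead of the AWS SDK. `Page`, `collectAllPages`, and `fakeClient` are hypothetical names used for illustration only and are not part of the commit.

```typescript
// Hypothetical page shape, loosely modelled on a paginated describe call.
interface Page<T> {
  items: T[];
  nextToken?: string;
}

// Collects every item by following nextToken until no token is returned.
async function collectAllPages<T>(fetchPage: (nextToken?: string) => Promise<Page<T>>): Promise<T[]> {
  const all: T[] = [];
  let nextToken: string | undefined = undefined;
  do {
    const page: Page<T> = await fetchPage(nextToken);
    all.push(...page.items);
    nextToken = page.nextToken;
  } while (nextToken);
  return all;
}

// In-memory stand-in for a paginated API, used only to exercise the loop.
// The token is simply the index of the next page, encoded as a string.
function fakeClient(pages: string[][]): (nextToken?: string) => Promise<Page<string>> {
  return async (nextToken?: string) => {
    const index = nextToken ? Number(nextToken) : 0;
    const hasMore = index + 1 < pages.length;
    return { items: pages[index] ?? [], nextToken: hasMore ? String(index + 1) : undefined };
  };
}
```

Without such a loop, a single call returns only the first page, so an environment with many instances could be under-counted when the pool or scale-down logic lists runners.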

Diff for: modules/runners/lambdas/runners/src/aws/ssm.test.ts

+2 -1
@@ -1,6 +1,7 @@
+import { GetParameterCommandOutput, SSM } from '@aws-sdk/client-ssm';
 import nock from 'nock';
+
 import { getParameterValue } from './ssm';
-import { SSM, GetParameterCommandOutput } from '@aws-sdk/client-ssm';
 
 jest.mock('@aws-sdk/client-ssm');
 
