This repository was archived by the owner on Jan 16, 2025. It is now read-only.

Commit 7eb0bda

feat: Add option for ephemeral to check builds status before scaling (#1854)
1 parent d1d1c84 commit 7eb0bda

10 files changed: +44 -3 lines changed

Diff for: README.md

+4 -2
@@ -304,10 +304,11 @@ For time zones please check [TZ database name column](https://en.wikipedia.org/w
 Currently a beta feature! You can configure runners to be ephemeral; runners will be used only for one job. The feature should be used in conjunction with listening for the workflow job event. Please consider the following:
 
 - The scale down lambda is still active, and should only remove orphan instances. But there is no strict check in place. So ensure you configure the `minimum_running_time_in_minutes` to a value that is high enough to get your runner booted and connected, so that it is not terminated before executing a job.
-- The messages sent from the webhook lambda to the scale-up lambda are by default delayed by SQS, to give available runners the option to start the job before the decision is made to scale more runners. For ephemeral runners there is no need to wait. Set `delay_webhook_event` to `0`.
+- The messages sent from the webhook lambda to the scale-up lambda are by default delayed by SQS, to give available runners the option to start the job before the decision is made to scale more runners. For ephemeral runners there is no need to wait. Set `delay_webhook_event` to `0`.
+- All events on the queue will lead to a new runner created by the lambda. By setting `enable_job_queued_check` to `true` you can ensure a runner is only created if the event has a correlated queued job. Setting this can avoid creating useless runners, for example when jobs are cancelled before a runner is created. We suggest using this in combination with a pool.
 - To ensure runners are created in the same order GitHub sends the events we use by default a FIFO queue, this is mainly relevant for repo level runners. For ephemeral runners you can set `fifo_build_queue` to `false`.
 - Errors related to scaling should be retried via SQS. You can configure `job_queue_retention_in_seconds` and `redrive_build_queue` to tune the behavior. We have no mechanism to prevent events from never being processed, which means potentially no runner is created and the job in GitHub can time out in 6 hours.
-
+
 The example for [ephemeral runners](./examples/ephemeral) is based on the [default example](./examples/default). Have a look at the diff to see the major configuration differences.
 
 ### Prebuilt Images
@@ -407,6 +408,7 @@ In case the setup does not work as intended follow the trace of events:
 | <a name="input_disable_runner_autoupdate"></a> [disable\_runner\_autoupdate](#input\_disable\_runner\_autoupdate) | Disable the auto update of the github runner agent. Be aware there is a grace period of 30 days, see also the [GitHub article](https://github.blog/changelog/2022-02-01-github-actions-self-hosted-runners-can-now-disable-automatic-updates/) | `bool` | `false` | no |
 | <a name="input_enable_cloudwatch_agent"></a> [enable\_cloudwatch\_agent](#input\_enable\_cloudwatch\_agent) | Enabling the cloudwatch agent on the ec2 runner instances, the runner contains default config. Configuration can be overridden via `cloudwatch_config`. | `bool` | `true` | no |
 | <a name="input_enable_ephemeral_runners"></a> [enable\_ephemeral\_runners](#input\_enable\_ephemeral\_runners) | Enable ephemeral runners, runners will only be used once. | `bool` | `false` | no |
+| <a name="input_enable_job_queued_check"></a> [enable\_job\_queued\_check](#input\_enable\_job\_queued\_check) | Only scale if the job event received by the scale up lambda is in the state queued. By default enabled for non ephemeral runners and disabled for ephemeral. Set this variable to overwrite the default behavior. | `bool` | `null` | no |
 | <a name="input_enable_managed_runner_security_group"></a> [enable\_managed\_runner\_security\_group](#input\_enable\_managed\_runner\_security\_group) | Enabling the default managed security group creation. Unmanaged security groups can be specified via `runner_additional_security_group_ids`. | `bool` | `true` | no |
 | <a name="input_enable_organization_runners"></a> [enable\_organization\_runners](#input\_enable\_organization\_runners) | Register runners to organization, instead of repo level | `bool` | `false` | no |
 | <a name="input_enable_ssm_on_runners"></a> [enable\_ssm\_on\_runners](#input\_enable\_ssm\_on\_runners) | Enable to allow access the runner instances for debugging purposes via SSM. Note that this adds additional permissions to the runner instances. | `bool` | `false` | no |
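
To illustrate the `enable_job_queued_check` bullet in the README change, here is a minimal TypeScript sketch of how the flag changes which queued events end up with a runner. The `QueueEvent` type, its `currentStatus` field, and `jobsThatGetARunner` are hypothetical names used only for illustration. The actual lambda looks up the job state per event rather than filtering a list.

```typescript
// Hypothetical model of an event waiting on the build queue, together with the
// state the job has on GitHub by the time the scale-up lambda handles it.
type JobStatus = 'queued' | 'in_progress' | 'completed';

interface QueueEvent {
  jobId: number;
  currentStatus: JobStatus;
}

// With the check disabled every event yields a runner; with the check enabled
// only events whose job is still queued do.
function jobsThatGetARunner(events: QueueEvent[], enableJobQueuedCheck: boolean): number[] {
  return events
    .filter((event) => !enableJobQueuedCheck || event.currentStatus === 'queued')
    .map((event) => event.jobId);
}
```

Under these assumptions, a job that was already picked up (for example by a pool runner) or cancelled before the event is handled would no longer trigger a new instance once the check is enabled.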

Diff for: examples/ephemeral/main.tf

+3
@@ -63,6 +63,9 @@ module "runners" {
   # size = 20
   # schedule_expression = "cron(* * * * ? *)"
   # }]
+  #
+  #
+  enable_job_queued_check = true
 
   # configure your pre-built AMI
   # enabled_userdata = false

Diff for: main.tf

+1
@@ -148,6 +148,7 @@ module "runners" {
   github_app_parameters = local.github_app_parameters
   enable_organization_runners = var.enable_organization_runners
   enable_ephemeral_runners = var.enable_ephemeral_runners
+  enable_job_queued_check = var.enable_job_queued_check
   disable_runner_autoupdate = var.disable_runner_autoupdate
   enable_managed_runner_security_group = var.enable_managed_runner_security_group
   scale_down_schedule_expression = var.scale_down_schedule_expression

Diff for: modules/runners/README.md

+1
@@ -124,6 +124,7 @@ yarn run dist
 | <a name="input_egress_rules"></a> [egress\_rules](#input\_egress\_rules) | List of egress rules for the GitHub runner instances. | <pre>list(object({<br> cidr_blocks = list(string)<br> ipv6_cidr_blocks = list(string)<br> prefix_list_ids = list(string)<br> from_port = number<br> protocol = string<br> security_groups = list(string)<br> self = bool<br> to_port = number<br> description = string<br> }))</pre> | <pre>[<br> {<br> "cidr_blocks": [<br> "0.0.0.0/0"<br> ],<br> "description": null,<br> "from_port": 0,<br> "ipv6_cidr_blocks": [<br> "::/0"<br> ],<br> "prefix_list_ids": null,<br> "protocol": "-1",<br> "security_groups": null,<br> "self": null,<br> "to_port": 0<br> }<br>]</pre> | no |
 | <a name="input_enable_cloudwatch_agent"></a> [enable\_cloudwatch\_agent](#input\_enable\_cloudwatch\_agent) | Enabling the cloudwatch agent on the ec2 runner instances, the runner contains default config. Configuration can be overridden via `cloudwatch_config`. | `bool` | `true` | no |
 | <a name="input_enable_ephemeral_runners"></a> [enable\_ephemeral\_runners](#input\_enable\_ephemeral\_runners) | Enable ephemeral runners, runners will only be used once. | `bool` | `false` | no |
+| <a name="input_enable_job_queued_check"></a> [enable\_job\_queued\_check](#input\_enable\_job\_queued\_check) | Only scale if the job event received by the scale up lambda is in the state queued. By default enabled for non ephemeral runners and disabled for ephemeral. Set this variable to overwrite the default behavior. | `bool` | `null` | no |
 | <a name="input_enable_managed_runner_security_group"></a> [enable\_managed\_runner\_security\_group](#input\_enable\_managed\_runner\_security\_group) | Enabling the default managed security group creation. Unmanaged security groups can be specified via `runner_additional_security_group_ids`. | `bool` | `true` | no |
 | <a name="input_enable_organization_runners"></a> [enable\_organization\_runners](#input\_enable\_organization\_runners) | n/a | `bool` | n/a | yes |
 | <a name="input_enable_ssm_on_runners"></a> [enable\_ssm\_on\_runners](#input\_enable\_ssm\_on\_runners) | Enable to allow access to the runner instances for debugging purposes via SSM. Note that this adds additional permissions to the runner instances. | `bool` | n/a | yes |

Diff for: modules/runners/lambdas/runners/src/scale-runners/scale-up.test.ts

+18
@@ -362,6 +362,12 @@ describe('scaleUp with public GH', () => {
     });
   });
 
+  it('not checking queued workflows', async () => {
+    process.env.ENABLE_JOB_QUEUED_CHECK = 'false';
+    await scaleUpModule.scaleUp('aws:sqs', TEST_DATA);
+    expect(mockOctokit.actions.getJobForWorkflowRun).not.toBeCalled();
+  });
+
   it('does not retrieve installation id if already set', async () => {
     const appSpy = jest.spyOn(ghAuth, 'createGithubAppAuth');
     const installationSpy = jest.spyOn(ghAuth, 'createGithubInstallationAuth');
@@ -535,6 +541,7 @@ describe('scaleUp with public GH', () => {
 
   it('ephemeral runners only run with workflow_job event, others should fail.', async () => {
     process.env.ENABLE_EPHEMERAL_RUNNERS = 'true';
+    process.env.ENABLE_JOB_QUEUED_CHECK = 'false';
     await expect(
       scaleUpModule.scaleUp('aws:sqs', {
         ...TEST_DATA,
@@ -545,7 +552,18 @@ describe('scaleUp with public GH', () => {
 
   it('creates a ephemeral runner.', async () => {
     process.env.ENABLE_EPHEMERAL_RUNNERS = 'true';
+    process.env.ENABLE_JOB_QUEUED_CHECK = 'false';
+    await scaleUpModule.scaleUp('aws:sqs', TEST_DATA);
+    expectedRunnerParams.runnerServiceConfig = [...expectedRunnerParams.runnerServiceConfig, `--ephemeral`];
+    expect(mockOctokit.actions.getJobForWorkflowRun).not.toBeCalled();
+    expect(createRunner).toBeCalledWith(expectedRunnerParams);
+  });
+
+  it('creates a ephemeral runner after checking job is queued.', async () => {
+    process.env.ENABLE_EPHEMERAL_RUNNERS = 'true';
+    process.env.ENABLE_JOB_QUEUED_CHECK = 'true';
     await scaleUpModule.scaleUp('aws:sqs', TEST_DATA);
+    expect(mockOctokit.actions.getJobForWorkflowRun).toBeCalled();
     expectedRunnerParams.runnerServiceConfig = [...expectedRunnerParams.runnerServiceConfig, `--ephemeral`];
     expect(createRunner).toBeCalledWith(expectedRunnerParams);
   });

Diff for: modules/runners/lambdas/runners/src/scale-runners/scale-up.ts

+2 -1
@@ -158,6 +158,7 @@ export async function scaleUp(eventSource: string, payload: ActionRequestMessage
   const launchTemplateName = process.env.LAUNCH_TEMPLATE_NAME;
   const instanceMaxSpotPrice = process.env.INSTANCE_MAX_SPOT_PRICE;
   const instanceAllocationStrategy = process.env.INSTANCE_ALLOCATION_STRATEGY || 'lowest-price'; // same as AWS default
+  const enableJobQueuedCheck = yn(process.env.ENABLE_JOB_QUEUED_CHECK, { default: true });
 
   if (ephemeralEnabled && payload.eventType !== 'workflow_job') {
     logger.warn(
@@ -190,7 +191,7 @@ export async function scaleUp(eventSource: string, payload: ActionRequestMessage
   const ghAuth = await createGithubInstallationAuth(installationId, ghesApiUrl);
   const githubInstallationClient = await createOctoClient(ghAuth.token, ghesApiUrl);
 
-  if (ephemeral || (await isJobQueued(githubInstallationClient, payload))) {
+  if (!enableJobQueuedCheck || (await isJobQueued(githubInstallationClient, payload))) {
     const currentRunners = await listEC2Runners({
       environment,
       runnerType,
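
The new condition in `scale-up.ts` relies on short-circuit evaluation: when the check is disabled, the job lookup is never made. The sketch below is a simplified, synchronous model of that behavior, not the module's code. `parseFlag` is a hypothetical stand-in for the `yn(..., { default: true })` call and accepts only a few spellings. `shouldScaleUp` and its callback are likewise assumed names, and the real `isJobQueued` is asynchronous and takes an Octokit client and the payload.

```typescript
// Hypothetical stand-in for the yn() call in the diff: parse a string flag and
// fall back to the given default when the value is missing or unrecognized.
function parseFlag(value: string | undefined, fallback: boolean): boolean {
  if (value === undefined) {
    return fallback;
  }
  switch (value.trim().toLowerCase()) {
    case 'true':
    case 'yes':
    case '1':
      return true;
    case 'false':
    case 'no':
    case '0':
      return false;
    default:
      return fallback;
  }
}

// Simplified version of the condition in the diff. The lookup callback is only
// invoked when the check is enabled, because || short-circuits.
function shouldScaleUp(flag: string | undefined, isJobQueued: () => boolean): boolean {
  const enableJobQueuedCheck = parseFlag(flag, true);
  return !enableJobQueuedCheck || isJobQueued();
}
```

In this model an unset flag behaves like `'true'`, which matches the `default: true` in the diff. This is also why the added tests set `ENABLE_JOB_QUEUED_CHECK` to `'false'` explicitly when they expect `getJobForWorkflowRun` not to be called.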

Diff for: modules/runners/main.tf

+2
@@ -35,6 +35,8 @@ locals {
   }
 
   ami_filter = coalesce(var.ami_filter, local.default_ami[var.runner_os])
+
+  enable_job_queued_check = var.enable_job_queued_check == null ? !var.enable_ephemeral_runners : var.enable_job_queued_check
 }
 
 data "aws_ami" "runner" {
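
The `enable_job_queued_check` local above resolves a three-state input (`null`, `true`, `false`) into a plain boolean. As a rough illustration, the same rule can be written in TypeScript. The function name `resolveJobQueuedCheck` is hypothetical and the snippet only mirrors the Terraform conditional shown in the diff.

```typescript
// Mirrors the Terraform conditional from the diff: null means "derive the
// value from the ephemeral setting", any explicit value is used as given.
function resolveJobQueuedCheck(
  enableJobQueuedCheck: boolean | null,
  enableEphemeralRunners: boolean,
): boolean {
  return enableJobQueuedCheck === null ? !enableEphemeralRunners : enableJobQueuedCheck;
}
```

With the variable left at its `null` default, ephemeral runners resolve to `false` (no check) and non-ephemeral runners to `true`, which matches the description in `variables.tf`. Setting the variable explicitly overrides both cases.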

Diff for: modules/runners/scale-up.tf

+1
@@ -17,6 +17,7 @@ resource "aws_lambda_function" "scale_up" {
     variables = {
       DISABLE_RUNNER_AUTOUPDATE = var.disable_runner_autoupdate
       ENABLE_EPHEMERAL_RUNNERS = var.enable_ephemeral_runners
+      ENABLE_JOB_QUEUED_CHECK = local.enable_job_queued_check
       ENABLE_ORGANIZATION_RUNNERS = var.enable_organization_runners
       ENVIRONMENT = var.environment
       GHES_URL = var.ghes_url

Diff for: modules/runners/variables.tf

+6
@@ -481,6 +481,12 @@ variable "enable_ephemeral_runners" {
   default = false
 }
 
+variable "enable_job_queued_check" {
+  description = "Only scale if the job event received by the scale up lambda is in the state queued. By default enabled for non ephemeral runners and disabled for ephemeral. Set this variable to overwrite the default behavior."
+  type = bool
+  default = null
+}
+
 variable "pool_lambda_timeout" {
   description = "Time out for the pool lambda in seconds."
   type = number

Diff for: variables.tf

+6
@@ -507,6 +507,12 @@ variable "enable_ephemeral_runners" {
   default = false
 }
 
+variable "enable_job_queued_check" {
+  description = "Only scale if the job event received by the scale up lambda is in the state queued. By default enabled for non ephemeral runners and disabled for ephemeral. Set this variable to overwrite the default behavior."
+  type = bool
+  default = null
+}
+
 variable "enable_managed_runner_security_group" {
   description = "Enabling the default managed security group creation. Unmanaged security groups can be specified via `runner_additional_security_group_ids`."
   type = bool
