This repository was archived by the owner on Jan 16, 2025. It is now read-only.

Commit 2b776ba

GuptaNavdeep1983, github-actions[bot], and npalm authored
feat!: replace registration tokens by JIT config for ephemeral runners (#3350)
* fix: ephemeral runners integration with new one-time-token.
* docs: auto update terraform docs
* fix: remove unused.
* fix: reverted few changes.
* fix: formatting
* fix: removed commented code.
* fix: resolution.
* token should be passed to SSM in config string
  Token was stored masked / redacted in the the paramater store. Which cause the runner was not starting.
* fix multi runner example
  - pass all labels to jitconfig
  - pass user to runner start
* fix pool and add debug logging g
* fix linting errors
* fix tflint errors
* fix: Runners module variable name change. (#3377)
  * fix: variable runner_extra_labels renamed.
  * docs: auto update terraform docs
  * fix pool and add debug logging g
  * fix: tests.
  * fix: coverage.
  ---------
  Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
  Co-authored-by: Niek Palm <[email protected]>
  Co-authored-by: Niek Palm <[email protected]>
* docs: auto update terraform docs
* format
* Update lambdas/functions/control-plane/src/scale-runners/scale-up.ts
  Co-authored-by: Niek Palm <[email protected]>
* fix: comments.
* multi runner inputs remain extra_labels
* update docs
* docs: auto update terraform docs
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Niek Palm <[email protected]>
Co-authored-by: Niek Palm <[email protected]>
1 parent 65f74e3 commit 2b776ba

32 files changed, +511 -322 lines changed

Diff for: .github/lint/tflint.tfvars

+1 -1
@@ -1,4 +1,4 @@
-aws_region = null
+aws_region = "eu-west-1"
 github_app = {
   id = "0"
   key_base64 = "0"

Diff for: README.md

+9 -7
@@ -35,14 +35,14 @@ This [Terraform](https://www.terraform.io/) module creates the required infrastr
 - [Sub modules](#sub-modules)
 - [Logging](#logging)
 - [Debugging](#debugging)
-- [Security Consideration](#security-consideration)
+- [Security Considerations](#security-considerations)
 - [Requirements](#requirements)
 - [Providers](#providers)
 - [Modules](#modules)
 - [Resources](#resources)
 - [Inputs](#inputs)
 - [Outputs](#outputs)
-- [Contribution](#contribution)
+- [Contributing](#contributing)
 - [Philips Forest](#philips-forest)

 ## Motivation
@@ -66,7 +66,7 @@ In AWS an [API gateway](https://docs.aws.amazon.com/apigateway/index.html) endpo

 The "scale up runner" lambda listens to the SQS queue and picks up events. The lambda runs various checks to decide whether a new EC2 spot instance needs to be created. For example, the instance is not created if the build is already started by an existing runner, or the maximum number of runners is reached.

-The Lambda first requests a registration token from GitHub, which is needed later by the runner to register itself. This avoids the case that the EC2 instance, which later in the process will install the agent, needs administration permissions to register the runner. Next, the EC2 spot instance is created via the launch template. The launch template defines the specifications of the required instance and contains a [`user_data`](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html) script. This script will install the required software and configure it. The registration token for the action runner is stored in the parameter store (SSM), from which the user data script will fetch it and delete it once it has been retrieved. Once the user data script is finished, the action runner should be online, and the workflow will start in seconds.
+The Lambda first requests a JIT configuration or registration token from GitHub, which is needed later by the runner to register itself. This avoids the case that the EC2 instance, which later in the process will install the agent, needs administration permissions to register the runner. Next, the EC2 spot instance is created via the launch template. The launch template defines the specifications of the required instance and contains a [`user_data`](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html) script. This script will install the required software and configure it. The registration token for the action runner is stored in the parameter store (SSM), from which the user data script will fetch it and delete it once it has been retrieved. Once the user data script is finished, the action runner should be online, and the workflow will start in seconds.

 Scaling down the runners is at the moment brute-forced, every configurable amount of minutes a lambda will check every runner (instance) if it is busy. In case the runner is not busy it will be removed from GitHub and the instance terminated in AWS. At the moment there seems to be no other option to scale down more smoothly.

@@ -92,7 +92,7 @@ To be able to support a number of use-cases the module has quite a lot of config
 - Multi-Runner module. This modules allows you to create multiple runner configurations with a single webhook and single GitHub App to simplify deployment of different types of runners. Refer to the [ReadMe](.modules/../modules/multi-runner/README.md) for more information to understand the functionality.
 - Workflow job event. You can configure the webhook in GitHub to send workflow job events to the webhook. Workflow job events were introduced by GitHub in September 2021 and are designed to support scalable runners. We advise using the workflow job event when possible.
 - Linux vs Windows. You can configure the OS types linux and win. Linux will be used by default.
-- Re-use vs Ephemeral. By default runners are re-used, until detected idle. Once idle they will be removed from the pool. To improve security we are introducing ephemeral runners. Those runners are only used for one job. Ephemeral runners are only working in combination with the workflow job event. We also suggest using a pre-build AMI to improve the start time of jobs.
+- Re-use vs Ephemeral. By default runners are re-used, until detected idle. Once idle they will be removed from the pool. To improve security we are introducing ephemeral runners. Those runners are only used for one job. Ephemeral runners are only working in combination with the workflow job event. For ephemeral runners the lambda requests a JIT (just in time) configuration object via the GitHub to register the runner. [JIT configuration](https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#using-just-in-time-runners) is limited to ephemeral runners, for non ephemeral a registration token is requested. In both cases the configuration is made available to the instance via the same SSM parameter. We also suggest using a pre-build AMI to improve the start time of jobs.
 - GitHub Cloud vs GitHub Enterprise Server (GHES). The runners support GitHub Cloud as well GitHub Enterprise Server. For GHES we rely on our community for support and testing. We have no possibility to test ourselves on GHES.
 - Spot vs on-demand. The runners use either the EC2 spot or on-demand life cycle. Runners will be created via the AWS [CreateFleet API](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_CreateFleet.html). The module (scale up lambda) will request via the CreateFleet API to create instances in one of the subnets and of the specified instance types.
 - ARM64 support via Graviton/Graviton2 instance-types. When using the default example or top-level module, specifying `instance_types` that match a Graviton/Graviton 2 (ARM64) architecture (e.g. a1, t4g or any 6th-gen `g` or `gd` type), you must also specify `runner_architecture = "arm64"` and the sub-modules will be automatically configured to provision with ARM64 AMIs and leverage GitHub's ARM64 action runner. See below for more details.
@@ -105,7 +105,7 @@ The module uses the AWS System Manager Parameter Store to store configuration fo
 | ----------- | ----------- |
 | `ssm_paths.root/var.prefix?/app/` | App secrets used by Lambda's |
 | `ssm_paths.root/var.prefix?/runners/config/<name>` | Configuration parameters used by runner start script |
-| `ssm_paths.root/var.prefix?/runners/tokens/<ec2-instance-id>` | Registration tokens for the runners generated by the scale-up lambda, consumed by the start script on the runner. |
+| `ssm_paths.root/var.prefix?/runners/tokens/<ec2-instance-id>` | Either JIT configuration (ephemeral runners) or registration tokens (non ephemeral runners) generated by the control plane (scale-up lambda), and consumed by the start script on the runner to activate / register the runner.

 Available configuration parameters:

@@ -330,7 +330,7 @@ You can configure runners to be ephemeral, runners will be used only for one job
 - All events in the queue will lead to a new runner created by the lambda. By setting `enable_job_queued_check` to `true` you can enforce a rule of only creating a runner if the event has a correlated queued job. Setting this can avoid creating useless runners, for example when jobs got cancelled before a runner was created or if the job was already picked up by another runner. We suggest using this in combination with a pool.
 - To ensure runners are created in the same order GitHub sends the events, by default we use a FIFO queue. This is mainly relevant for repo level runners. For ephemeral runners you can set `enable_fifo_build_queue` to `false`.
 - Errors related to scaling should be retried via SQS. You can configure `job_queue_retention_in_seconds` and `redrive_build_queue` to tune the behavior. We have no mechanism to avoid events never being processed, which means potentially no runner gets created and the job in GitHub times out in 6 hours.
-
+
 The example for [ephemeral runners](./examples/ephemeral) is based on the [default example](./examples/default). Have look at the diff to see the major configuration differences.

 ### Prebuilt Images
@@ -438,7 +438,9 @@ In case the setup does not work as intended follow the trace of events:

 ## Security Considerations

-This module creates resources in your AWS infrastructure, and EC2 instances for hosting the self-hosted runners on-demand. IAM permissions are set to a minimal level, and could be further limited by using permission boundaries. Instances permissions are limited to retrieve and delete the registration token, access the instance's own tags, and terminate the instance itself.
+This module creates resources in your AWS infrastructure, and EC2 instances for hosting the self-hosted runners on-demand. IAM permissions are set to a minimal level, and could be further limited by using permission boundaries. Instances permissions are limited to retrieve and delete the registration token, access the instance's own tags, and terminate the instance itself. By nature instances are short-lived, we strongly suggest to use ephemeral runners to ensure a safe build environment for each workflow job execution.
+
+Ephemeral runners are using the JIT configuration, confguration that only can be used once to activate a runner. For non-ephemeral runners this option is not provided by GitHub. For non-ephemeeral runners a registration token is passed via SSM. After using the token, the token is deleted. But the token remains valid and is potential available in memory on the runner. For ephemeral runners this problem is avoid by using just in time tokens.

 The examples are using standard AMI's for different operation systems. Instances are not hardened, and sudo operation are not blocked. To provide an out of the box working experience by default the module installs and configures the runner. However secrets are not hard coded, they finally end up in the memory of the instances. You can harden the instance by providing your own AMI and overwriting the cloud-init script.

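The README changes above describe two credential types that reach the instance through the same SSM parameter. A minimal TypeScript sketch of that selection, and of the parameter path listed in the README table, could look as follows. The names `credentialKindFor` and `tokenParameterPath` are hypothetical and do not appear in the commit, and the leading slash in the path is an assumption.

```typescript
// Illustrative sketch only; these helpers are hypothetical, not the module's code.

// Per the README: JIT configuration for ephemeral runners,
// a registration token for non-ephemeral runners.
type CredentialKind = 'jit-config' | 'registration-token';

function credentialKindFor(ephemeral: boolean): CredentialKind {
  return ephemeral ? 'jit-config' : 'registration-token';
}

// Builds `<root>/<prefix?>/runners/tokens/<ec2-instance-id>` as listed in the
// README table. The leading slash is an assumption about the parameter name.
function tokenParameterPath(root: string, instanceId: string, prefix?: string): string {
  const segments = [root, prefix, 'runners', 'tokens', instanceId]
    .filter((s): s is string => typeof s === 'string') // drop the optional prefix if unset
    .map((s) => s.replace(/^\/+|\/+$/g, '')) // trim stray slashes on each segment
    .filter((s) => s.length > 0);
  return '/' + segments.join('/');
}
```

Both credential kinds share one path per instance, so a start script only needs the instance id to locate its configuration.
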
Diff for: examples/default/.terraform.lock.hcl

+4 -1
Some generated files are not rendered by default.

Diff for: examples/multi-runner/.terraform.lock.hcl

+4 -1
Some generated files are not rendered by default.

Diff for: examples/multi-runner/main.tf

+2 -26
@@ -17,31 +17,8 @@ module "base" {
 }

 module "multi-runner" {
-  source = "../../modules/multi-runner"
-  multi_runner_config = local.multi_runner_config
-  # Alternative to loading runner configuration from Yaml files is using static configuration:
-  # multi_runner_config = {
-  #   "linux-x64" = {
-  #     matcherConfig : {
-  #       labelMatchers = [["self-hosted", "linux", "x64", "amazon"]]
-  #       exactMatch = false
-  #     }
-  #     fifo = true
-  #     delay_webhook_event = 0
-  #     runner_config = {
-  #       runner_os = "linux"
-  #       runner_architecture = "x64"
-  #       runner_name_prefix = "amazon-x64_"
-  #       create_service_linked_role_spot = true
-  #       enable_ssm_on_runners = true
-  #       instance_types = ["m5ad.large", "m5a.large"]
-  #       runner_extra_labels = "amazon"
-  #       runners_maximum_count = 1
-  #       enable_ephemeral_runners = true
-  #       scale_down_schedule_expression = "cron(* * * * ? *)"
-  #     }
-  #   }
-  # }
+  source = "../../modules/multi-runner"
+  multi_runner_config = local.multi_runner_config
   aws_region = local.aws_region
   vpc_id = module.base.vpc.vpc_id
   subnet_ids = module.base.vpc.private_subnets
@@ -68,5 +45,4 @@ module "multi-runner" {

   # Enable debug logging for the lambda functions
   # log_level = "debug"
-
 }

Diff for: lambdas/functions/control-plane/package.json

+2 -2
@@ -41,8 +41,8 @@
     "@aws-sdk/client-ec2": "^3.350.0",
     "@aws-sdk/types": "^3.347.0",
     "@octokit/auth-app": "4.0.13",
-    "@octokit/rest": "^19.0.7",
-    "@octokit/types": "^9.0.0",
+    "@octokit/rest": "19.0.12",
+    "@octokit/types": "^10.0.0",
     "@terraform-aws-github-runner/aws-powertools-util": "*",
     "@terraform-aws-github-runner/aws-ssm-util": "*",
     "cron-parser": "^4.8.1",

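For context, GitHub's REST API exposes `generate-jitconfig` endpoints for organization-level and repository-level self-hosted runners. The sketch below only builds a description of such a request, so it runs without Octokit. The names `RunnerScope`, `JitConfigRequest` and `buildJitConfigRequest` are hypothetical, the default runner group id of 1 is an assumption, and whether the dependency bump in this file relates to these endpoints is not stated in the commit.

```typescript
// Illustrative sketch only; the types and function are hypothetical,
// not taken from the commit or from Octokit.
type RunnerScope =
  | { kind: 'org'; org: string }
  | { kind: 'repo'; owner: string; repo: string };

interface JitConfigRequest {
  method: 'POST';
  path: string;
  body: { name: string; runner_group_id: number; labels: string[] };
}

// Describes the REST call that asks GitHub for a just-in-time runner configuration.
// runnerGroupId defaults to 1 here, which is an assumption for illustration.
function buildJitConfigRequest(
  scope: RunnerScope,
  runnerName: string,
  labels: string[],
  runnerGroupId = 1,
): JitConfigRequest {
  const base =
    scope.kind === 'org' ? `/orgs/${scope.org}` : `/repos/${scope.owner}/${scope.repo}`;
  return {
    method: 'POST',
    path: `${base}/actions/runners/generate-jitconfig`,
    body: { name: runnerName, runner_group_id: runnerGroupId, labels },
  };
}
```

The response to such a request carries an encoded JIT configuration, which per the README is what the control plane makes available to the instance via SSM for ephemeral runners.
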
Diff for: lambdas/functions/control-plane/src/aws/runners.d.ts

-2
@@ -26,11 +26,9 @@ export interface ListRunnerFilters {
 }

 export interface RunnerInputParameters {
-  runnerServiceConfig: string[];
   environment: string;
   runnerType: RunnerType;
   runnerOwner: string;
-  ssmTokenPath: string;
   subnets: string[];
   launchTemplateName: string;
   ec2instanceCriteria: {
