This repository was archived by the owner on Jan 16, 2025. It is now read-only.

Commit 340deea

npalm, github-actions[bot], and ScottGuymer authored
feat: SSM housekeeper (#3577)
# Description

The runner module uses SSM to provide the JIT config or token to the runner. If the runner does not start healthy, the SSM parameter is not deleted. This PR adds a Lambda that, by default, removes SSM parameters in the token path that are older than a day. The Lambda is deployed by default as part of the control plane and manages the tokens in the path used by the scale-up runner function.

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Scott Guymer <[email protected]>
1 parent f38f20a commit 340deea

20 files changed, +460 -14 lines changed

Diff for: README.md

+2-1
Original file line numberDiff line numberDiff line change
@@ -97,7 +97,7 @@ To be able to support a number of use-cases the module has quite a lot of config
9797

9898
### AWS SSM Parameters
9999

100-
The module uses the AWS System Manager Parameter Store to store configuration for the runners, as well as registration tokens and secrets for the Lambdas. Paths for the parameters can be configured via the variable `ssm_paths`. The location of the configuration parameters is retrieved by the runners via the instance tag `ghr:ssm_config_path`. The following default paths will be used.
100+
The module uses the AWS System Manager Parameter Store to store configuration for the runners, as well as registration tokens and secrets for the Lambdas. Paths for the parameters can be configured via the variable `ssm_paths`. The location of the configuration parameters is retrieved by the runners via the instance tag `ghr:ssm_config_path`. The following default paths will be used. Tokens or JIT config stored in the token path will be deleted after retrieval by the instance; data not deleted after a day will be deleted by an SSM housekeeper Lambda.
101101

102102
| Path | Description |
103103
| ----------- | ----------- |
@@ -585,6 +585,7 @@ We welcome any improvement to the standard module to make the default as secure
585585
| <a name="input_runners_maximum_count"></a> [runners\_maximum\_count](#input\_runners\_maximum\_count) | The maximum number of runners that will be created. | `number` | `3` | no |
586586
| <a name="input_runners_scale_down_lambda_timeout"></a> [runners\_scale\_down\_lambda\_timeout](#input\_runners\_scale\_down\_lambda\_timeout) | Time out for the scale down lambda in seconds. | `number` | `60` | no |
587587
| <a name="input_runners_scale_up_lambda_timeout"></a> [runners\_scale\_up\_lambda\_timeout](#input\_runners\_scale\_up\_lambda\_timeout) | Time out for the scale up lambda in seconds. | `number` | `30` | no |
588+
| <a name="input_runners_ssm_housekeeper"></a> [runners\_ssm\_housekeeper](#input\_runners\_ssm\_housekeeper) | Configuration for the SSM housekeeper lambda. This lambda deletes token / JIT config from SSM.<br><br> `schedule_expression`: is used to configure the schedule for the lambda.<br> `enabled`: enable or disable the lambda trigger via the EventBridge.<br> `lambda_timeout`: timeout for the lambda in seconds.<br> `config`: configuration for the lambda function. Token path will be read by default from the module. | <pre>object({<br> schedule_expression = optional(string, "rate(1 day)")<br> enabled = optional(bool, true)<br> lambda_timeout = optional(number, 60)<br> config = object({<br> tokenPath = optional(string)<br> minimumDaysOld = optional(number, 1)<br> dryRun = optional(bool, false)<br> })<br> })</pre> | <pre>{<br> "config": {}<br>}</pre> | no |
588589
| <a name="input_scale_down_schedule_expression"></a> [scale\_down\_schedule\_expression](#input\_scale\_down\_schedule\_expression) | Scheduler expression to check every x for scale down. | `string` | `"cron(*/5 * * * ? *)"` | no |
589590
| <a name="input_scale_up_reserved_concurrent_executions"></a> [scale\_up\_reserved\_concurrent\_executions](#input\_scale\_up\_reserved\_concurrent\_executions) | Amount of reserved concurrent executions for the scale-up lambda function. A value of 0 disables lambda from being triggered and -1 removes any concurrency limitations. | `number` | `1` | no |
590591
| <a name="input_ssm_paths"></a> [ssm\_paths](#input\_ssm\_paths) | The root path used in SSM to store configuration and secrets. | <pre>object({<br> root = optional(string, "github-action-runners")<br> app = optional(string, "app")<br> runners = optional(string, "runners")<br> use_prefix = optional(bool, true)<br> })</pre> | `{}` | no |

Diff for: lambdas/functions/control-plane/jest.config.ts

+4-4
@@ -6,10 +6,10 @@ const config: Config = {
   ...defaultConfig,
   coverageThreshold: {
     global: {
-      statements: 97.6,
-      branches: 94.6,
-      functions: 97,
-      lines: 98,
+      statements: 97.89,
+      branches: 94.64,
+      functions: 97.33,
+      lines: 98.21,
     },
   },
 };

Diff for: lambdas/functions/control-plane/package.json

+1-1
@@ -8,7 +8,7 @@
     "test": "NODE_ENV=test jest",
     "test:watch": "NODE_ENV=test jest --watch",
     "lint": "yarn eslint src",
-    "watch": "ts-node-dev --respawn --exit-child src/local.ts",
+    "watch": "ts-node-dev --respawn --exit-child src/local-ssm-housekeeper.ts",
     "build": "ncc build src/lambda.ts -o dist",
     "dist": "yarn build && cd dist && zip ../runners.zip index.js",
     "format": "prettier --write \"**/*.ts\"",

Diff for: lambdas/functions/control-plane/src/lambda.test.ts

+33-6
Original file line numberDiff line numberDiff line change
@@ -2,11 +2,12 @@ import { logger } from '@terraform-aws-github-runner/aws-powertools-util';
22
import { Context, SQSEvent, SQSRecord } from 'aws-lambda';
33
import { mocked } from 'jest-mock';
44

5-
import { adjustPool, scaleDownHandler, scaleUpHandler } from './lambda';
5+
import { adjustPool, scaleDownHandler, scaleUpHandler, ssmHousekeeper } from './lambda';
66
import { adjust } from './pool/pool';
77
import ScaleError from './scale-runners/ScaleError';
88
import { scaleDown } from './scale-runners/scale-down';
99
import { ActionRequestMessage, scaleUp } from './scale-runners/scale-up';
10+
import { cleanSSMTokens } from './scale-runners/ssm-housekeeper';
1011

1112
const body: ActionRequestMessage = {
1213
eventType: 'workflow_job',
@@ -61,6 +62,7 @@ const context: Context = {
6162
jest.mock('./scale-runners/scale-up');
6263
jest.mock('./scale-runners/scale-down');
6364
jest.mock('./pool/pool');
65+
jest.mock('./scale-runners/ssm-housekeeper');
6466
jest.mock('@terraform-aws-github-runner/aws-powertools-util');
6567

6668
// Docs for testing async with jest: https://jestjs.io/docs/tutorial-async
@@ -87,7 +89,7 @@ describe('Test scale up lambda wrapper.', () => {
8789
const error = new Error('Non scale should resolve.');
8890
const mock = mocked(scaleUp);
8991
mock.mockRejectedValue(error);
90-
await expect(scaleUpHandler(sqsEvent, context)).resolves;
92+
await expect(scaleUpHandler(sqsEvent, context)).resolves.not.toThrow();
9193
});
9294

9395
it('Scale should be rejected', async () => {
@@ -110,7 +112,7 @@ async function testInvalidRecords(sqsRecords: SQSRecord[]) {
110112
Records: sqsRecords,
111113
};
112114

113-
await expect(scaleUpHandler(sqsEventMultipleRecords, context)).resolves;
115+
await expect(scaleUpHandler(sqsEventMultipleRecords, context)).resolves.not.toThrow();
114116

115117
expect(logWarnSpy).toHaveBeenCalledWith(
116118
expect.stringContaining(
@@ -127,14 +129,14 @@ describe('Test scale down lambda wrapper.', () => {
127129
resolve();
128130
});
129131
});
130-
await expect(scaleDownHandler({}, context)).resolves;
132+
await expect(scaleDownHandler({}, context)).resolves.not.toThrow();
131133
});
132134

133135
it('Scaling down with error.', async () => {
134136
const error = new Error('Scaling down with error.');
135137
const mock = mocked(scaleDown);
136138
mock.mockRejectedValue(error);
137-
await expect(await scaleDownHandler({}, context)).resolves;
139+
await expect(scaleDownHandler({}, context)).resolves.not.toThrow();
138140
});
139141
});
140142

@@ -146,7 +148,7 @@ describe('Adjust pool.', () => {
146148
resolve();
147149
});
148150
});
149-
await expect(adjustPool({ poolSize: 2 }, context)).resolves;
151+
await expect(adjustPool({ poolSize: 2 }, context)).resolves.not.toThrow();
150152
});
151153

152154
it('Handle error for adjusting pool.', async () => {
@@ -158,3 +160,28 @@ describe('Adjust pool.', () => {
158160
expect(logSpy).lastCalledWith(expect.stringContaining(error.message), expect.anything());
159161
});
160162
});
163+
164+
describe('Test ssm housekeeper lambda wrapper.', () => {
165+
it('Invoke without errors.', async () => {
166+
const mock = mocked(cleanSSMTokens);
167+
mock.mockImplementation(() => {
168+
return new Promise((resolve) => {
169+
resolve();
170+
});
171+
});
172+
173+
process.env.SSM_CLEANUP_CONFIG = JSON.stringify({
174+
dryRun: false,
175+
minimumDaysOld: 1,
176+
tokenPath: '/path/to/tokens/',
177+
});
178+
179+
await expect(ssmHousekeeper({}, context)).resolves.not.toThrow();
180+
});
181+
182+
it('Errors not thrown.', async () => {
183+
const mock = mocked(cleanSSMTokens);
184+
mock.mockRejectedValue(new Error());
185+
await expect(ssmHousekeeper({}, context)).resolves.not.toThrow();
186+
});
187+
});

Diff for: lambdas/functions/control-plane/src/lambda.ts

+13
@@ -6,6 +6,7 @@ import { PoolEvent, adjust } from './pool/pool';
 import ScaleError from './scale-runners/ScaleError';
 import { scaleDown } from './scale-runners/scale-down';
 import { scaleUp } from './scale-runners/scale-up';
+import { SSMCleanupOptions, cleanSSMTokens } from './scale-runners/ssm-housekeeper';

 export async function scaleUpHandler(event: SQSEvent, context: Context): Promise<void> {
   setContext(context, 'lambda.ts');
@@ -48,3 +49,15 @@ export async function adjustPool(event: PoolEvent, context: Context): Promise<vo
     logger.error(`${(e as Error).message}`, { error: e as Error });
   }
 }
+
+export async function ssmHousekeeper(event: unknown, context: Context): Promise<void> {
+  setContext(context, 'lambda.ts');
+  logger.logEventIfEnabled(event);
+  const config = JSON.parse(process.env.SSM_CLEANUP_CONFIG) as SSMCleanupOptions;
+
+  try {
+    await cleanSSMTokens(config);
+  } catch (e) {
+    logger.error(`${(e as Error).message}`, { error: e as Error });
+  }
+}
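
In the `ssmHousekeeper` handler, `JSON.parse` runs before the `try` block, so a missing or malformed `SSM_CLEANUP_CONFIG` would reject the handler instead of being logged. The following is a minimal sketch of a stricter parse step, not part of this commit. `parseCleanupConfig` is a hypothetical helper, and treating a missing `dryRun` as `false` is an assumption based on the Terraform default shown in the README.

```typescript
// Mirrors the SSMCleanupOptions interface added in this commit.
interface SSMCleanupOptions {
  dryRun: boolean;
  minimumDaysOld: number;
  tokenPath: string;
}

// Hypothetical helper (not in the commit): parse and check the raw env value.
function parseCleanupConfig(raw: string | undefined): SSMCleanupOptions {
  if (!raw) {
    throw new Error('SSM_CLEANUP_CONFIG is not set');
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (e) {
    throw new Error(`SSM_CLEANUP_CONFIG is not valid JSON: ${(e as Error).message}`);
  }
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('SSM_CLEANUP_CONFIG must be a JSON object');
  }
  const candidate = parsed as Partial<SSMCleanupOptions>;
  if (typeof candidate.tokenPath !== 'string' || candidate.tokenPath.length === 0) {
    throw new Error('tokenPath must be defined');
  }
  if (typeof candidate.minimumDaysOld !== 'number' || candidate.minimumDaysOld < 1) {
    throw new Error('minimumDaysOld must be greater than 0');
  }
  return {
    // Assumption: anything other than an explicit `true` means a real (non-dry) run.
    dryRun: candidate.dryRun === true,
    minimumDaysOld: candidate.minimumDaysOld,
    tokenPath: candidate.tokenPath,
  };
}
```

Calling such a helper inside the handler's `try` block would route configuration errors through the same `logger.error` path as cleanup failures.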

Diff for: lambdas/functions/control-plane/src/local-ssm-housekeeper.ts

+15
@@ -0,0 +1,15 @@
+import { cleanSSMTokens } from './scale-runners/ssm-housekeeper';
+
+export function run(): void {
+  cleanSSMTokens({
+    dryRun: true,
+    minimumDaysOld: 3,
+    tokenPath: '/ghr/my-env/runners/tokens',
+  })
+    .then()
+    .catch((e) => {
+      console.log(e);
+    });
+}
+
+run();

Diff for: lambdas/functions/control-plane/src/modules.d.ts

+1
@@ -14,6 +14,7 @@ declare namespace NodeJS {
     RUNNER_OWNER: string;
     SCALE_DOWN_CONFIG: string;
     SSM_TOKEN_PATH: string;
+    SSM_CLEANUP_CONFIG: string;
     SUBNET_IDS: string;
     INSTANCE_TYPES: string;
     INSTANCE_TARGET_CAPACITY_TYPE: 'on-demand' | 'spot';
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,117 @@
1+
import { DeleteParameterCommand, GetParametersByPathCommand, SSMClient } from '@aws-sdk/client-ssm';
2+
import { mockClient } from 'aws-sdk-client-mock';
3+
import 'aws-sdk-client-mock-jest';
4+
import { cleanSSMTokens } from './ssm-housekeeper';
5+
6+
process.env.AWS_REGION = 'eu-east-1';
7+
8+
const mockSSMClient = mockClient(SSMClient);
9+
10+
const deleteAmisOlderThenDays = 1;
11+
const now = new Date();
12+
const dateOld = new Date();
13+
dateOld.setDate(dateOld.getDate() - deleteAmisOlderThenDays - 1);
14+
15+
const tokenPath = '/path/to/tokens/';
16+
17+
describe('clean SSM tokens / JIT config', () => {
18+
beforeEach(() => {
19+
mockSSMClient.reset();
20+
mockSSMClient.on(GetParametersByPathCommand).resolves({
21+
Parameters: undefined,
22+
});
23+
mockSSMClient.on(GetParametersByPathCommand, { Path: tokenPath }).resolves({
24+
Parameters: [
25+
{
26+
Name: tokenPath + 'i-old-01',
27+
LastModifiedDate: dateOld,
28+
},
29+
],
30+
NextToken: 'next',
31+
});
32+
mockSSMClient.on(GetParametersByPathCommand, { Path: tokenPath, NextToken: 'next' }).resolves({
33+
Parameters: [
34+
{
35+
Name: tokenPath + 'i-new-01',
36+
LastModifiedDate: now,
37+
},
38+
],
39+
NextToken: undefined,
40+
});
41+
});
42+
43+
it('should delete parameters older than minimumDaysOld', async () => {
44+
await cleanSSMTokens({
45+
dryRun: false,
46+
minimumDaysOld: deleteAmisOlderThenDays,
47+
tokenPath: tokenPath,
48+
});
49+
50+
expect(mockSSMClient).toHaveReceivedCommandWith(GetParametersByPathCommand, { Path: tokenPath });
51+
expect(mockSSMClient).toHaveReceivedCommandWith(DeleteParameterCommand, { Name: tokenPath + 'i-old-01' });
52+
expect(mockSSMClient).not.toHaveReceivedCommandWith(DeleteParameterCommand, { Name: tokenPath + 'i-new-01' });
53+
});
54+
55+
it('should not delete when dry run is activated', async () => {
56+
await cleanSSMTokens({
57+
dryRun: true,
58+
minimumDaysOld: deleteAmisOlderThenDays,
59+
tokenPath: tokenPath,
60+
});
61+
62+
expect(mockSSMClient).toHaveReceivedCommandWith(GetParametersByPathCommand, { Path: tokenPath });
63+
expect(mockSSMClient).not.toHaveReceivedCommandWith(DeleteParameterCommand, { Name: tokenPath + 'i-old-01' });
64+
expect(mockSSMClient).not.toHaveReceivedCommandWith(DeleteParameterCommand, { Name: tokenPath + 'i-new-01' });
65+
});
66+
67+
it('should not call delete when no parameters are found.', async () => {
68+
await expect(
69+
cleanSSMTokens({
70+
dryRun: false,
71+
minimumDaysOld: deleteAmisOlderThenDays,
72+
tokenPath: 'no-exist',
73+
}),
74+
).resolves.not.toThrow();
75+
76+
expect(mockSSMClient).not.toHaveReceivedCommandWith(DeleteParameterCommand, { Name: tokenPath + 'i-old-01' });
77+
expect(mockSSMClient).not.toHaveReceivedCommandWith(DeleteParameterCommand, { Name: tokenPath + 'i-new-01' });
78+
});
79+
80+
it('should not error on delete failure.', async () => {
81+
mockSSMClient.on(DeleteParameterCommand).rejects(new Error('ParameterNotFound'));
82+
83+
await expect(
84+
cleanSSMTokens({
85+
dryRun: false,
86+
minimumDaysOld: deleteAmisOlderThenDays,
87+
tokenPath: tokenPath,
88+
}),
89+
).resolves.not.toThrow();
90+
});
91+
92+
it('should only accept valid options.', async () => {
93+
await expect(
94+
cleanSSMTokens({
95+
dryRun: false,
96+
minimumDaysOld: undefined as unknown as number,
97+
tokenPath: tokenPath,
98+
}),
99+
).rejects.toBeInstanceOf(Error);
100+
101+
await expect(
102+
cleanSSMTokens({
103+
dryRun: false,
104+
minimumDaysOld: 0,
105+
tokenPath: tokenPath,
106+
}),
107+
).rejects.toBeInstanceOf(Error);
108+
109+
await expect(
110+
cleanSSMTokens({
111+
dryRun: false,
112+
minimumDaysOld: 1,
113+
tokenPath: undefined as unknown as string,
114+
}),
115+
).rejects.toBeInstanceOf(Error);
116+
});
117+
});
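
The mocks in this test return two pages linked by `NextToken: 'next'`. The pagination pattern they exercise can be sketched independently of the AWS SDK. `Page`, `collectAllPages`, and `fakeFetch` are hypothetical names used only for illustration. Unlike the loop in `cleanSSMTokens`, which appends later pages onto the first response's `Parameters` array through optional chaining, this sketch accumulates into a separate array, so later pages are kept even if the first page carries no items.

```typescript
// Shape of one page of results; field names are illustrative, not the AWS SDK's.
interface Page<T> {
  items?: T[];
  nextToken?: string;
}

// Hypothetical helper: follow nextToken until it is absent, accumulating into a
// separate array so a page without items cannot swallow later pages.
async function collectAllPages<T>(fetchPage: (nextToken?: string) => Promise<Page<T>>): Promise<T[]> {
  const all: T[] = [];
  let token: string | undefined = undefined;
  do {
    const page: Page<T> = await fetchPage(token);
    all.push(...(page.items ?? []));
    token = page.nextToken;
  } while (token);
  return all;
}

// Fake two-page source mirroring the mocked responses in the test above.
async function fakeFetch(nextToken?: string): Promise<Page<string>> {
  return nextToken === undefined
    ? { items: ['i-old-01'], nextToken: 'next' }
    : { items: ['i-new-01'] };
}

collectAllPages(fakeFetch).then((names) => console.log(names));
```

Injecting `fetchPage` keeps the loop testable without mocking a client; in real use it would wrap a `GetParametersByPathCommand` call.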

Diff for: lambdas/functions/control-plane/src/scale-runners/ssm-housekeeper.ts

+61
@@ -0,0 +1,61 @@
+import { DeleteParameterCommand, GetParametersByPathCommand, SSMClient } from '@aws-sdk/client-ssm';
+import { logger } from '@terraform-aws-github-runner/aws-powertools-util';
+
+export interface SSMCleanupOptions {
+  dryRun: boolean;
+  minimumDaysOld: number;
+  tokenPath: string;
+}
+
+function validateOptions(options: SSMCleanupOptions): void {
+  const errorMessages: string[] = [];
+  if (!options.minimumDaysOld || options.minimumDaysOld < 1) {
+    errorMessages.push(`minimumDaysOld must be greater than 0, value is set to "${options.minimumDaysOld}"`);
+  }
+  if (!options.tokenPath) {
+    errorMessages.push('tokenPath must be defined');
+  }
+  if (errorMessages.length > 0) {
+    throw new Error(errorMessages.join(', '));
+  }
+}
+
+export async function cleanSSMTokens(options: SSMCleanupOptions): Promise<void> {
+  logger.info(`Cleaning tokens / JIT config older than ${options.minimumDaysOld} days, dryRun: ${options.dryRun}`);
+  logger.debug('Cleaning with options', { options });
+  validateOptions(options);
+
+  const client = new SSMClient({ region: process.env.AWS_REGION });
+  const parameters = await client.send(new GetParametersByPathCommand({ Path: options.tokenPath }));
+  while (parameters.NextToken) {
+    const nextParameters = await client.send(
+      new GetParametersByPathCommand({ Path: options.tokenPath, NextToken: parameters.NextToken }),
+    );
+    parameters.Parameters?.push(...(nextParameters.Parameters ?? []));
+    parameters.NextToken = nextParameters.NextToken;
+  }
+  logger.info(`Found #${parameters.Parameters?.length} parameters in path ${options.tokenPath}`);
+  logger.debug('Found parameters', { parameters });
+
+  // minimumDate = today - minimumDaysOld
+  const minimumDate = new Date();
+  minimumDate.setDate(minimumDate.getDate() - options.minimumDaysOld);
+
+  for (const parameter of parameters.Parameters ?? []) {
+    if (parameter.LastModifiedDate && new Date(parameter.LastModifiedDate) < minimumDate) {
+      logger.info(`Deleting parameter ${parameter.Name} with last modified date ${parameter.LastModifiedDate}`);
+      try {
+        if (!options.dryRun) {
+          // sleep 50ms to avoid rate limit
+          await new Promise((resolve) => setTimeout(resolve, 50));
+          await client.send(new DeleteParameterCommand({ Name: parameter.Name }));
+        }
+      } catch (e) {
+        logger.warn(`Failed to delete parameter ${parameter.Name} with error ${(e as Error).message}`);
+        logger.debug('Failed to delete parameter', { e });
+      }
+    } else {
+      logger.debug(`Skipping parameter ${parameter.Name} with last modified date ${parameter.LastModifiedDate}`);
+    }
+  }
+}
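
The age check in `cleanSSMTokens` subtracts `minimumDaysOld` calendar days from the current date and deletes only parameters modified strictly before that cutoff. A minimal sketch of the same arithmetic as a pure function follows. `isOlderThanDays` is a hypothetical helper, and the injectable `now` argument is an assumption added so the boundary behavior can be checked deterministically.

```typescript
// Hypothetical helper: same date arithmetic as cleanSSMTokens, with `now` injectable.
function isOlderThanDays(lastModified: Date, minimumDaysOld: number, now: Date = new Date()): boolean {
  // Copy `now` so the caller's Date is not mutated by setDate.
  const minimumDate = new Date(now.getTime());
  minimumDate.setDate(minimumDate.getDate() - minimumDaysOld);
  // Strict comparison: a parameter modified exactly at the cutoff is kept.
  return lastModified.getTime() < minimumDate.getTime();
}

const referenceNow = new Date(2024, 0, 10, 12, 0, 0);
console.log(isOlderThanDays(new Date(2024, 0, 8, 12, 0, 0), 1, referenceNow));
console.log(isOlderThanDays(new Date(2024, 0, 10, 11, 0, 0), 1, referenceNow));
```

Because `setDate` works on the local calendar, the cutoff follows the runtime's time zone rather than a fixed 24-hour offset.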

Diff for: main.tf

+2-1
@@ -277,6 +277,8 @@ module "runners" {
   pool_lambda_timeout = var.pool_lambda_timeout
   pool_runner_owner = var.pool_runner_owner
   pool_lambda_reserved_concurrent_executions = var.pool_lambda_reserved_concurrent_executions
+
+  ssm_housekeeper = var.runners_ssm_housekeeper
 }

 module "runner_binaries" {
@@ -318,7 +320,6 @@ module "runner_binaries" {
   lambda_security_group_ids = var.lambda_security_group_ids
   aws_partition = var.aws_partition

-
   lambda_principals = var.lambda_principals
 }
324325

Diff for: modules/multi-runner/README.md

+1
Original file line numberDiff line numberDiff line change
@@ -166,6 +166,7 @@ module "multi-runner" {
166166
| <a name="input_runners_lambda_zip"></a> [runners\_lambda\_zip](#input\_runners\_lambda\_zip) | File location of the lambda zip file for scaling runners. | `string` | `null` | no |
167167
| <a name="input_runners_scale_down_lambda_timeout"></a> [runners\_scale\_down\_lambda\_timeout](#input\_runners\_scale\_down\_lambda\_timeout) | Time out for the scale down lambda in seconds. | `number` | `60` | no |
168168
| <a name="input_runners_scale_up_lambda_timeout"></a> [runners\_scale\_up\_lambda\_timeout](#input\_runners\_scale\_up\_lambda\_timeout) | Time out for the scale up lambda in seconds. | `number` | `30` | no |
169+
| <a name="input_runners_ssm_housekeeper"></a> [runners\_ssm\_housekeeper](#input\_runners\_ssm\_housekeeper) | Configuration for the SSM housekeeper lambda. This lambda deletes token / JIT config from SSM.<br><br> `schedule_expression`: is used to configure the schedule for the lambda.<br> `enabled`: enable or disable the lambda trigger via the EventBridge.<br> `lambda_timeout`: timeout for the lambda in seconds.<br> `config`: configuration for the lambda function. Token path will be read by default from the module. | <pre>object({<br> schedule_expression = optional(string, "rate(1 day)")<br> enabled = optional(bool, true)<br> lambda_timeout = optional(number, 60)<br> config = object({<br> tokenPath = optional(string)<br> minimumDaysOld = optional(number, 1)<br> dryRun = optional(bool, false)<br> })<br> })</pre> | <pre>{<br> "config": {}<br>}</pre> | no |
169170
| <a name="input_ssm_paths"></a> [ssm\_paths](#input\_ssm\_paths) | The root path used in SSM to store configuration and secrets. | <pre>object({<br> root = optional(string, "github-action-runners")<br> app = optional(string, "app")<br> runners = optional(string, "runners")<br> })</pre> | `{}` | no |
170171
| <a name="input_subnet_ids"></a> [subnet\_ids](#input\_subnet\_ids) | List of subnets in which the action runners will be launched, the subnets needs to be subnets in the `vpc_id`. | `list(string)` | n/a | yes |
171172
| <a name="input_syncer_lambda_s3_key"></a> [syncer\_lambda\_s3\_key](#input\_syncer\_lambda\_s3\_key) | S3 key for syncer lambda function. Required if using S3 bucket to specify lambdas. | `string` | `null` | no |

Diff for: modules/multi-runner/runners.tf

+2
@@ -104,4 +104,6 @@ module "runners" {
   pool_runner_owner = each.value.runner_config.pool_runner_owner
   pool_lambda_reserved_concurrent_executions = var.pool_lambda_reserved_concurrent_executions
   associate_public_ipv4_address = var.associate_public_ipv4_address
+
+  ssm_housekeeper = var.runners_ssm_housekeeper
 }
