Skip to content
This repository was archived by the owner on Jan 16, 2025. It is now read-only.

Commit 970e8a6

Browse files
GuptaNavdeep1983github-actions[bot]npalm
authored
feat: added changes to enable tracing in lambdas. (#3554)
This PR addresses the need to enable tracing for the lambdas used in the runners architecture: # Highlights: - This feature enables the tracing in all lambdas which allow to debug/investigate any issues that arise out of day-to-day use of runners infrastructure. - If user decides to add all the features provisioned in the PR, user should be able to find the complete linked trace between the time a webhook is triggered with workflow job event to the API gateway endpoint to the execution of scale up lambda which creates a new runner to fulfill the need of creating a new runner and also find the relevant logs linked to the trace in **AWS CloudWatch ServiceLens**. As of result, user need not navigate to various log groups to find any issue in any given service. - Please find the X-ray costing in this [link](https://aws.amazon.com/xray/pricing/) detailing the cost involved in enabling this feature. # Additions: - [x] Provide an option to enable traces in EC2 bash script which allows to find and link any issues that may arise out of starting the runner and find this information linked in the trace created out of this feature. # Options: Use Cloudwatch config agent which now supports to capture traces ([link](https://aws.amazon.com/about-aws/whats-new/2023/08/amazon-cloudwatch-agent-opentelemetry-traces-x-ray/)) can be used to capture traces and link them to the log groups. --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Niek Palm <[email protected]>
1 parent d85511a commit 970e8a6

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

63 files changed

+531
-197
lines changed

Diff for: README.md

+14-1
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,7 @@ This [Terraform](https://www.terraform.io/) module creates the required infrastr
3232
- [Examples](#examples)
3333
- [Sub modules](#sub-modules)
3434
- [Logging](#logging)
35+
- [Tracing](#tracing)
3536
- [Debugging](#debugging)
3637
- [Security Considerations](#security-considerations)
3738
- [Requirements](#requirements)
@@ -427,6 +428,17 @@ An example log message of the scale-up function:
427428
}
428429
}
429430
```
431+
## Tracing
432+
For the distributed architecture of this application it can be difficult to troubleshoot this application.
433+
We support the option to enable tracing for all the lambda functions created by this application. To enable tracing user can simply provide the `tracing_config` option inside the root module or inner modules.
434+
435+
This tracing config generates timelines for following events:
436+
- Basic lifecycle of lambda function
437+
- Traces for Github API calls (can be configured by capture_http_requests).
438+
- Traces for all AWS SDK calls
439+
440+
This feature has been disabled by default.
441+
430442

431443
## Debugging
432444

@@ -543,7 +555,7 @@ We welcome any improvement to the standard module to make the default as secure
543555
| <a name="input_lambda_s3_bucket"></a> [lambda\_s3\_bucket](#input\_lambda\_s3\_bucket) | S3 bucket from which to specify lambda functions. This is an alternative to providing local files directly. | `string` | `null` | no |
544556
| <a name="input_lambda_security_group_ids"></a> [lambda\_security\_group\_ids](#input\_lambda\_security\_group\_ids) | List of security group IDs associated with the Lambda function. | `list(string)` | `[]` | no |
545557
| <a name="input_lambda_subnet_ids"></a> [lambda\_subnet\_ids](#input\_lambda\_subnet\_ids) | List of subnets in which the action runners will be launched, the subnets needs to be subnets in the `vpc_id`. | `list(string)` | `[]` | no |
546-
| <a name="input_lambda_tracing_mode"></a> [lambda\_tracing\_mode](#input\_lambda\_tracing\_mode) | Enable X-Ray tracing for the lambda functions. | `string` | `null` | no |
558+
| <a name="input_lambda_tracing_mode"></a> [lambda\_tracing\_mode](#input\_lambda\_tracing\_mode) | DEPRECATED: Replaced by `tracing_config`. | `string` | `null` | no |
547559
| <a name="input_log_level"></a> [log\_level](#input\_log\_level) | Logging level for lambda logging. Valid values are 'silly', 'trace', 'debug', 'info', 'warn', 'error', 'fatal'. | `string` | `"info"` | no |
548560
| <a name="input_logging_kms_key_id"></a> [logging\_kms\_key\_id](#input\_logging\_kms\_key\_id) | Specifies the kms key id to encrypt the logs with. | `string` | `null` | no |
549561
| <a name="input_logging_retention_in_days"></a> [logging\_retention\_in\_days](#input\_logging\_retention\_in\_days) | Specifies the number of days you want to retain log events for the lambda log group. Possible values are: 0, 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, and 3653. | `number` | `180` | no |
@@ -593,6 +605,7 @@ We welcome any improvement to the standard module to make the default as secure
593605
| <a name="input_syncer_lambda_s3_key"></a> [syncer\_lambda\_s3\_key](#input\_syncer\_lambda\_s3\_key) | S3 key for syncer lambda function. Required if using an S3 bucket to specify lambdas. | `string` | `null` | no |
594606
| <a name="input_syncer_lambda_s3_object_version"></a> [syncer\_lambda\_s3\_object\_version](#input\_syncer\_lambda\_s3\_object\_version) | S3 object version for syncer lambda function. Useful if S3 versioning is enabled on source bucket. | `string` | `null` | no |
595607
| <a name="input_tags"></a> [tags](#input\_tags) | Map of tags that will be added to created resources. By default resources will be tagged with name and environment. | `map(string)` | `{}` | no |
608+
| <a name="input_tracing_config"></a> [tracing\_config](#input\_tracing\_config) | Configuration for lambda tracing. | <pre>object({<br> mode = optional(string, null)<br> capture_http_requests = optional(bool, false)<br> capture_error = optional(bool, false)<br> })</pre> | `{}` | no |
596609
| <a name="input_userdata_post_install"></a> [userdata\_post\_install](#input\_userdata\_post\_install) | Script to be ran after the GitHub Actions runner is installed on the EC2 instances | `string` | `""` | no |
597610
| <a name="input_userdata_pre_install"></a> [userdata\_pre\_install](#input\_userdata\_pre\_install) | Script to be ran before the GitHub Actions runner is installed on the EC2 instances | `string` | `""` | no |
598611
| <a name="input_userdata_template"></a> [userdata\_template](#input\_userdata\_template) | Alternative user-data template, replacing the default template. By providing your own user\_data you have to take care of installing all required software, including the action runner. Variables userdata\_pre/post\_install are ignored. | `string` | `null` | no |

Diff for: examples/ephemeral/main.tf

+12-3
Original file line numberDiff line numberDiff line change
@@ -69,11 +69,20 @@ module "runners" {
6969
#
7070
enable_job_queued_check = true
7171

72+
# tracing_config = {
73+
# mode = "Active"
74+
# capture_error = true
75+
# capture_http_requests = true
76+
# }
77+
78+
7279
# configure your pre-built AMI
7380
# enable_userdata = false
74-
# ami_filter = { name = ["github-runner-al2023-x86_64-*"], state = ["available"] }
75-
# data "aws_caller_identity" "current" {}
76-
# ami_owners = [data.aws_caller_identity.current.account_id]
81+
# ami_filter = { name = ["github-runner-al2023-x86_64-*"], state = ["available"] }
82+
# ami_owners = [data.aws_caller_identity.current.account_id]
83+
84+
# or use the default AMI
85+
# enable_userdata = true
7786

7887
# Enable debug logging for the lambda functions
7988
# log_level = "debug"

Diff for: examples/multi-runner/main.tf

+6
Original file line numberDiff line numberDiff line change
@@ -57,6 +57,12 @@ module "runners" {
5757
id = var.github_app.id
5858
webhook_secret = random_id.random.hex
5959
}
60+
# enable this section for tracing
61+
# tracing_config = {
62+
# mode = "Active"
63+
# capture_error = true
64+
# capture_http_requests = true
65+
# }
6066
# Assuming local build lambda's to use pre build ones, uncomment the lines below and download the
6167
# lambda zip files lambda_download
6268
# webhook_lambda_zip = "../lambdas-download/webhook.zip"

Diff for: lambdas/functions/ami-housekeeper/src/ami.ts

+5-4
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,7 @@ import {
1010
} from '@aws-sdk/client-ec2';
1111
import { DescribeParametersCommand, GetParameterCommand, SSMClient } from '@aws-sdk/client-ssm';
1212
import { createChildLogger } from '@terraform-aws-github-runner/aws-powertools-util';
13+
import { getTracedAWSV3Client } from '@terraform-aws-github-runner/aws-powertools-util';
1314

1415
const logger = createChildLogger('ami');
1516

@@ -82,7 +83,7 @@ async function getAmisNotInUse(options: AmiCleanupOptions) {
8283
const amiIdsInSSM = await getAmisReferedInSSM(options);
8384
const amiIdsInTemplates = await getAmiInLatestTemplates(options);
8485

85-
const ec2Client = new EC2Client({});
86+
const ec2Client = getTracedAWSV3Client(new EC2Client({}));
8687
logger.debug('Getting all AMIs from ec2 with filters', { filters: options.amiFilters });
8788
const amiEc2 = await ec2Client.send(
8889
new DescribeImagesCommand({
@@ -133,7 +134,7 @@ async function deleteAmi(amiDetails: Image, options: AmiCleanupOptionsInternal):
133134

134135
try {
135136
logger.info(`deleting ami ${amiDetails.Name || amiDetails.ImageId} created at ${amiDetails.CreationDate}`);
136-
const ec2Client = new EC2Client({});
137+
const ec2Client = getTracedAWSV3Client(new EC2Client({}));
137138
await ec2Client.send(new DeregisterImageCommand({ ImageId: amiDetails.ImageId, DryRun: options.dryRun }));
138139
await deleteSnapshot(options, amiDetails, ec2Client);
139140
} catch (error) {
@@ -158,7 +159,7 @@ async function deleteSnapshot(options: AmiCleanupOptions, amiDetails: Image, ec2
158159
}
159160

160161
async function getAmiInLatestTemplates(options: AmiCleanupOptions): Promise<(string | undefined)[]> {
161-
const ec2Client = new EC2Client({});
162+
const ec2Client = getTracedAWSV3Client(new EC2Client({}));
162163
const launnchTemplates = await ec2Client.send(
163164
new DescribeLaunchTemplatesCommand({
164165
LaunchTemplateNames: options.launchTemplateNames,
@@ -188,7 +189,7 @@ async function getAmisReferedInSSM(options: AmiCleanupOptions): Promise<(string
188189
return [];
189190
}
190191

191-
const ssmClient = new SSMClient({});
192+
const ssmClient = getTracedAWSV3Client(new SSMClient({}));
192193
const ssmParams = await ssmClient.send(
193194
new DescribeParametersCommand({
194195
ParameterFilters: [

Diff for: lambdas/functions/control-plane/package.json

+2
Original file line numberDiff line numberDiff line change
@@ -40,11 +40,13 @@
4040
"dependencies": {
4141
"@aws-sdk/client-ec2": "^3.436.0",
4242
"@aws-sdk/types": "^3.433.0",
43+
"@middy/core": "^3.6.2",
4344
"@octokit/auth-app": "6.0.1",
4445
"@octokit/rest": "20.0.2",
4546
"@octokit/types": "^12.1.1",
4647
"@terraform-aws-github-runner/aws-powertools-util": "*",
4748
"@terraform-aws-github-runner/aws-ssm-util": "*",
49+
"axios": "^1.5.1",
4850
"cron-parser": "^4.8.1",
4951
"typescript": "^5.0.4"
5052
}

Diff for: lambdas/functions/control-plane/src/aws/runners.d.ts

+1
Original file line numberDiff line numberDiff line change
@@ -39,4 +39,5 @@ export interface RunnerInputParameters {
3939
};
4040
numberOfRunners?: number;
4141
amiIdSsmParameterName?: string;
42+
tracingEnabled?: boolean;
4243
}

Diff for: lambdas/functions/control-plane/src/aws/runners.test.ts

+17
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,7 @@ import {
1010
TerminateInstancesCommand,
1111
} from '@aws-sdk/client-ec2';
1212
import { GetParameterCommand, GetParameterResult, PutParameterCommand, SSMClient } from '@aws-sdk/client-ssm';
13+
import { tracer } from '@terraform-aws-github-runner/aws-powertools-util';
1314
import { mockClient } from 'aws-sdk-client-mock';
1415
import 'aws-sdk-client-mock-jest';
1516

@@ -236,6 +237,15 @@ describe('create runner', () => {
236237
Name: 'my-ami-id-param',
237238
});
238239
});
240+
it('calls create fleet of 1 instance with runner tracing enabled', async () => {
241+
tracer.getRootXrayTraceId = jest.fn().mockReturnValue('123');
242+
243+
await createRunner(createRunnerConfig({ ...defaultRunnerConfig, tracingEnabled: true }));
244+
245+
expect(mockEC2Client).toHaveReceivedCommandWith(CreateFleetCommand, {
246+
...expectedCreateFleetRequest({ ...defaultExpectedFleetRequestValues, tracingEnabled: true }),
247+
});
248+
});
239249
});
240250

241251
describe('create runner with errors', () => {
@@ -350,6 +360,7 @@ interface RunnerConfig {
350360
allocationStrategy: SpotAllocationStrategy;
351361
maxSpotPrice?: string;
352362
amiIdSsmParameterName?: string;
363+
tracingEnabled?: boolean;
353364
}
354365

355366
function createRunnerConfig(runnerConfig: RunnerConfig): RunnerInputParameters {
@@ -366,6 +377,7 @@ function createRunnerConfig(runnerConfig: RunnerConfig): RunnerInputParameters {
366377
},
367378
subnets: ['subnet-123', 'subnet-456'],
368379
amiIdSsmParameterName: runnerConfig.amiIdSsmParameterName,
380+
tracingEnabled: runnerConfig.tracingEnabled,
369381
};
370382
}
371383

@@ -376,6 +388,7 @@ interface ExpectedFleetRequestValues {
376388
maxSpotPrice?: string;
377389
totalTargetCapacity: number;
378390
imageId?: string;
391+
tracingEnabled?: boolean;
379392
}
380393

381394
function expectedCreateFleetRequest(expectedValues: ExpectedFleetRequestValues): CreateFleetCommandInput {
@@ -385,6 +398,10 @@ function expectedCreateFleetRequest(expectedValues: ExpectedFleetRequestValues):
385398
{ Key: 'ghr:Type', Value: expectedValues.type },
386399
{ Key: 'ghr:Owner', Value: REPO_NAME },
387400
];
401+
if (expectedValues.tracingEnabled) {
402+
const traceId = tracer.getRootXrayTraceId();
403+
tags.push({ Key: 'ghr:trace_id', Value: traceId! });
404+
}
388405
const request: CreateFleetCommandInput = {
389406
LaunchTemplateConfigs: [
390407
{

Diff for: lambdas/functions/control-plane/src/aws/runners.ts

+10-3
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@ import {
99
_InstanceType,
1010
} from '@aws-sdk/client-ec2';
1111
import { createChildLogger } from '@terraform-aws-github-runner/aws-powertools-util';
12+
import { getTracedAWSV3Client, tracer } from '@terraform-aws-github-runner/aws-powertools-util';
1213
import { getParameter } from '@terraform-aws-github-runner/aws-ssm-util';
1314
import moment from 'moment';
1415

@@ -56,7 +57,7 @@ function constructFilters(filters?: Runners.ListRunnerFilters): Ec2Filter[][] {
5657
}
5758

5859
async function getRunners(ec2Filters: Ec2Filter[]): Promise<Runners.RunnerList[]> {
59-
const ec2 = new EC2Client({ region: process.env.AWS_REGION });
60+
const ec2 = getTracedAWSV3Client(new EC2Client({ region: process.env.AWS_REGION }));
6061
const runners: Runners.RunnerList[] = [];
6162
let nextToken;
6263
let hasNext = true;
@@ -93,7 +94,7 @@ function getRunnerInfo(runningInstances: DescribeInstancesResult) {
9394
}
9495

9596
export async function terminateRunner(instanceId: string): Promise<void> {
96-
const ec2 = new EC2Client({ region: process.env.AWS_REGION });
97+
const ec2 = getTracedAWSV3Client(new EC2Client({ region: process.env.AWS_REGION }));
9798
await ec2.send(new TerminateInstancesCommand({ InstanceIds: [instanceId] }));
9899
logger.info(`Runner ${instanceId} has been terminated.`);
99100
}
@@ -126,7 +127,7 @@ export async function createRunner(runnerParameters: Runners.RunnerInputParamete
126127
},
127128
});
128129

129-
const ec2Client = new EC2Client({ region: process.env.AWS_REGION });
130+
const ec2Client = getTracedAWSV3Client(new EC2Client({ region: process.env.AWS_REGION }));
130131

131132
let amiIdOverride = undefined;
132133

@@ -145,13 +146,19 @@ export async function createRunner(runnerParameters: Runners.RunnerInputParamete
145146
}
146147

147148
const numberOfRunners = runnerParameters.numberOfRunners ? runnerParameters.numberOfRunners : 1;
149+
148150
const tags = [
149151
{ Key: 'ghr:Application', Value: 'github-action-runner' },
150152
{ Key: 'ghr:created_by', Value: numberOfRunners === 1 ? 'scale-up-lambda' : 'pool-lambda' },
151153
{ Key: 'ghr:Type', Value: runnerParameters.runnerType },
152154
{ Key: 'ghr:Owner', Value: runnerParameters.runnerOwner },
153155
];
154156

157+
if (runnerParameters.tracingEnabled) {
158+
const traceId = tracer.getRootXrayTraceId();
159+
tags.push({ Key: 'ghr:trace_id', Value: traceId! });
160+
}
161+
155162
let fleet: CreateFleetResult;
156163
try {
157164
// see for spec https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_CreateFleet.html
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,31 @@
1+
import axios, { AxiosResponse } from 'axios';
2+
3+
import { axiosFetch } from './fetch-override';
4+
5+
jest.mock('axios');
6+
type FetchResponse = AxiosResponse & { json: () => string };
7+
8+
describe('axiosFetch', () => {
9+
beforeEach(() => {
10+
jest.clearAllMocks();
11+
});
12+
it('should return a promise that resolves with the response data', async () => {
13+
// Arrange
14+
const url = 'https://example.com';
15+
const options = { body: { foo: 'bar' } };
16+
const responseData = { data: { baz: 'qux' } };
17+
const mockedAxios = axios as unknown as jest.Mock;
18+
mockedAxios.mockResolvedValue(responseData);
19+
20+
// Act
21+
const result = (await axiosFetch(url, options)) as FetchResponse;
22+
23+
// Assert
24+
expect(axios).toHaveBeenCalledWith(url, { ...options, data: options.body });
25+
expect(result).toEqual({
26+
...responseData,
27+
json: expect.any(Function),
28+
});
29+
expect(result.json()).toEqual(responseData.data);
30+
});
31+
});
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
import axios, { AxiosRequestConfig, AxiosResponse } from 'axios';
2+
3+
type FetchResponse = AxiosResponse & { json: () => string };
4+
5+
type FetchOptions = AxiosRequestConfig & { body?: object };
6+
7+
// Fetch is not covered to be traced by xray so we need to override it with axios
8+
// https://github.com/aws/aws-xray-sdk-node/issues/531
9+
export const axiosFetch = async (url: string, options: FetchOptions): Promise<FetchResponse> => {
10+
const response = await axios(url, { ...options, data: options.body });
11+
return new Promise((resolve) => {
12+
resolve({
13+
...response,
14+
json: () => {
15+
return response.data;
16+
},
17+
});
18+
});
19+
};

Diff for: lambdas/functions/control-plane/src/gh-auth/gh-auth.test.ts

+2-2
Original file line numberDiff line numberDiff line change
@@ -95,7 +95,7 @@ ${decryptedValue}`,
9595

9696
// Assert
9797
expect(mockedCreatAppAuth).toBeCalledTimes(1);
98-
expect(mockedCreatAppAuth).toBeCalledWith(authOptions);
98+
expect(mockedCreatAppAuth).toBeCalledWith({ ...authOptions, request: expect.anything() });
9999
});
100100

101101
test('Creates auth object for public GitHub', async () => {
@@ -121,7 +121,7 @@ ${decryptedValue}`,
121121
expect(getParameter).toBeCalledWith(PARAMETER_GITHUB_APP_KEY_BASE64_NAME);
122122

123123
expect(mockedCreatAppAuth).toBeCalledTimes(1);
124-
expect(mockedCreatAppAuth).toBeCalledWith(authOptions);
124+
expect(mockedCreatAppAuth).toBeCalledWith({ ...authOptions, request: expect.anything() });
125125
expect(mockedAuth).toBeCalledWith({ type: authType });
126126
expect(result.token).toBe(token);
127127
});

Diff for: lambdas/functions/control-plane/src/gh-auth/gh-auth.ts

+8-1
Original file line numberDiff line numberDiff line change
@@ -13,11 +13,13 @@ import { Octokit } from '@octokit/rest';
1313
import { createChildLogger } from '@terraform-aws-github-runner/aws-powertools-util';
1414
import { getParameter } from '@terraform-aws-github-runner/aws-ssm-util';
1515

16-
const logger = createChildLogger('gh-auth');
16+
import { axiosFetch } from '../axios/fetch-override';
1717

18+
const logger = createChildLogger('gh-auth');
1819
export async function createOctoClient(token: string, ghesApiUrl = ''): Promise<Octokit> {
1920
const ocktokitOptions: OctokitOptions = {
2021
auth: token,
22+
request: { fetch: axiosFetch },
2123
};
2224
if (ghesApiUrl) {
2325
ocktokitOptions.baseUrl = ghesApiUrl;
@@ -64,7 +66,12 @@ async function createAuth(installationId: number | undefined, ghesApiUrl: string
6466
if (ghesApiUrl) {
6567
authOptions.request = request.defaults({
6668
baseUrl: ghesApiUrl,
69+
request: {
70+
fetch: axiosFetch,
71+
},
6772
});
73+
} else {
74+
authOptions.request = request.defaults({ request: { fetch: axiosFetch } });
6875
}
6976
return createAppAuth(authOptions);
7077
}

Diff for: lambdas/functions/control-plane/src/lambda.test.ts

+10-2
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,8 @@
1-
import { logger } from '@terraform-aws-github-runner/aws-powertools-util';
1+
import { captureLambdaHandler, logger } from '@terraform-aws-github-runner/aws-powertools-util';
22
import { Context, SQSEvent, SQSRecord } from 'aws-lambda';
33
import { mocked } from 'jest-mock';
44

5-
import { adjustPool, scaleDownHandler, scaleUpHandler, ssmHousekeeper } from './lambda';
5+
import { addMiddleware, adjustPool, scaleDownHandler, scaleUpHandler, ssmHousekeeper } from './lambda';
66
import { adjust } from './pool/pool';
77
import ScaleError from './scale-runners/ScaleError';
88
import { scaleDown } from './scale-runners/scale-down';
@@ -161,6 +161,14 @@ describe('Adjust pool.', () => {
161161
});
162162
});
163163

164+
describe('Test middleware', () => {
165+
it('Should have a working middleware', async () => {
166+
const mockedLambdaHandler = captureLambdaHandler as unknown as jest.Mock;
167+
mockedLambdaHandler.mockReturnValue({ before: jest.fn(), after: jest.fn(), onError: jest.fn() });
168+
expect(addMiddleware).not.toThrowError();
169+
});
170+
});
171+
164172
describe('Test ssm housekeeper lambda wrapper.', () => {
165173
it('Invoke without errors.', async () => {
166174
const mock = mocked(cleanSSMTokens);

0 commit comments

Comments
 (0)