
Upgrade Self-Hosted Runners to Node20 #2573

Open

AlexandreSinger opened this issue May 31, 2024 · 8 comments

@AlexandreSinger
Contributor

As described in this blog post, GitHub Actions are transitioning from Node16 to Node20:
https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/

All of the CI tests that run on the GitHub-hosted runners have already been moved to Node20 simply by bumping the versions of the actions in PR #2568.

The self-hosted runners could not be upgraded: they produced a warning that the machine did not have Node20 available, only Node16. TODO comments were added to the CI test script noting that the runners must be upgraded before the remaining actions can be.
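
For reference, the remaining upgrade amounts to bumping the major versions of the actions used by the self-hosted jobs. A rough sketch (the actions and versions actually pinned in our workflow files may differ):

    # Sketch only -- the actions and versions pinned in our workflows may differ.
    steps:
      - uses: actions/checkout@v4          # v4 runs on node20; v3 ran on node16
      - uses: actions/upload-artifact@v4   # likewise upgraded from the node16-based v3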

The blog post above makes it clear that the self-hosted runners need to be upgraded to v2.308.0 or later:
[screenshot from the blog post]

Once the self-hosted runners are upgraded, we can upgrade the actions to fully resolve the deprecations.

@AlexandreSinger
Contributor Author

AlexandreSinger commented Jun 11, 2024

Here are the logs from the CI run that failed when I tried to upgrade the actions used on the self-hosted runners:
[screenshot of the failing CI logs]

It looks like the current version of the self-hosted runners is v2.316.1, which should have node20 installed; but for some reason it doesn't, which is quite odd.

The CI run: https://github.com/verilog-to-routing/vtr-verilog-to-routing/actions/runs/9279402994/job/25531941254

@AlexandreSinger
Contributor Author

@vaughnbetz It looks like the version of the self-hosted runners is correct... But there still seems to be something wrong, since they do not support node20 yet. I'll send an update in the email chain.

@AlexandreSinger
Contributor Author

Running into tons of issues with this...

The first thing I tried was running the setup-node action before checkout (https://github.com/actions/setup-node). This yielded the exact same error:
[screenshot of the error]
The issue here is that it is failing BEFORE it even runs any of the steps in the job. This does not make sense: if we were truly using runner version 2.317, it should recognize the node20 parameter. Something is really fishy here. It's almost as if it is on an older version of the runner while reporting that it is on a newer one...
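
Part of why a setup-node step cannot help here: each JavaScript action declares the node version it needs in its own action.yml, and the runner resolves those declarations while setting up the job, before any step executes. Roughly the shape involved (an excerpt-style sketch, not something from our repo):

    # Excerpt-style sketch of the "runs" section of a node20 action's action.yml
    # (actions/checkout@v4 declares something along these lines):
    runs:
      using: node20        # the runner itself must supply this node version
      main: dist/index.js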

I then tried changing us to Ubuntu 24.04, just to see if that would work (since we plan on upgrading to it in the future); but even that produced the same error:
[screenshot of the same error]
I do not know this for a fact, but I assumed Ubuntu 24.04 would have Node20 installed, so this error is now super weird.

I then tried deleting the container line altogether (as was recommended in the VTR industry sync):
[screenshot of the change]
This caused the strangest error: the CI hung on "Setting up VM". I let it hang for around 10 minutes before killing it (it usually takes about 1 minute to set up).

I then tried setting the container to container: ubuntu-24.04 so that it would match the GitHub runners. The CI really did not like this: it got into an infinite loop where it would set up the VM and then immediately tear it down:
[screenshot of the setup/teardown loop]
The errors all look something like this:
[screenshot of the error]
Clearly no container image exists with this name, but looping like this when the image cannot be found is some crazy behaviour.
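
For what it is worth, I think part of the naming mix-up is that ubuntu-24.04 is a runs-on label for a GitHub-hosted VM, while container: expects a Docker image reference such as ubuntu:24.04. A sketch of the distinction (the job name and runner labels are placeholders):

    # Sketch only; the job name and runner labels are placeholders.
    jobs:
      test:
        runs-on: [self-hosted]     # "ubuntu-24.04" would be a runs-on label for a GitHub-hosted VM
        container: ubuntu:24.04    # container: takes a Docker image reference, so the
                                   # non-existent image "ubuntu-24.04" has nothing to pull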

I tried googling around, but no one appears to be running into this exact issue. Everyone else seems to resolve it by upgrading the runner to a more recent version. I am beginning not to trust the version reported in the log, but I am super confused.

One idea I have is that we could use the container image we generate in VTR. That way we know the image has everything we need; however, it creates a situation where our build depends on itself. For example, if the release build is failing, we would have trouble fixing it since we would be relying on the release build's container.

The current working PR for this is #2632.

@vaughnbetz What do you think about this mess? Who originally set up the self-hosted runners, so we know who to talk to about this?

@vaughnbetz
Contributor

Ugh. Thanks for investigating, @AlexandreSinger. Adding @kmurray, @tangxifan, and @jgoeders in case they have any ideas. @kgugala may have set up the original self-hosted runners; Karol, any ideas would be much appreciated!

@jgoeders
Contributor

jgoeders commented Jun 28, 2024

Based on the above, it sounds like we need to get node20 installed in the image before any other actions are run.

I am not sure for a fact, but I assume Ubuntu 24.04 has Node20 installed; so this error is now super weird.

A docker image is typically a completely stripped-down version of the OS, without most of the packages that come with installing Ubuntu on your own machine.

I just tested a bare ubuntu:24.04 docker image and indeed it does not include any node version.

    jgoeders@jg-laptop:~$ docker run -it ubuntu:20.04
    Unable to find image 'ubuntu:20.04' locally
    20.04: Pulling from library/ubuntu
    9ea8908f4765: Pull complete
    Digest: sha256:0b897358ff6624825fb50d20ffb605ab0eaea77ced0adb8c6a4b756513dec6fc
    Status: Downloaded newer image for ubuntu:20.04
    root@5ddf86f379f6:/# node -v
    bash: node: command not found
    root@5ddf86f379f6:/#

I think to resolve this we could either:

  1. Figure out how to specify other options along with the container: key that would allow us to configure the container so that node20 is installed. For example, I asked ChatGPT to give me a Dockerfile that would install node20 on ubuntu:jammy:
    # Use the official Ubuntu Jammy image as a base
    FROM ubuntu:jammy
    
    # Set the environment variable to noninteractive
    ENV DEBIAN_FRONTEND=noninteractive
    
    # Install necessary packages and Node.js 20
    RUN apt-get update && \
        apt-get install -y curl gnupg && \
        curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
        apt-get install -y nodejs && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/*
    
    # Verify installation
    RUN node -v && npm -v
    
    # Set working directory
    WORKDIR /usr/src/app
    
    # Copy application files
    COPY . .
    
    # Specify the command to run the application
    CMD ["node", "app.js"]
    

This page seems to have some documentation about how to configure the container.

  2. Create our own container, hosted on Docker Hub, that is ubuntu:jammy with node20 installed (a sketch of the workflow change is below). You mentioned

One idea I have is we can use the container image that we generate in VTR. That way we know the image has everything we need; however, it leads to an issue where our build depends on itself.
...but we could just create a bare-bones Docker Hub image that is separate from that VTR one.
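
If we go with option 2, the workflow change might look something like this (a sketch only; the image name below is hypothetical and would be whatever bare-bones ubuntu:jammy + node20 image we push to Docker Hub):

    # Sketch for option 2 -- the image name is hypothetical.
    jobs:
      test:
        runs-on: [self-hosted]
        container:
          image: verilogtorouting/jammy-node20:latest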

Hope this helps.

@AlexandreSinger
Contributor Author

AlexandreSinger commented Jul 5, 2024

Update on my most recent findings.

The issue we are facing must be caused by the GitHub runner version. Since the GitHub Actions runner code is open source, I dug into the error message shown in the previous images and was able to find the exact error we are hitting:
https://github.com/actions/runner/blob/v2.317.0/src/Runner.Worker/ActionManifestManager.cs#L493
[screenshot of the error message in the v2.317.0 source]

However, notice that the error message is actually different! In v2.317 (which our self-hosted runners claim to be running), node20 is an acceptable parameter. This error message was changed in v2.308; in v2.307.1 and prior, the message exactly matches the one we are seeing:
https://github.com/actions/runner/blob/v2.307.1/src/Runner.Worker/ActionManifestManager.cs#L503
[screenshot of the error message in the v2.307.1 source]

This implies that, although our self-hosted runners say they are on v2.317 of the runner, they do not actually appear to be running that version.

Our running theory is that the VM itself may be on one version, while the image running on the machine may be on a different one.
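
One way we could test that theory is a one-off debug job with no container: line, so it runs directly on the host and can look at the runner install that is actually executing it. This is only a sketch: the job name is made up, and the paths assume the default self-hosted runner layout where RUNNER_WORKSPACE is <runner-root>/_work/<repo>:

    # Hypothetical debug job; the paths assume the default self-hosted runner directory layout.
    debug-runner:
      runs-on: [self-hosted]
      steps:
        - name: Inspect the runner install executing this job
          run: |
            ls "$RUNNER_WORKSPACE/../.."             # assumed runner install directory
            ls "$RUNNER_WORKSPACE/../../externals"   # a runner that supports node20 ships a node20 folder here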

@AlexandreSinger
Contributor Author

Looking into one of the CI runs, I saw this line being generated by the self-hosted runners:
[screenshot of the log line mentioning n2-highmem-16]

Using the almighty powers of Google, I was able to find that n2-highmem-16 is actually a Google Compute Engine (GCE) machine type on the Google Cloud Platform:
https://gcloud-compute.com/n2-highmem-16.html

Now we at least have a lead on what type of machines we are dealing with here.

Searching a bit more, I found that people do set up GCE to run GitHub CI, and I found a pretty clear article on how this may be configured:
https://medium.com/@vngauv/from-github-to-gce-automate-deployment-with-github-actions-27e89ba6add8

I think VTR must have a Google Cloud project set up somewhere, and that project is where we can get access to the image used by the self-hosted runners (and modify it to fix our issue).

@AlexandreSinger
Contributor Author

@jgoeders Thank you so much for your comment. The issue I am running into is that I have been unable to access the machine to regenerate the image. I think you are correct, though, that the image is the problem.
