Opening old workspaces stuck on pulling container image #8198

Closed · Tracked by #7901
adg25 opened this issue Feb 14, 2022 · 18 comments
Labels
team: workspace (Issue belongs to the Workspace team) · type: bug (Something isn't working)

Comments

adg25 commented Feb 14, 2022

Bug description

Recently (the last 2 weeks, maybe more) members of my team have not been able to open previously closed workspaces. Opening new workspaces with the GitLab browser extension works just fine, but re-opening existing ones gets stuck on the "Pulling container image" step. I don't think there's anything out of the ordinary about our setup: we have a .gitpod.yml which specifies pulling the latest image from our Docker registry hosted in GitLab. Some members of the team don't have issues with this, some do, and some are on and off.

Steps to reproduce

Open a new Gitpod workspace from a GitLab merge request, close the workspace, and try to open it again. It gets hung on the "Pulling container image" step.

Workspace affected

All workspaces

Expected behavior

I expect the workspace to pull the container image and start properly

Example repository

.gitpod.yml file content for image:
image: registry.gitlab.com/<my_company>/path/to/hosted_images:latest

Anything else?

No response

pawlean added the type: bug (Something isn't working) and team: workspace (Issue belongs to the Workspace team) labels on Feb 14, 2022
@axonasif (Member)

@adg25 I think it would be helpful for our workspace team if you could share the affected workspace ID too!

adg25 commented Feb 15, 2022

In classic bug-ticket fashion, all of my workspaces are now unaffected by this issue! But another member of my team is experiencing it; his workspace is https://belvederetradin-tribe-y684d5w18ri.ws-us31.gitpod.io/

Edit: as of 1:14pm it's currently happening to https://belvederetradin-tribe-tke67qfj000.ws-us31.gitpod.io/

adg25 commented Feb 17, 2022

I tried opening a workspace I was editing yesterday and I got this error message:

Timed Out
last backup failed: cannot delete workspace from store.

Workspace is https://belvederetradin-tribe-3rrw2k0grv6.ws-us32.gitpod.io/

adg25 commented Feb 23, 2022

@axonasif Any update on this?

pawlean commented Feb 24, 2022

Might be one for @kylos101 - copying him in.

adg25 commented Feb 24, 2022

Thanks! We may actually have some idea of what's going on here. We are using Bazel as the build system for our monorepo and modified our .bazelrc so that

startup --output_base /workspace/bazel-cache

This means the Bazel build cache lives in /workspace instead of the default /home directory. We did this because the cache would take up more than 5 GiB of space and crash all of our workspaces, and a recommendation from the Gitpod Discord was to utilize the 30 GB we get in /workspace. However, we think that when workspaces stop with a built cache, the images are too large to be pulled again (and subsequently time out). I've confirmed that if I execute a bazel clean before stopping a workspace, then it is able to open again just fine. Here's some output from df before and after a bazel clean:

gitpod /workspace/monorepo $ df -h /workspace
Filesystem      Size  Used Avail Use% Mounted on
/dev/md42        30G   12G   19G  37% /workspace
gitpod /workspace/monorepo $ bazel clean
(22:17:58) INFO: Invocation ID: 26b2127d-e11d-4550-a638-283a5971fa03
(22:17:58) INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
gitpod /workspace/tribe $ df -h /workspace
Filesystem      Size  Used Avail Use% Mounted on
/dev/md42        30G  3.5G   27G  12% /workspace
gitpod /workspace/monorepo $

@kylos101 (Contributor)

Thanks for the heads up, @adg25!

How large of a /workspace were you having trouble with?

adg25 commented Feb 25, 2022

@kylos101 Certainly the 12G shown there is a problem for re-pulling the container image after the workspace is stopped, while the 3.5G is good enough (it takes less than a minute, usually). I don't really have a metric for "at X size we start having these issues", but there's rarely an in-between for devs anyway, since we usually operate with either a full Bazel cache or an empty one.

kylos101 moved this to Scheduled in 🌌 Workspace Team on Feb 25, 2022
kylos101 commented Feb 25, 2022

Thanks for the heads up, @adg25! I've scheduled this for our team to do some research; assuming we can recreate the problem, that should give us some better insight. Have a nice weekend!

While I'm thinking about it, if you're able, could you share a simple repo where you're able to reproduce the problem? It's okay to say no, just figured I would ask.

adg25 commented Feb 28, 2022

Thanks @kylos101. Unfortunately I can't share the repo with you since it's private, but I can share some relevant metrics. We currently have 165 Bazel BUILD files that include java_*, cc_*, py_*, and container_* (library/binary/image/push) rules, and about 47k lines of code in total.

@kylos101 (Contributor)

Thanks, @adg25, that should help us recreate a similarly sized /workspace.

@sagor999 (Contributor)

I looked at the logs and found an interesting data point regarding "cannot delete workspace from store":

cannot delete session:
    github.com/gitpod-io/gitpod/ws-daemon/pkg/internal/session.(*Store).Delete
        github.com/gitpod-io/gitpod/ws-daemon/pkg/internal/session/store.go:143
  - cannot remove workspace:
    github.com/gitpod-io/gitpod/ws-daemon/pkg/internal/session.(*Workspace).Dispose
        github.com/gitpod-io/gitpod/ws-daemon/pkg/internal/session/workspace.go:222
  - unlinkat /mnt/workingarea/e650cb4f-46a7-4549-87d4-16488c24d76d/bliss/node_modules: directory not empty

Internal link for logs: https://cloudlogging.app.goo.gl/nH3fexYGbsGBcAkj7
My speculation is that while we are trying to remove the files, some node process keeps writing files into that directory.
@csweichel Can you confirm whether we completely shut down all processes before attempting to delete the session? This currently causes the workspace to fail (and it stays stuck in the terminating state until it fails on timeout).
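
To illustrate the suspected race outside of ws-daemon, here is a minimal, hypothetical Go sketch (not our disposal code): one goroutine keeps creating files in a directory while os.RemoveAll runs, which can make the final directory removal fail with exactly this kind of "unlinkat ...: directory not empty" error; stopping the writer first makes the delete succeed.

// race_repro.go: hypothetical repro of the suspected race, not ws-daemon code.
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "sync"
)

func main() {
    dir, err := os.MkdirTemp("", "workspace-content-")
    if err != nil {
        panic(err)
    }

    stop := make(chan struct{})
    var wg sync.WaitGroup
    wg.Add(1)

    // Stands in for a build/node process that keeps churning out files
    // while the workspace content is being disposed.
    go func() {
        defer wg.Done()
        for i := 0; ; i++ {
            select {
            case <-stop:
                return
            default:
            }
            _ = os.WriteFile(filepath.Join(dir, fmt.Sprintf("chunk-%d", i)), []byte("x"), 0o644)
        }
    }()

    // Racy disposal: the writer is still alive, so this can fail with
    // an error like "unlinkat ...: directory not empty".
    if err := os.RemoveAll(dir); err != nil {
        fmt.Println("racy delete failed:", err)
    }

    // Safe disposal: stop (and wait for) all writers first, then delete.
    close(stop)
    wg.Wait()
    if err := os.RemoveAll(dir); err != nil {
        fmt.Println("clean delete failed:", err)
    } else {
        fmt.Println("delete succeeded once the writer was stopped")
    }
}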

adg25 commented Mar 2, 2022

@sagor999 I don't have access to those logs, but that would make sense as a theory

sagor999 commented Mar 3, 2022

@aledbf when we start the backup process, can you confirm that all user processes in the workspace have been stopped? I think the error above happens because some build process is still churning out files while we are trying to back up and clean up.

adg25 commented Mar 10, 2022

@sagor999 @kylos101 This has been open for a few weeks now and is still an active issue for us. We have the workaround I mentioned, running a bazel clean before closing the workspace, but that's not a good long-term solution.

@sagor999 (Contributor)

@adg25 Sorry for the late response, I was on vacation.
We are going to be redoing part of the backup/storage handling for workspaces, and as part of that work this should also be resolved.
But it will take at least 1-2 months before it is ready for production.
You can track progress here:
#7901

@kylos101 (Contributor)

Thanks @sagor999. For now, I've removed this from our project and scheduled groundwork, as #7901 should resolve it. If it does not, we can always add it back.

@sagor999 (Contributor)

@adg25 I think this has actually been fixed now with #9803.
Let me know if you still encounter it. Old workspaces from before this fix will not work, but workspaces created after it should be working now.
I will close this, but feel free to re-open if you are still experiencing it.
