[PVC] orphan PVC left if the ws-manager unable to start workspace pod #13282

Closed
Tracked by #7901
jenting opened this issue Sep 26, 2022 · 4 comments · Fixed by #13429 or #14068
Assignees
Labels
team: workspace Issue belongs to the Workspace team type: bug Something isn't working

Comments

@jenting
Contributor

jenting commented Sep 26, 2022

Bug description

Two scenarios:

  • PVC bound, but workspace pod gone: the orphan PVC object still exists in the cluster even though the workspace pod is gone. The ws-manager logs the following warning:

    "error":"timed out waiting for the condition","instanceId":"ea1e5ba2-54d6-4611-b8a6-ba8e9d791cb7","level":"warning","message":"was unable to start workspace","pod":"ws-ea1e5ba2-54d6-4611-b8a6-ba8e9d791cb7"
  • PVC bound and workspace pod Terminating: the PVC moves from Pending to Bound after one hour because of the GCP limitation, but by then the workspace pod has timed out. The workspace pod is left Terminating while the PVC is Bound.

Steps to reproduce

  • PVC bound, but workspace pod Pending: the PVC object exists even though the workspace pod is gone
    Run the loadgen with 200 workspaces simultaneously (100 regular workspaces + 100 regular workspaces with PVC). After the test finishes, some orphan PVCs are left behind, and the ws-manager does not clean them up. Reference code:
    clog.WithError(err).WithField("req", req).WithField("pod", pod.Name).Warn("was unable to start workspace")
  • PVC pending and workspace pod Pending: the PVC moves from Pending to Bound after one hour, but by then the workspace pod has timed out. The workspace pod is Pending and the PVC is Bound.
    [PVC] orphan PVC left if the ws-manager unable to start workspace pod #13282 (comment)

Workspace affected

No response

Expected behavior

The workspace pod and PVC object should be removed.
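
To make the expected behavior concrete, here is a minimal sketch of how orphan PVCs could be detected. It is not the ws-manager implementation. The function name `findOrphanPVCs` is hypothetical, and the sketch assumes that each PVC shares its name with the workspace pod it was created for, which this issue does not confirm.

```go
package main

import (
	"fmt"
	"sort"
)

// findOrphanPVCs returns the names of PVCs that have no matching workspace pod.
// Assumption for this sketch only: a PVC is named after the pod it belongs to.
func findOrphanPVCs(pvcNames, podNames []string) []string {
	pods := make(map[string]struct{}, len(podNames))
	for _, p := range podNames {
		pods[p] = struct{}{}
	}
	var orphans []string
	for _, pvc := range pvcNames {
		if _, ok := pods[pvc]; !ok {
			orphans = append(orphans, pvc)
		}
	}
	sort.Strings(orphans) // stable order for logging and comparison
	return orphans
}

func main() {
	pvcs := []string{"ws-a", "ws-b", "ws-c"}
	pods := []string{"ws-b"} // ws-a and ws-c failed to start and are gone
	fmt.Println(findOrphanPVCs(pvcs, pods))
}
```

A real garbage collector would list both object kinds from the cluster and would likely also check the age of each PVC, so that a PVC whose pod is about to be created is not treated as an orphan.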

Example repository

No response

Anything else?

#7901

@jenting jenting added the type: bug Something isn't working label Sep 26, 2022
@jenting jenting added the team: workspace Issue belongs to the Workspace team label Sep 26, 2022
@jenting jenting changed the title [PVC] orphan PVC if the ws-manager unable to start workspace [PVC] orphan PVC left if the ws-manager unable to start workspace pod Sep 26, 2022
@jenting jenting self-assigned this Sep 28, 2022
@jenting jenting moved this to In Progress in 🌌 Workspace Team Sep 28, 2022
Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Oct 3, 2022
@kylos101 kylos101 moved this from Awaiting Deployment to In Validation in 🌌 Workspace Team Oct 5, 2022
@jenting jenting reopened this Oct 6, 2022
@jenting jenting moved this from In Validation to Scheduled in 🌌 Workspace Team Oct 7, 2022
@kylos101
Contributor

@jenting can you share why this moved back from In-Validation, and what the plan is for next steps here, in this issue? Also, this seems like it will be a blocker to releasing PVC to a larger group (other than Gitpodders). cc: @sagor999

@jenting
Contributor Author

jenting commented Oct 12, 2022

The original solution has a side effect: the workspace cannot be terminated because we delete the PVC object too early, which leaves the PVC in the Terminating state (thanks to the finalizer). When the snapshot controller then takes a snapshot, it tries to add a finalizer to the PVC object, but this fails because a finalizer cannot be added to an object that is being deleted. So the snapshot fails.

I don't think it's a blocker, because the PVC is never mounted by the workspace pod and the PVC object is in the Pending state. What's left is to garbage collect the Pending PVC objects.
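
The ordering problem described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Kubernetes or ws-manager code: the `object` type, its methods, and the finalizer names are all hypothetical. It only mimics the rule that new finalizers are rejected once an object is being deleted.

```go
package main

import (
	"errors"
	"fmt"
)

// object is a toy stand-in for a Kubernetes object such as a PVC.
type object struct {
	deleting   bool     // true once deletion has been requested
	finalizers []string // placeholder finalizer names, not real ones
}

var errBeingDeleted = errors.New("object is being deleted: no new finalizers can be added")

// addFinalizer rejects new finalizers on an object that is being deleted.
func (o *object) addFinalizer(name string) error {
	if o.deleting {
		return errBeingDeleted
	}
	o.finalizers = append(o.finalizers, name)
	return nil
}

// requestDelete marks the object as being deleted; existing finalizers
// keep it around in a Terminating state.
func (o *object) requestDelete() { o.deleting = true }

func main() {
	// Deleting first: the snapshot step can no longer add its finalizer.
	deletedEarly := &object{finalizers: []string{"example/in-use"}}
	deletedEarly.requestDelete()
	fmt.Println("delete, then snapshot:", deletedEarly.addFinalizer("example/snapshot-source"))

	// Snapshotting first: the finalizer is added, then deletion is requested.
	deletedLate := &object{finalizers: []string{"example/in-use"}}
	fmt.Println("snapshot, then delete:", deletedLate.addFinalizer("example/snapshot-source"))
	deletedLate.requestDelete()
}
```

The first call returns an error and the second returns nil, which is why the PVC must not be deleted before the snapshot has been taken.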

@kylos101
Contributor

I see, thank you, @jenting .

@jenting
Contributor Author

jenting commented Oct 21, 2022

Another scenario: the PVC stays in the Pending state for one hour because of the GCP limitation, and then becomes Bound once the limitation is gone.

However, the workspace pod has timed out by then, which leaves it in the Terminating state while the PVC is Bound.
The ws-manager should garbage collect both the workspace pod and the PVC. In this case, we don't need to take a volume snapshot. Related private gist.
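
The cleanup decision for this scenario can be sketched as follows. This is only an illustration: `cleanupSteps` and the step names are hypothetical. The only ordering taken from this thread is that a snapshot, when one is needed, must happen before the PVC is deleted. The position of the pod deletion is an assumption.

```go
package main

import "fmt"

// cleanupSteps returns the ordered cleanup steps for a workspace.
// volumeEverMounted reports whether the workspace pod ever mounted the PVC;
// if it never did, the volume holds no workspace data worth snapshotting.
func cleanupSteps(volumeEverMounted bool) []string {
	steps := []string{"delete pod"}
	if volumeEverMounted {
		// The snapshot must be taken before the PVC is deleted.
		steps = append(steps, "snapshot volume")
	}
	return append(steps, "delete PVC")
}

func main() {
	// Pod timed out before the PVC became Bound: nothing was ever mounted.
	fmt.Println(cleanupSteps(false))
	// Workspace ran normally and is now stopping.
	fmt.Println(cleanupSteps(true))
}
```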

@jenting jenting moved this from Scheduled to In Progress in 🌌 Workspace Team Oct 21, 2022
Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Nov 1, 2022
@jenting jenting moved this from Awaiting Deployment to In Validation in 🌌 Workspace Team Nov 4, 2022
@jenting jenting moved this from In Validation to Done in 🌌 Workspace Team Nov 4, 2022
Status: Done
2 participants