[PVC] massive workspace stopping, two workspaces report cannot find workspace from ws-daemon #13856


Closed
Tracked by #7901
jenting opened this issue Oct 14, 2022 · 7 comments
Labels
meta: stale This issue/PR is stale and will be closed soon team: workspace Issue belongs to the Workspace team type: bug Something isn't working

Comments

@jenting
Contributor

jenting commented Oct 14, 2022

Bug description

There is a chance that ws-daemon reports it cannot find the workspace while the workspace pod is stopping:

var workspaceExistsResult *wsdaemon.IsWorkspaceExistsResponse
workspaceExistsResult, err = snc.IsWorkspaceExists(ctx, &wsdaemon.IsWorkspaceExistsRequest{Id: workspaceID})
if err != nil {
	tracing.LogError(span, err)
	return nil, err
}
if !workspaceExistsResult.Exists {
	// nothing to backup, workspace does not exist
	return nil, status.Error(codes.NotFound, "workspace does not exist")
}

This causes the following symptoms:

  • workspace pod is terminating forever
  • PVC object is bound
  • VolumeSnapshot object is ready to use
  • The snapshot exists on GCP snapshots

Steps to reproduce

It's not easy to reproduce. We could not reproduce it by using loadgen to start 100 workspaces and stop them simultaneously, but we did reproduce it when starting 1k workspaces and stopping them simultaneously.

See the comments below for reference.

Workspace affected

No response

Expected behavior

The ws-manager should remove this terminating workspace pod.
The metric workspace_backups_failure_total should not increase; it should stay at 0.
The metric workspace_backups_success_total should increase to 1k.
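As a toy illustration of the expected counter values (the names mirror the Prometheus metrics above, but the struct itself is hypothetical, not Gitpod's instrumentation): after stopping 1k PVC workspaces, every backup should be recorded as a success and none as a failure.

```go
package main

import "fmt"

// backupMetrics is a stand-in for the two counters named in this issue:
// workspace_backups_success_total and workspace_backups_failure_total.
type backupMetrics struct {
	successTotal int
	failureTotal int
}

func (m *backupMetrics) record(ok bool) {
	if ok {
		m.successTotal++
	} else {
		m.failureTotal++
	}
}

func main() {
	var m backupMetrics
	// Expected behavior when stopping 1k workspaces simultaneously:
	// all 1000 backups succeed, zero failures.
	for i := 0; i < 1000; i++ {
		m.record(true)
	}
	fmt.Println(m.successTotal, m.failureTotal) // 1000 0
}
```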

Example repository

No response

Anything else?

#7901

@jenting jenting added the type: bug Something isn't working label Oct 14, 2022
@jenting jenting added the team: workspace Issue belongs to the Workspace team label Oct 14, 2022
@sagor999
Contributor

I think this is probably a ws-daemon bug rather than a PVC issue.
I have a feeling the same issue would occur when stopping 1k regular workspaces.
🤔

@kylos101
Contributor

Hi @jenting, for posterity, are you working on a fix in https://github.com/gitpod-io/gitpod/compare/jenting/pvc-finalizeWorkspaceContent?

Also, does this result in data loss? What intervention is required from workspace team when this happens, to make sure ws-manager triggers the backup, so user snapshot finishes?

@jenting
Contributor Author

jenting commented Oct 17, 2022

Hi @jenting, for posterity, are you working on a fix in https://github.com/gitpod-io/gitpod/compare/jenting/pvc-finalizeWorkspaceContent?

It's worth fixing, but testing requires a lot of time because it only happens when we stop 1k workspaces simultaneously.

I'd suggest we triage this issue as good to fix.

Also, does this result in data loss? What intervention is required from the workspace team when this happens, to make sure ws-manager triggers the backup, so the user snapshot finishes?

No data was lost, but the ws-manager metrics report data loss (a false alarm).

The volume snapshot is ready to use; we just need to ensure the volume snapshot information is written to the DB through ws-manager-bridge so the user can reopen the workspace without data loss.

@jenting
Contributor Author

jenting commented Nov 5, 2022

We updated the runbook for when we encounter these symptoms:

  • workspace pod is terminating forever
  • PVC object is bound
  • VolumeSnapshot object is ready to use

I think this is probably a ws-daemon bug rather than a PVC issue.

I agree with Pavel that it's a bug within ws-daemon.


For me, it's a low priority from the PVC point of view because there is no data loss.
If we have no other outstanding issues, we could schedule a fix.

But I am not sure what happens when backing up to GCS 🤔.
If it happens with GCS backups as well, should we keep putting effort into fixing the GCS issue, @kylos101?
If it does not happen with GCS, we should fix ws-manager.
If it does, we should fix ws-daemon.

@jenting
Contributor Author

jenting commented Nov 5, 2022

@jenting was a snapshot published to GCP, too, or no?

Yes, the GCP snapshot exists.

@jenting you mention No data was lost, but the ws-manager metrics report data loss (false alarm).. What is the expected behavior for the metric?

The ws-manager should remove this terminating workspace pod, and the metric workspace_backups_failure_total should not increase.

@jenting you mention It's good to fix, but testing requires lots of time because it happens when we stop 1k workspaces simultaneously. I will try to run another loadgen to see if I can help recreate. Can you share what the expected behavior was for the workspace?

When stopping 1k workspaces (with PVC) simultaneously, no workspace pod should stay in Terminating for a long time, especially once the volume snapshot is ready to use.

@kylos101
Contributor

kylos101 commented Nov 5, 2022

@jenting 🥳 Fantastic! Can you update this issue's expected behaviors in the description? This way it is up-to-date with our breakdown. Suggestion: also say the expected behavior for workspace_backups_success_total in the expected behavior. Alternatively, if you'd prefer to close this issue, and start a new one, that is fine, too.

@stale

stale bot commented Feb 5, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label Feb 5, 2023
@kylos101 kylos101 closed this as not planned Won't fix, can't repro, duplicate, stale Feb 6, 2023
@github-project-automation github-project-automation bot moved this to Awaiting Deployment in 🌌 Workspace Team Feb 6, 2023