[PVC] massive workspace stopping, two workspaces report cannot find workspace from ws-daemon #13856
Comments
I think this is probably a ws-daemon bug rather than a PVC issue.
Hi @jenting, for posterity, are you working on a fix in https://github.com/gitpod-io/gitpod/compare/jenting/pvc-finalizeWorkspaceContent? Also, does this result in data loss? What intervention is required from the workspace team when this happens, to make sure …
It's worth fixing, but testing requires a lot of time because it only happens when we stop 1k workspaces simultaneously. I'd suggest we keep this issue as good-to-fix.
No data was lost, but the ws-manager metrics report data loss (false alarm). The volume snapshot is ready to use; we just need to ensure the volume snapshot information is written to the DB through ws-manager-bridge so the user can reopen the workspace without data loss.
We update the runbook when we encounter the symptom
I agree with Pavel that it's a bug within ws-daemon. For me, it's low priority from PVC's point of view because there is no data loss. But I am not sure what happens when backing up to GCS 🤔.
Yes, the GCP snapshot exists
The ws-manager should remove this terminating workspace pod, and the workspace_backups_failure_total metric should not increase.
There should be no workspace pod stuck in Terminating for a long time once the volume snapshot is ready to use, even when stopping 1k workspaces (with PVC) simultaneously.
@jenting 🥳 Fantastic! Can you update this issue's expected behavior in the description so it stays up to date with our breakdown? Suggestion: also state the expected behavior for …
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Bug description
There is a chance that ws-daemon reports it cannot find the workspace while the workspace pod is stopping.
gitpod/components/ws-manager/pkg/manager/monitor.go, lines 1074 to 1083 at c9528c8
This code path produces the symptom.
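To illustrate the failure mode (this is a minimal sketch, not the actual ws-manager code; all names are hypothetical), the race looks like this: the disposal call to ws-daemon fails with "cannot find workspace" while the pod is stopping, and that error is surfaced as a backup failure even though the PVC volume snapshot already succeeded.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errCannotFindWorkspace stands in for the error ws-daemon returns when asked
// to dispose a workspace it no longer tracks (hypothetical stand-in value).
var errCannotFindWorkspace = errors.New("cannot find workspace")

// finalizeResult is a hypothetical summary of what ws-manager reports for one
// stopping pod; a backup failure drives workspace_backups_failure_total.
type finalizeResult struct {
	BackupFailed  bool
	SnapshotReady bool
}

// finalizeContent sketches the race described in this issue: during a mass
// stop, ws-daemon answers "cannot find workspace", the disposal error is
// treated as a backup failure, and the failure metric increases even though
// the volume snapshot is already ready (a false alarm, no data loss).
func finalizeContent(disposeErr error, snapshotReady bool) finalizeResult {
	if disposeErr != nil && strings.Contains(disposeErr.Error(), "cannot find workspace") {
		return finalizeResult{BackupFailed: true, SnapshotReady: snapshotReady}
	}
	return finalizeResult{BackupFailed: disposeErr != nil, SnapshotReady: snapshotReady}
}

func main() {
	res := finalizeContent(errCannotFindWorkspace, true)
	fmt.Printf("backup failure reported: %v, snapshot ready: %v\n", res.BackupFailed, res.SnapshotReady)
}
```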
Steps to reproduce
It's not easy to reproduce: we could not reproduce it by starting 100 workspaces with loadgen and stopping all 100 simultaneously, but we did reproduce it by starting 1k workspaces with loadgen and stopping all 1k simultaneously (see the sketch below).
In reference to the comments
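For reference, a minimal Go sketch of the kind of mass-stop load that triggers the race; stopWorkspace is a hypothetical stand-in for the actual loadgen / ws-manager stop call, not the real API.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// stopWorkspace is a hypothetical stand-in for the real stop-workspace call
// issued by loadgen; here it only simulates a small amount of work.
func stopWorkspace(ctx context.Context, id string) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(10 * time.Millisecond):
		return nil
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	const workspaces = 1000 // the scale at which the issue reproduces

	var wg sync.WaitGroup
	for i := 0; i < workspaces; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			id := fmt.Sprintf("workspace-%04d", i)
			if err := stopWorkspace(ctx, id); err != nil {
				fmt.Printf("failed to stop %s: %v\n", id, err)
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("requested stop for all workspaces")
}
```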
Workspace affected
No response
Expected behavior
The ws-manager should remove this terminating workspace pod.
The metric workspace_backups_failure_total should not increase; it should stay at 0.
The metric workspace_backups_success_total should increase to 1k (see the verification sketch below).
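A minimal sketch of how these expectations could be checked against a Prometheus instance scraping ws-manager. The Prometheus address is an assumption, and the metric names are taken as written above (any scrape prefix omitted).

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumption: Prometheus is reachable at this address from where this runs.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	for _, query := range []string{
		"sum(workspace_backups_failure_total)", // expected to stay at 0
		"sum(workspace_backups_success_total)", // expected to reach 1k after the run
	} {
		result, warnings, err := promAPI.Query(ctx, query, time.Now())
		if err != nil {
			panic(err)
		}
		if len(warnings) > 0 {
			fmt.Println("warnings:", warnings)
		}
		fmt.Printf("%s => %v\n", query, result)
	}
}
```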
Example repository
No response
Anything else?
#7901