[PVC] massive workspace stopping, two workspaces report cannot find workspace from ws-daemon #13856


Closed
Tracked by #7901
jenting opened this issue Oct 14, 2022 · 7 comments
Labels
meta: stale This issue/PR is stale and will be closed soon team: workspace Issue belongs to the Workspace team type: bug Something isn't working

Comments

@jenting
Contributor

jenting commented Oct 14, 2022

Bug description

There is a chance that ws-daemon reports it cannot find the workspace while the workspace pod is stopping:

var workspaceExistsResult *wsdaemon.IsWorkspaceExistsResponse
workspaceExistsResult, err = snc.IsWorkspaceExists(ctx, &wsdaemon.IsWorkspaceExistsRequest{Id: workspaceID})
if err != nil {
	tracing.LogError(span, err)
	return nil, err
}
if !workspaceExistsResult.Exists {
	// nothing to backup, workspace does not exist
	return nil, status.Error(codes.NotFound, "workspace does not exist")
}

This causes the following symptoms:

  • workspace pod is terminating forever
  • PVC object is bound
  • VolumeSnapshot object is ready to use
  • The snapshot exists on GCP snapshots

Steps to reproduce

It's not easy to reproduce. We could not reproduce it by using loadgen to start 100 workspaces and stop them simultaneously, but we did reproduce it when starting 1k workspaces and stopping them simultaneously.

See the comments below for reference.

Workspace affected

No response

Expected behavior

The ws-manager should remove this terminating workspace pod.
The metric workspace_backups_failure_total should not increase; it should stay at 0.
The metric workspace_backups_success_total should increase to 1k.
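As a toy illustration of the expected counter values (the names mirror the Prometheus metrics above, but the struct itself is hypothetical, not Gitpod's instrumentation): after stopping 1k PVC workspaces, every backup should be recorded as a success and none as a failure.

```go
package main

import "fmt"

// backupMetrics is a stand-in for the two counters named in this issue:
// workspace_backups_success_total and workspace_backups_failure_total.
type backupMetrics struct {
	successTotal int
	failureTotal int
}

func (m *backupMetrics) record(ok bool) {
	if ok {
		m.successTotal++
	} else {
		m.failureTotal++
	}
}

func main() {
	var m backupMetrics
	// Expected behavior when stopping 1k workspaces simultaneously:
	// all 1000 backups succeed, zero failures.
	for i := 0; i < 1000; i++ {
		m.record(true)
	}
	fmt.Println(m.successTotal, m.failureTotal) // 1000 0
}
```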

Example repository

No response

Anything else?

#7901

@jenting jenting added the type: bug Something isn't working label Oct 14, 2022
@jenting jenting added the team: workspace Issue belongs to the Workspace team label Oct 14, 2022
@sagor999
Contributor

I think this is probably a ws-daemon bug rather than a PVC issue.
I have a feeling the same issue would occur when stopping 1k regular workspaces.
🤔

@kylos101
Contributor

Hi @jenting, for posterity, are you working on a fix in https://github.com/gitpod-io/gitpod/compare/jenting/pvc-finalizeWorkspaceContent?

Also, does this result in data loss? What intervention is required from workspace team when this happens, to make sure ws-manager triggers the backup, so user snapshot finishes?

@jenting
Contributor Author

jenting commented Oct 17, 2022

Hi @jenting, for posterity, are you working on a fix in https://github.com/gitpod-io/gitpod/compare/jenting/pvc-finalizeWorkspaceContent?

It's worth fixing, but testing requires a lot of time because it only happens when we stop 1k workspaces simultaneously.

I'd suggest we triage this issue as good to fix.

Also, does this result in data loss? What intervention is required from the workspace team when this happens, to make sure ws-manager triggers the backup, so the user snapshot finishes?

No data was lost, but the ws-manager metrics report data loss (a false alarm).

The volume snapshot is ready to use; we just need to ensure the volume snapshot information is written to the DB through ws-manager-bridge so the user can reopen the workspace without data loss.

@jenting
Contributor Author

jenting commented Nov 5, 2022

We updated the runbook for when we encounter these symptoms:

  • workspace pod is terminating forever
  • PVC object is bound
  • VolumeSnapshot object is ready to use

I think this is probably a ws-daemon bug rather than a PVC issue.

I agree with Pavel that it's a bug within ws-daemon.


For me, it's a low priority from the PVC point of view because there is no data loss.
If we have no other outstanding issues, we could schedule a fix.

But I am not sure what happens when backing up to GCS 🤔.
If it happens with GCS backups as well, should we keep putting effort into fixing the GCS issue, @kylos101?
If it does not happen with GCS, we should fix ws-manager.
If it does, we should fix ws-daemon.

@jenting
Contributor Author

jenting commented Nov 5, 2022

@jenting was a snapshot published to GCP, too, or no?

Yes, the GCP snapshot exists.

@jenting you mention No data was lost, but the ws-manager metrics report data loss (false alarm).. What is the expected behavior for the metric?

The ws-manager should remove this terminating workspace pod, and the metric workspace_backups_failure_total should not increase.

@jenting you mention It's good to fix, but testing requires lots of time because it happens when we stop 1k workspaces simultaneously. I will try to run another loadgen to see if I can help recreate. Can you share what the expected behavior was for the workspace?

When stopping 1k workspaces (with PVC) simultaneously, no workspace pod should stay in Terminating for a long time, especially once the volume snapshot is ready to use.

@kylos101
Contributor

kylos101 commented Nov 5, 2022

@jenting 🥳 Fantastic! Can you update this issue's expected behaviors in the description? This way it is up-to-date with our breakdown. Suggestion: also say the expected behavior for workspace_backups_success_total in the expected behavior. Alternatively, if you'd prefer to close this issue, and start a new one, that is fine, too.

@stale

stale bot commented Feb 5, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label Feb 5, 2023
@kylos101 kylos101 closed this as not planned Won't fix, can't repro, duplicate, stale Feb 6, 2023
@github-project-automation github-project-automation bot moved this to Awaiting Deployment in 🌌 Workspace Team Feb 6, 2023