Failed to download OTS in US cluster (possibly happens for prebuilds, only) #8096

Closed
csweichel opened this issue Feb 8, 2022 · 18 comments · Fixed by #8164
Labels: meta: stale (This issue/PR is stale and will be closed soon) · team: workspace (Issue belongs to the Workspace team) · type: bug (Something isn't working)

Comments

@csweichel
Contributor

csweichel commented Feb 8, 2022

Bug description

Looking at the logs we're seeing a surprising amount of OTS download failures as part of workspace content initialisation: https://console.cloud.google.com/logs/query;query=%22cannot%20download%20OTS%22%0A;timeRange=PT24H;cursorTimestamp=2022-02-08T14:43:32Z?project=workspace-clusters

Each of those failures is likely to yield a failed workspace - at least if the repo was private.
Possible contributing factors:

(edited by Sven)

  • as this seems to happen only in prebuilds (only the US cluster, and lots of these error messages in d_b_prebuild_workspace), it could be that for some reason the time between when the OTS is created and when it gets requested is longer than 30min (the lifetime of a token). Prebuild clusters are sometimes heavily packed, so maybe there is just too much time spent in scaling up etc.
  • we attempt to download the OTS multiple times for some reason. That's most likely a bug in the initializer. Checking the server logs and/or adding metrics would help identify this.

As part of a fix, we should introduce OTS download failure metrics and keep an eye on them.
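
To make that last point concrete, here is a minimal sketch of what such a failure metric could look like, using Prometheus client_golang; the package, metric name, and label are hypothetical and not the actual Gitpod code:

package initializer

import "github.com/prometheus/client_golang/prometheus"

// otsDownloadFailures counts failed one-time secret (OTS) downloads during
// content initialization, labelled by a coarse reason so timeouts and HTTP
// errors can be told apart. Hypothetical example, not Gitpod's code.
var otsDownloadFailures = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "content_init_ots_download_failures_total",
	Help: "Total number of failed OTS downloads during workspace content initialization",
}, []string{"reason"})

func init() {
	prometheus.MustRegister(otsDownloadFailures)
}

// Usage at a failure site: otsDownloadFailures.WithLabelValues("http_error").Inc()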

Steps to reproduce

Check the logs

@csweichel added the type: bug, priority: highest (user impact), team: webapp, and team: workspace labels on Feb 8, 2022
@csweichel changed the title from "Failed to download oTS" to "Failed to download OTS" on Feb 8, 2022
@kylos101
Contributor

kylos101 commented Feb 9, 2022

Also interesting.

@sagor999
Contributor

for i := 0; i < otsDownloadAttempts; i++ {

According to this line, we will attempt to download the OTS at most 10 times, with a one-second sleep between attempts.
This PR adds metrics to track this: #8148
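
For reference, a minimal sketch of the retry behaviour described above (10 attempts, one-second sleep); the function name and HTTP details are hypothetical and only illustrate the linked loop, they are not the actual initializer code:

package initializer

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

const otsDownloadAttempts = 10

// downloadOTSWithRetry fetches the one-time secret, retrying up to
// otsDownloadAttempts times and sleeping one second between attempts.
func downloadOTSWithRetry(ctx context.Context, otsURL string) (string, error) {
	var lastErr error
	for i := 0; i < otsDownloadAttempts; i++ {
		if i > 0 {
			time.Sleep(1 * time.Second)
		}
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, otsURL, nil)
		if err != nil {
			return "", err
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			lastErr = err
			continue
		}
		body, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			lastErr = fmt.Errorf("unexpected status %d", resp.StatusCode)
			continue
		}
		if readErr != nil {
			lastErr = readErr
			continue
		}
		return string(body), nil
	}
	return "", fmt.Errorf("cannot download OTS after %d attempts: %w", otsDownloadAttempts, lastErr)
}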

As for the load balancer question, that is probably for the webapp team? @gitpod-io/engineering-webapp

@geropl
Member

geropl commented Feb 11, 2022

As for the load balancer question, that is probably for the webapp team?

I think we should focus on unifying our DBs into one, so we don't need to expose app cluster identity to users. Especially given how long this has been in prod, and how resource constrained we as a team are (/cc @JanKoehnlein).

As a band-aid, we could extend this timeout to 20s, which should in all but very rare cases be plenty of time to include the db-sync roundtrip.
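
A minimal sketch of that band-aid, assuming the hypothetical downloadOTSWithRetry helper sketched earlier in this thread lives in the same package; the 20s deadline is the value suggested above, not necessarily how the real timeout is wired up:

// fetchOTSWithDeadline bounds the whole retry loop with a 20s deadline so the
// window is large enough to cover a db-sync roundtrip (hypothetical sketch).
func fetchOTSWithDeadline(otsURL string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	return downloadOTSWithRetry(ctx, otsURL)
}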

@kylos101 moved this from Scheduled to In Progress in 🌌 Workspace Team on Feb 12, 2022
Repository owner moved this from In Progress to Done in 🌌 Workspace Team on Feb 14, 2022
@sagor999
Contributor

Reopening as this still occurs:
https://twitter.com/srherobrine23/status/1497265903641116675?s=20&t=Taai1veK0dkGdYsmYqKnvw
WS id: "1af7b6dd-37c0-4c7b-b9e8-a94858e5f339"
Logs:

"cannot download OTS"
"content init failed"
"cannot initialize workspace"
"InitWorkspace failed"

@sagor999 reopened this on Feb 25, 2022
@sagor999 moved this from Done to Scheduled in 🌌 Workspace Team on Feb 25, 2022
@geropl
Member

geropl commented Feb 28, 2022

@sagor999 Would it make sense to sync on this issue to speed up investigation? Would love to understand what you discovered so far.

@sagor999
Contributor

@geropl I have not dug deep into this issue unfortunately, and I think @csweichel has a deeper understanding of the problem.
But from what I understood, we have several possible scenarios:

  1. OTS never gets created
  2. It takes longer than 20 seconds for the OTS to sync between databases, which causes the OTS timeout to trigger.
  3. Maybe we somehow attempt to read the same OTS twice, causing the second read to always fail.

You previously mentioned that 20 seconds should be enough for OTS to sync between DBs. Can you confirm if that is indeed enough time?

@sagor999
Contributor

sagor999 commented Mar 2, 2022

@geropl how does the OTS get synced between clusters? AFAIU, we are currently failing to download the OTS because it was not replicated to the cluster from which we are making the request.
Is there a way to speed up the OTS reconcile process? Or is there some sort of time-wise guarantee on how long it might take to run reconcile?

@geropl
Member

geropl commented Mar 3, 2022

@sagor999 I will allocate some time today to look into this.

@geropl
Member

geropl commented Mar 3, 2022

how does the OTS get synced between clusters?

It's configured here, and synced by db-sync. Looking at prod, this seems to work.

Is there a way to speed up the OTS reconcile process? Or is there some sort of time-wise guarantee on how long it might take to run reconcile?

I'm working on merging the DBs, which will resolve it. But there are only so many hours in the day. 🙃

I had a look at the most recent occurrences in the logs (download into ots.json, extract with cat ots.json | jq '.[] | .jsonPayload.workspaceId' | sort | uniq)[1]:

  • all reported cases are prebuilds (GitHub, but also GitLab)
  • only 1 in 40 prebuilds is marked with "error" despite all the warnings; the error is "failed OTS download"
  • ~50% of these prebuilds are private
  • all start 0.5-2 minutes after they have been created, so the OTS timing out does not seem to be the issue
  • I looked into two in detail:
    • private: true, prebuild marked as "available", no errors (logs): it tries to get the OTS token 240 times, from 11:02:59 to 12:08:05, when it finally succeeds. Judging by the DB, the pod was running only from 11:01:49 to 11:01:50 (👈 suspicious), but does not report any error whatsoever.

      • ws-manager logs indicate the workspace fails immediately (as do the DB timings)
      • looking at the ws-manager-bridge traces, we:
        • see a first event with the error container workspace ran with an error: exit code 1 (which corresponds to the ws-manager logs): link (cmp. status and after fields)
        • but 14s later we see another event that just says "regularly stopped": link (cmp. status and after fields)
    • private: true, prebuild "aborted", shows OTS error (logs): fails after the first 20 tries

I have to quit now, but see the following ToDos:

  • workspace: why do we re-start instances 12 times?
  • workspace: why do we lose the "failed" state? (maybe because of the "restart"? 🤔 )
  • workspace: maybe add metrics/improve logging so it's easier to distinguish between "temporary retry" and "failed OTS download"
  • webapp: investigate the "cannot get OTS" cases, starting with those found in the DB. I expect the re-starts ☝️ to shadow some of these as well

Also, to further reduce noise, I suggest splitting the issue into two: one workspace, one webapp.
@sagor999 @kylos101 WDYT?

[1] DB query:

SELECT ws.type, pws.state, pws.error, ws.context->>'$.repository.private',
       ws.creationTime, wsi.region, wsi.startedTime, wsi.stoppingTime, wsi.status,
       ws.*, wsi.*
  FROM d_b_workspace AS ws
  JOIN d_b_workspace_instance AS wsi
    ON wsi.workspaceId = ws.id
  JOIN d_b_prebuilt_workspace AS pws
    ON pws.buildWorkspaceId = ws.id
 WHERE ws.id IN ( ... );

@sagor999
Contributor

sagor999 commented Mar 3, 2022

Thank you @geropl for digging deeper into this!

workspace: why do we re-start instances 12 times?

I don't think the actual ws instance is restarted 12 times, but it seems like some other process retries content-init 12 times.
@csweichel in case you have some comments regarding this. Going through the code, I couldn't quite figure out where that might be happening though.

workspace: maybe add metrics/improve logging so it's easier to distinguish between "temporary retry" and "failed OTS download"

#8587

workspace: why do we lose the "failed" state? (maybe because of the "restart"? 🤔 )

Where do we lose it? In the logs I see that the workspace did fail, so I assume you are talking about its state in the DB maybe?

Also, to further reduce noise, I suggest splitting the issue into two: one workspace, one webapp.

We could probably keep this one for workspace-related fixes.
And then create a new one for webapp-related work.

@geropl
Member

geropl commented Mar 4, 2022

I don't think the actual ws instance is restarted 12 times, but it seems like some other process retries content-init 12 times.

Where do we lose it? In the logs I see that the workspace did fail, so I assume you are talking about its state in the DB maybe?

Ah, the content-init retry makes sense 💡 - the "re-start" might have been me jumping to false conclusions. But why did it continue after ws-manager reported the workspace to be terminated/stopped? 🤔

Basically the question is: what events were emitted by ws-manager, in what order, to make ws-manager-bridge:

  1. think the workspace stopped 1s after it started, incl. failed state
  2. later receive new events for the same instance that override the failed state
    (both are documented in the traces mentioned above [1, 2])

@sagor999 Could you:

  • pull out the ws-manager trace logs (at first glance they look mangled?)
  • correlate those with the Honeycomb traces (what ws-manager-bridge receives)
  • and find out why either:
    • content initialization is not stopped when the workspace is terminating
    • OR: the workspace is reported as terminating although content initialization is still running
    • where the updates without "failed" state are generated
      (I guess all of this is tightly coupled)

We could probably keep this one for workspace-related fixes.
And then create a new one for webapp-related work.

👍 Here it is: #8595

@geropl removed the team: webapp label on Mar 4, 2022
@geropl
Member

geropl commented Mar 4, 2022

Moved the webapp-parts over to #8595

@csweichel
Contributor Author

Not all workspace status conditions ws-manager reports are stable. E.g. failed can be set in one update, and not in another. That's undesirable and we should consider that a bug, but it is the current behaviour.

@geropl ws-manager emits a version field which can be used to order status updates - we are not making use of that in ws-manager-bridge yet. We'll want to start storing it in the DB to ensure we're not "updating" with old status updates for some reason. This would also be very handy for doing away with the "workspace phase based ordering heuristics".
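
A minimal sketch of version-based ordering, in Go for consistency with the other sketches in this thread (ws-manager-bridge itself is not written in Go, and the type and field names here are hypothetical):

// instanceRecord is a hypothetical stand-in for the persisted instance row.
type instanceRecord struct {
	StatusVersion uint64
	Phase         string
	Failed        bool
}

// applyStatusUpdate drops any update whose version is not newer than what we
// already persisted, so a late or re-delivered event cannot overwrite newer state.
func applyStatusUpdate(stored *instanceRecord, version uint64, phase string, failed bool) bool {
	if version <= stored.StatusVersion {
		return false // stale update, ignore
	}
	stored.StatusVersion = version
	stored.Phase = phase
	stored.Failed = failed
	return true
}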

Re content init retries: that's a bug. The OTS should be downloaded only once per initializer run.

user, pwd, err = downloadOTS(ctx, req.Config.AuthOts)

The download mechanism however may make multiple attempts - maybe that's what we're seeing. We might want to start sending an idempotency token/retry-count header when we try to download the OTS, just to help with debugging.
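
A minimal sketch of tagging each attempt for debugging, assuming the same hypothetical package and imports (context, net/http, strconv) as the retry sketch above; the header name X-OTS-Retry-Attempt is made up for illustration, not an existing Gitpod or standard header:

// newOTSRequest builds the OTS download request and records the current
// attempt number in a header so server-side logs can correlate retries.
func newOTSRequest(ctx context.Context, otsURL string, attempt int) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, otsURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-OTS-Retry-Attempt", strconv.Itoa(attempt))
	return req, nil
}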

@geropl
Member

geropl commented Mar 4, 2022

@csweichel Thx for taking a look.

  • using version definitely makes sense, although I'm pretty sure it would not have helped in this case: there was at least a ~14s gap between the stopped-with-failed and the stopped-without-failed update, so I assume they came with consecutive versions. Adding it to webapp makes sense anyway, though!
  • maybe we should have a fallback to never overwrite failed 😬

Created issue #8596 for this.

@kylos101
Contributor

Hi @geropl, I've removed this issue from team workspace's scheduled work, as well as the highest priority label.

This way @sagor999 can focus on workspace durability, which is going to have a long runway (and I'd argue is a higher priority given this only impacts prebuilds).

This is for two reasons:

  1. You mentioned:

I'm working on merging the DBs, which will resolve it. But there are only so many hours in the day. 🙃

  2. This was created: [bridge] Ensure we do not override "failed" instance states (affects prebuilds!) #8596

That said, to make sure I am not missing something, do you still need support from team workspace on this issue? I ask so that we can plan accordingly. 🙂

@kylos101 moved this to In Progress in 🌌 Workspace Team on Mar 30, 2022
@kylos101 assigned kylos101 and unassigned sagor999 on Mar 30, 2022
@svenefftinge changed the title from "Failed to download OTS" to "Failed to download OTS in US cluster (possibly happens for prebuilds, only)" on Mar 30, 2022
@kylos101 moved this from In Progress to Scheduled in 🌌 Workspace Team on Mar 30, 2022
@kylos101
Contributor

us38xl2 has been deployed to replace us38xl, and should resolve the surge in OTS errors that we saw over the last week.

Moving this back to Scheduled for now. We need to let cluster us38xl2 settle a bit, to observe whether there's still a problem with OTS errors in general (I suspect there is, which is why this issue was originally created).

@kylos101 removed their assignment on Mar 30, 2022
@atduarte
Contributor

This continues to be extremely frequent in us46.

@stale

stale bot commented Sep 24, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the meta: stale label on Sep 24, 2022
The stale bot closed this as completed on Oct 19, 2022