Failed to download OTS in US cluster (possibly happens for prebuilds, only) #8096

Closed
csweichel opened this issue Feb 8, 2022 · 18 comments · Fixed by #8164
Labels: meta: stale (This issue/PR is stale and will be closed soon) · team: workspace (Issue belongs to the Workspace team) · type: bug (Something isn't working)

Comments

@csweichel
Contributor

csweichel commented Feb 8, 2022

Bug description

Looking at the logs we're seeing a surprising amount of OTS download failures as part of workspace content initialisation: https://console.cloud.google.com/logs/query;query=%22cannot%20download%20OTS%22%0A;timeRange=PT24H;cursorTimestamp=2022-02-08T14:43:32Z?project=workspace-clusters

Each of those failures is likely to yield a failed workspace - at least if the repo was private.
Possible contributing factors:

(edited by Sven)

  • as this seems to happen only in prebuilds (only the US cluster, and lots of these error messages in d_b_prebuild_workspace), it could be that for some reason the time between when the OTS is created and when it gets requested is longer than 30min (the lifetime of a token). Prebuild clusters are sometimes heavily packed, so maybe there is just too much time spent in scaling up etc.
  • we attempt to download the OTS multiple times for some reason. That's most likely a bug in the initializer. Checking the server logs and/or adding metrics would help identify this.

As part of a fix, we should introduce OTS download failure metrics and keep an eye on them.
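
To make that last point concrete, here is a minimal sketch of what such a failure metric could look like, using Prometheus client_golang; the package, metric name, and label are hypothetical and not the actual Gitpod code:

package initializer

import "github.com/prometheus/client_golang/prometheus"

// otsDownloadFailures counts failed one-time secret (OTS) downloads during
// content initialization, labelled by a coarse reason so timeouts and HTTP
// errors can be told apart. Hypothetical example, not Gitpod's code.
var otsDownloadFailures = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "content_init_ots_download_failures_total",
	Help: "Total number of failed OTS downloads during workspace content initialization",
}, []string{"reason"})

func init() {
	prometheus.MustRegister(otsDownloadFailures)
}

// Usage at a failure site: otsDownloadFailures.WithLabelValues("http_error").Inc()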

Steps to reproduce

Check the logs

@csweichel added the type: bug, priority: highest (user impact), team: webapp, and team: workspace labels on Feb 8, 2022
@csweichel changed the title from "Failed to download oTS" to "Failed to download OTS" on Feb 8, 2022
@kylos101
Contributor

kylos101 commented Feb 9, 2022

Also interesting.

@sagor999
Contributor

for i := 0; i < otsDownloadAttempts; i++ {

According to this line, we will attempt to download the OTS at most 10 times, with a one-second sleep between attempts.
This PR adds metrics to track this: #8148
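
For reference, a minimal sketch of the retry behaviour described above (10 attempts, one-second sleep); the function name and HTTP details are hypothetical and only illustrate the linked loop, they are not the actual initializer code:

package initializer

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

const otsDownloadAttempts = 10

// downloadOTSWithRetry fetches the one-time secret, retrying up to
// otsDownloadAttempts times and sleeping one second between attempts.
func downloadOTSWithRetry(ctx context.Context, otsURL string) (string, error) {
	var lastErr error
	for i := 0; i < otsDownloadAttempts; i++ {
		if i > 0 {
			time.Sleep(1 * time.Second)
		}
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, otsURL, nil)
		if err != nil {
			return "", err
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			lastErr = err
			continue
		}
		body, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			lastErr = fmt.Errorf("unexpected status %d", resp.StatusCode)
			continue
		}
		if readErr != nil {
			lastErr = readErr
			continue
		}
		return string(body), nil
	}
	return "", fmt.Errorf("cannot download OTS after %d attempts: %w", otsDownloadAttempts, lastErr)
}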

As for the load balancer question, that is probably for the webapp team? @gitpod-io/engineering-webapp

@geropl
Member

geropl commented Feb 11, 2022

As for the load balancer question, that is probably for the webapp team?

I think we should focus on unifying our DBs into one, so we don't need to expose app cluster identity to users. Especially given how long this has been in prod, and how resource constrained we as a team are (/cc @JanKoehnlein).

As a band-aid, we could extend this timeout to 20s, which should in all but very rare cases be plenty of time to include the db-sync roundtrip.
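
A minimal sketch of that band-aid, assuming the hypothetical downloadOTSWithRetry helper sketched earlier in this thread lives in the same package; the 20s deadline is the value suggested above, not necessarily how the real timeout is wired up:

// fetchOTSWithDeadline bounds the whole retry loop with a 20s deadline so the
// window is large enough to cover a db-sync roundtrip (hypothetical sketch).
func fetchOTSWithDeadline(otsURL string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	return downloadOTSWithRetry(ctx, otsURL)
}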

@kylos101 moved this from Scheduled to In Progress in 🌌 Workspace Team on Feb 12, 2022
Repository owner moved this from In Progress to Done in 🌌 Workspace Team on Feb 14, 2022
@sagor999
Contributor

Reopening as this still occurs:
https://twitter.com/srherobrine23/status/1497265903641116675?s=20&t=Taai1veK0dkGdYsmYqKnvw
WS id: "1af7b6dd-37c0-4c7b-b9e8-a94858e5f339"
Logs:

"cannot download OTS"
"content init failed"
"cannot initialize workspace"
"InitWorkspace failed"

@sagor999 reopened this on Feb 25, 2022
@sagor999 moved this from Done to Scheduled in 🌌 Workspace Team on Feb 25, 2022
@geropl
Member

geropl commented Feb 28, 2022

@sagor999 Would it make sense to sync on this issue to speed up investigation? Would love to understand what you discovered so far.

@sagor999
Contributor

@geropl I have not dug deep into this issue unfortunately, and I think @csweichel has a deeper understanding of the problem.
But from what I understood, we have several possible scenarios:

  1. OTS never gets created
  2. It takes longer than 20 seconds for the OTS to sync between databases, which causes the OTS timeout to trigger.
  3. Maybe we somehow attempt to read the same OTS twice, causing the second read to always fail.

You previously mentioned that 20 seconds should be enough for OTS to sync between DBs. Can you confirm if that is indeed enough time?

@sagor999
Contributor

sagor999 commented Mar 2, 2022

@geropl how does the OTS get synced between clusters? AFAIU, we are currently failing to download the OTS because it was not replicated to the cluster from which we are making the request.
Is there a way to speed up the OTS reconcile process? Or is there some sort of time-wise guarantee on how long it might take to run reconcile?

@geropl
Member

geropl commented Mar 3, 2022

@sagor999 I will allocate some time today to look into this.

@geropl
Member

geropl commented Mar 3, 2022

how does the OTS get synced between clusters?

It's configured here, and synced by db-sync. Looking at prod, this seems to work.

Is there a way to speed up the OTS reconcile process? Or is there some sort of time-wise guarantee on how long it might take to run reconcile?

I'm working on merging the DBs, which will resolve it. But there are only so many hours in the day. 🙃

I had a look at the most recent occurrences in the logs (download into ots.json, extract with cat ots.json | jq '.[] | .jsonPayload.workspaceId' | sort | uniq)[1]:

  • all reported cases are prebuilds (GitHub, but also GitLab)
  • only 1 in 40 prebuilds is marked with "error" despite all the warnings; the error is "failed OTS download"
  • ~50% of these prebuilds are private
  • all start 0.5-2 minutes after they have been created, so the OTS timing out does not seem to be the issue
  • I looked into two in detail:
    • private: true, prebuild marked as "available", no errors (logs): it tries to get the OTS token 240 times, from 11:02:59 to 12:08:05, when it finally succeeds. Judging by the DB, the pod was running only from 11:01:49 to 11:01:50 (👈 suspicious), but does not report any error whatsoever.

      • ws-manager logs indicate the workspace fails immediately (as do the DB timings)
      • looking at the ws-manager-bridge traces, we:
        • see a first event with the error container workspace ran with an error: exit code 1 (which corresponds to the ws-manager logs): link (cmp. status and after fields)
        • but 14s later we see another event that just says "regularly stopped": link (cmp. status and after fields)
    • private: true, prebuild "aborted", shows OTS error (logs): fails after the first 20 tries

I have to quit now, but see the following ToDos:

  • workspace: why do we re-start instances 12 times?
  • workspace: why do we lose the "failed" state? (maybe because of the "restart"? 🤔 )
  • workspace: maybe add metrics/improve logging so it's easier to distinguish between "temporary retry" and "failed OTS download"
  • webapp: investigate the "cannot get OTS" cases, starting with those found in the DB. I expect the re-starts ☝️ to shadow some of these as well

Also, to further reduce noise, I suggest splitting the issue into two: one workspace, one webapp.
@sagor999 @kylos101 WDYT?

[1] DB query:

SELECT ws.type, pws.state, pws.error, ws.context->>'$.repository.private',
       ws.creationTime, wsi.region, wsi.startedTime, wsi.stoppingTime, wsi.status,
       ws.*, wsi.*
  FROM d_b_workspace AS ws
  JOIN d_b_workspace_instance AS wsi
    ON wsi.workspaceId = ws.id
  JOIN d_b_prebuilt_workspace AS pws
    ON pws.buildWorkspaceId = ws.id
 WHERE ws.id IN ( ... );

@sagor999
Contributor

sagor999 commented Mar 3, 2022

Thank you @geropl for digging deeper into this!

workspace: why do we re-start instances 12 times?

I don't think the actual ws instance is restarted 12 times, but it seems like some other process retries content-init 12 times.
@csweichel in case you have some comments regarding this. Going through the code, I couldn't quite figure out where that might be happening though.

workspace: maybe add metrics/improve logging so it's easier to distinguish between "temporary retry" and "failed OTS download"

#8587

workspace: why do we lose the "failed" state? (maybe because of the "restart"? 🤔 )

Where do we lose it? In the logs I see that the workspace did fail, so I assume you are talking about its state in the DB maybe?

Also, to further reduce noise, I suggest splitting the issue into two: one workspace, one webapp.

We could probably keep this one for workspace-related fixes.
And then create a new one for webapp-related work.

@geropl
Member

geropl commented Mar 4, 2022

I don't think the actual ws instance is restarted 12 times, but it seems like some other process retries content-init 12 times.

Where do we lose it? In the logs I see that the workspace did fail, so I assume you are talking about its state in the DB maybe?

Ah, the content-init retry makes sense 💡 - the "re-start" might have been me jumping to false conclusions. But why did it continue after ws-manager reported the workspace to be terminated/stopped? 🤔

Basically the question is: what events were emitted by ws-manager, in what order, to make ws-manager-bridge:

  1. think the workspace stopped 1s after it started, incl. failed state
  2. later receive new events for the same instance that override the failed state
    (both are documented in the traces mentioned above [1, 2])

@sagor999 Could you:

  • pull out the ws-manager trace logs (at first glance they look mangled?)
  • correlate those with the Honeycomb traces (what ws-manager-bridge receives)
  • and find out why either:
    • content initialization is not stopped when the workspace is terminating
    • OR: the workspace is reported as terminating although content initialization is still running
    • where the updates without "failed" state are generated
      (I guess all of this is tightly coupled)

We could probably keep this one for workspace-related fixes.
And then create a new one for webapp-related work.

👍 Here it is: #8595

@geropl removed the team: webapp label on Mar 4, 2022
@geropl
Member

geropl commented Mar 4, 2022

Moved the webapp-parts over to #8595

@csweichel
Contributor Author

Not all workspace status conditions ws-manager reports are stable. E.g. failed can be set in one update, and not in another. That's undesirable and we should consider that a bug, but it is the current behaviour.

@geropl ws-manager emits a version field which can be used to order status updates - we are not making use of that in ws-manager-bridge yet. We'll want to start storing it in the DB to ensure we're not "updating" with old status updates for some reason. This would also be very handy for doing away with the "workspace phase based ordering heuristics".
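
A minimal sketch of version-based ordering, in Go for consistency with the other sketches in this thread (ws-manager-bridge itself is not written in Go, and the type and field names here are hypothetical):

// instanceRecord is a hypothetical stand-in for the persisted instance row.
type instanceRecord struct {
	StatusVersion uint64
	Phase         string
	Failed        bool
}

// applyStatusUpdate drops any update whose version is not newer than what we
// already persisted, so a late or re-delivered event cannot overwrite newer state.
func applyStatusUpdate(stored *instanceRecord, version uint64, phase string, failed bool) bool {
	if version <= stored.StatusVersion {
		return false // stale update, ignore
	}
	stored.StatusVersion = version
	stored.Phase = phase
	stored.Failed = failed
	return true
}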

Re content init retries: that's a bug. The OTS should be downloaded only once per initializer run.

user, pwd, err = downloadOTS(ctx, req.Config.AuthOts)

The download mechanism however may make multiple attempts - maybe that's what we're seeing. We might want to start sending an idempotency token/retry-count header when we try to download the OTS, just to help with debugging.
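
A minimal sketch of tagging each attempt for debugging, assuming the same hypothetical package and imports (context, net/http, strconv) as the retry sketch above; the header name X-OTS-Retry-Attempt is made up for illustration, not an existing Gitpod or standard header:

// newOTSRequest builds the OTS download request and records the current
// attempt number in a header so server-side logs can correlate retries.
func newOTSRequest(ctx context.Context, otsURL string, attempt int) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, otsURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-OTS-Retry-Attempt", strconv.Itoa(attempt))
	return req, nil
}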

@geropl
Member

geropl commented Mar 4, 2022

@csweichel Thx for taking a look.

  • using version definitely makes sense, although I'm pretty sure it would not have helped in this case: there was at least a ~14s gap between the stopped-with-failed and the stopped-without-failed update, so I assume they came with consecutive versions. Adding it to webapp makes sense anyway, though!
  • maybe we should have a fallback to never overwrite failed 😬

Created issue #8596 for this.

@kylos101
Contributor

Hi @geropl, I've removed this issue from team workspace's scheduled work, as well as the highest priority label.

This way @sagor999 can focus on workspace durability, which is going to have a long runway (and I'd argue is a higher priority given this only impacts prebuilds).

This is for two reasons:

  1. You mentioned:

I'm working on merging the DBs, which will resolve it. But there are only so many hours in the day. 🙃

  2. This was created: [bridge] Ensure we do not override "failed" instance states (affects prebuilds!) #8596

That said, to make sure I am not missing something, do you still need support from team workspace on this issue? I ask so that we can plan accordingly. 🙂

@kylos101 moved this to In Progress in 🌌 Workspace Team on Mar 30, 2022
@kylos101 assigned kylos101 and unassigned sagor999 on Mar 30, 2022
@svenefftinge changed the title from "Failed to download OTS" to "Failed to download OTS in US cluster (possibly happens for prebuilds, only)" on Mar 30, 2022
@kylos101 moved this from In Progress to Scheduled in 🌌 Workspace Team on Mar 30, 2022
@kylos101
Contributor

us38xl2 has been deployed to replace us38xl, and should resolve the surge in OTS errors that we saw over the last week.

Moving this back to Scheduled for now. We need to let cluster us38xl2 settle a bit, to observe whether there's still a problem with OTS errors in general (I suspect there is, which is why this issue was originally created).

@kylos101 removed their assignment on Mar 30, 2022
@atduarte
Contributor

This continues to be extremely frequent in us46.

@stale

stale bot commented Sep 24, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The stale bot added the meta: stale label on Sep 24, 2022
The stale bot closed this as completed on Oct 19, 2022