Failed to download OTS in US cluster (possibly happens for prebuilds, only) #8096
Comments
According to this line we will attempt to download the OTS at most 10 times, with a 1-second sleep between attempts. This PR adds metrics to track this: #8148
As for the load balancer question, that is probably one for the webapp team? @gitpod-io/engineering-webapp
I think we should focus on unifying our DBs into one, so we don't need to expose app cluster identity to users, especially given how long this has been in prod and how resource constrained we as a team are (/cc @JanKoehnlein). As a bandaid, we could extend this timeout to 20s, which should in all but very rare cases be plenty of time to include the db-sync roundtrip.
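For reference, the retry behaviour described above boils down to a bounded loop like the following. This is a minimal sketch, not the actual content-service code; the function name, attempt count, and sleep interval are stand-ins for what the linked line configures:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// downloadOTSWithRetry sketches the bounded retry described above: at most
// maxAttempts download attempts with a fixed sleep in between. The download
// function is passed in; names and constants are illustrative only.
func downloadOTSWithRetry(ctx context.Context, download func(context.Context) error) error {
	const (
		maxAttempts = 10
		sleep       = 1 * time.Second
	)

	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = download(ctx); lastErr == nil {
			return nil
		}
		// Sleep between attempts, but bail out early if the context is cancelled.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(sleep):
		}
	}
	return fmt.Errorf("cannot download OTS after %d attempts: %w", maxAttempts, lastErr)
}
```

Extending the overall window to roughly 20s, as suggested below, would simply mean raising `maxAttempts` or `sleep` in a loop like this.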
Reopening as this still occurs:
@sagor999 Would it make sense to sync on this issue to speed up the investigation? Would love to understand what you've discovered so far.
@geropl I have not dug deep into this issue, unfortunately, and I think @csweichel has a deeper understanding of the problem.
You previously mentioned that 20 seconds should be enough for the OTS to sync between DBs. Can you confirm that this is indeed enough time?
@geropl how does the OTS get synced between clusters? AFAIU, we are currently failing to download the OTS because it was not replicated to the cluster from which we are making the request.
@sagor999 I will allocate some time today to look into this. |
It's configured here, and synced by db-sync. Looking at prod, this seems to work.
I'm working on merging the DBs, which will resolve this. But there are only so many hours in the day. 🙃 I had a look at the most recent occurrences in the logs (download into ots.json, extract with
I have to quit now, but see the following ToDos:
Also, to further reduce noise, I suggest splitting the issue into two: one workspace, one webapp.

[1] DB query:
Thank you @geropl for digging deeper into this!
I don't think the actual ws instance is restarted 12 times; it seems some other process retries content-init 12 times.
Where do we lose it? In the logs I see that the workspace did fail, so I assume you are talking about its state in the DB?
We could probably keep this issue for the workspace-related fixes.
Ah, the content-init retry makes sense 💡 - the "re-start" might have been me jumping to false conclusions. But why did it continue after
Basically the question is: What events were emitted by
@sagor999 Could you:
👍 Here it is: #8595
Moved the webapp parts over to #8595
Not all workspace status conditions ws-manager reports are stable. E.g.
@geropl ws-manager emits a version field which can be used to order status updates - we are not making use of that in ws-manager-bridge yet. We'll want to start storing that in the DB to ensure we're not "updating" with old status updates for some reason. This would also be very handy to do away with the "workspace phase based ordering heuristics".
Re content-init retries: that's a bug. The OTS
The download mechanism however may make multiple attempts - maybe that's what we're seeing. We might want to start sending an idempotency token / retry-count header when we try to download the OTS, just to help with debugging.
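To illustrate the retry-count header idea, a sketch of how each download attempt could be tagged. The header name `X-OTS-Download-Attempt` is invented for this example, not an existing Gitpod header, and `fetchOTS` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchOTS downloads the one-time secret and tags the request with the
// current attempt number so the server side can correlate retries in logs.
// The header name is hypothetical, purely for illustration.
func fetchOTS(client *http.Client, url string, attempt int) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-OTS-Download-Attempt", fmt.Sprintf("%d", attempt))

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d on attempt %d", resp.StatusCode, attempt)
	}
	return io.ReadAll(resp.Body)
}
```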
@csweichel Thx for taking a look.
Created issue #8596 for this. |
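For context on the version-based ordering mentioned above, the guard could look roughly like this. It is only a sketch: the table and column names (`d_b_workspace_instance`, `statusVersion`) are assumptions, and ws-manager-bridge would implement this in its own stack:

```go
package main

import "database/sql"

// applyStatusUpdate persists a ws-manager status update only if its version
// is newer than what is already stored, turning out-of-order updates into
// no-ops. Table and column names are assumptions for this sketch.
func applyStatusUpdate(db *sql.DB, instanceID string, version uint64, status []byte) (applied bool, err error) {
	res, err := db.Exec(
		`UPDATE d_b_workspace_instance
		    SET status = ?, statusVersion = ?
		  WHERE id = ? AND (statusVersion IS NULL OR statusVersion < ?)`,
		status, version, instanceID, version,
	)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return false, err
	}
	// n == 0 means an older (or duplicate) update arrived and was ignored.
	return n > 0, nil
}
```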
Hi @geropl, I've removed this issue from team workspace's scheduled work, as well as the highest-priority label. This way @sagor999 can focus on workspace durability, which is going to have a long runway (and I'd argue is a higher priority, given this only impacts prebuilds). This is for two reasons:
That said, to make sure I am not missing something, do you still need support from team workspace on this issue? I ask so that we can plan accordingly. 🙂
Moving this back to scheduled for now. We need to let cluster
This continues to be extremely frequent in us46.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Bug description
Looking at the logs, we're seeing a surprising number of OTS download failures as part of workspace content initialisation: https://console.cloud.google.com/logs/query;query=%22cannot%20download%20OTS%22%0A;timeRange=PT24H;cursorTimestamp=2022-02-08T14:43:32Z?project=workspace-clusters
Each of those failures is likely to yield a failed workspace - at least if the repo was private.
Possible contributing factors:
(edited by Sven)
As part of a fix, we should introduce OTS download failure metrics and keep an eye on them.
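A minimal sketch of such a failure metric using prometheus/client_golang; the metric and label names here are placeholders, not necessarily what #8148 ends up introducing:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// otsDownloadFailures counts failed OTS download attempts during content
// init. The "reason" label is a placeholder to distinguish e.g. HTTP errors
// from not-yet-replicated secrets.
var otsDownloadFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "content_init_ots_download_failures_total",
		Help: "Total number of failed OTS download attempts during content initialisation.",
	},
	[]string{"reason"},
)

func init() {
	prometheus.MustRegister(otsDownloadFailures)
}

// recordOTSDownloadFailure would be called from the retry loop on each failed attempt.
func recordOTSDownloadFailure(reason string) {
	otsDownloadFailures.WithLabelValues(reason).Inc()
}
```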
Steps to reproduce
Check the logs