Auto merge of #1804 - sgrif:sg-more-resilient-job-runner, r=jtgeibel
Make the job runner a bit more resilient to slow jobs or other errors
A brief incident was caused by #1798. A band-aid fix is in place, and
#1803 (included in this branch) makes it possible to apply similar
band-aids in the future without requiring a rebuild of the code. This
commit attempts to better address the root problem though.
The short version (which is expanded on below, but not required to
understand this commit or why it's needed) is that `update_downloads`
takes longer than our job timeout to run. When we moved that task to a
background job, we did not increase the number of concurrent jobs, nor
did we increase the timeout. This meant that swirl timed out trying to
start new jobs, and our behavior in that case was to crash the process.
This meant that `update_downloads` would never complete, remaining at
the front of the queue. This PR addresses the problem in three ways:
- Increasing concurrency
- When this system was added, the only jobs we had were index updates.
These want to be serial, so we set the thread pool size to 1. We
added readme renderings, which probably should have been parallel,
but only happen with crate publishes anyway so it was fine.
`update_downloads` *always* takes longer than the timeout to run
though. We can't have it block everything else while it's running.
The main downside to this is that index updates are no longer
guaranteed to run in serial, which means that if two crates are
uploaded simultaneously one job will fail and will have to wait for
a retry to update the index. In theory if a crate happened to be
uploaded at the exact instant of the retry 7 or 8 times in a row
this could even result in getting paged. This is exceptionally
unlikely, and I'm not concerned about it for now. As more features
land in swirl we may want to move index updates to their own queue
or tweak the retry behavior on that job though.
Swirl will eventually handle this for us by default, and we should
use its defaults once that lands.
- Increasing the default timeout
- 10s was a bit too aggressive. Fundamentally there is always some
condition under which we hit this timeout, and the reason for hitting
it is that we are receiving more jobs than we can process (either
because of the volume of jobs, or because our jobs are too slow).
The most common reason we would hit this is that all threads are
occupied by jobs which take longer than the timeout to execute.
Increasing the concurrency makes this less likely to occur since our
jobs are low volume, but we were actually seeing this crash before
the addition of `update_downloads`, meaning that our other jobs
sometimes take >10s to run. Increasing the concurrency beyond 2
would make it extremely unlikely that we will ever hit this, but
since we theoretically can with a burst of crate uploads at any
concurrency, I've also upped the timeout (both this and the
concurrency change are sketched after this list).
- Rebuilding the runner a few times before crashing the process
- This is the most important change, though it's the only one that
wouldn't fix the problem by itself. The first two changes address
why the problem occurred, this last change addresses why it placed
us in an unrecoverable state.
What would happen is we would time out trying to start another job
after `update_downloads`, and then the process would crash. This
would mean that `update_downloads` would never complete, so as soon
as we restarted, we'd just try to run it again (I may also change
swirl to increment the retry counter before even beginning to run
the job, but there are issues with that which are out of scope for
this commit to discuss).
This commit changes the behavior to instead build a new runner
(which means a new thread pool and DB pool) up to 5 times before
crashing the process. This means that any spawned threads will get a
bit more time to run before the process itself crashes, so any jobs
clogging the runner still get a chance to complete. I've opted to
have a hard limit on the number of failures in the runner to avoid
potentially unbounded growth in DB connections. We do still want to
eventually fail, since being unable to start jobs can indicate
issues that are only solved by starting a new process or moving to
another physical machine.
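
Concretely, the change looks roughly like the following self-contained
sketch. `Runner` here is a stand-in for Swirl's runner (whose real entry
point is `run_all_pending_jobs`), and the configuration knobs and the new
timeout value shown are illustrative assumptions rather than the exact
values that landed.

```rust
use std::thread::sleep;
use std::time::Duration;

// Stand-in for swirl's runner; only what's needed to show the restart policy.
#[allow(dead_code)]
struct Runner {
    thread_count: usize,
    job_start_timeout: Duration,
}

impl Runner {
    fn run_all_pending_jobs(&self) -> Result<(), String> {
        // The real implementation saturates the thread pool and returns an
        // error if it times out waiting to hear back from a worker.
        Ok(())
    }
}

fn build_runner() -> Runner {
    Runner {
        // Previously 1; index updates may now occasionally conflict and retry.
        thread_count: 2,
        // Previously 10s; the value here is a placeholder, not the real one.
        job_start_timeout: Duration::from_secs(30),
    }
}

fn main() {
    let mut runner = build_runner();
    let mut failure_count = 0;

    loop {
        if let Err(e) = runner.run_all_pending_jobs() {
            failure_count += 1;
            if failure_count < 5 {
                // Build a fresh runner (new thread pool and DB pool) so any
                // jobs still running on the old one get more time to finish.
                eprintln!("Error running jobs (failure {}): {}", failure_count, e);
                runner = build_runner();
            } else {
                // Hard limit to avoid unbounded growth in DB connections; at
                // this point a new process or machine is the only fix.
                panic!("Failed to run jobs 5 times, giving up: {}", e);
            }
        }
        sleep(Duration::from_secs(1));
    }
}
```

The behavioral change is the loop in `main`: instead of calling `.expect`
on the result of `run_all_pending_jobs`, we rebuild the runner and retry
up to 5 times before crashing.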
More specific technical details on the issue that are not required to review this PR, but may be interesting
--
I've written this issue up at sgrif/swirl#16
as well.
The main entry point for a Swirl runner today is `run_all_pending_jobs`.
This method is fairly low level. The intent is to eventually add a
"reasonable defaults" binary shipped with swirl, probably somewhat based
on what crates.io needs here. This method will run in a loop, attempting
to fully saturate its thread pool on each iteration. It will check the
number of available threads, spawning that many tasks.
Each task that is spawned will quickly communicate back to the
coordinator via an mpsc channel. The coordinator keeps track of how many
messages it's expecting (we get exactly 1 message per spawned task). If
we aren't currently expecting any messages, and there are also 0
available threads, we will attempt to spawn 1 task no matter what. This
is to ensure we don't loop forever waiting for a free thread, and
respect the given timeout.
We do this in a loop until we hear from a thread that there was no job
available, or receive an error (caused by a thread being unable to get a
DB connection, an error loading the job from the DB [which should only
happen if the DB has gone away], or if we time out waiting to hear back
at all).
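
As a rough illustration (this is not Swirl's actual code), the
coordination described above looks something like the sketch below; the
`Event` enum and the `available_threads`/`spawn_worker` hooks are made-up
stand-ins.

```rust
use std::sync::mpsc::{channel, RecvTimeoutError, Sender};
use std::time::Duration;

// Exactly one of these is sent back per spawned task.
pub enum Event {
    Working,               // the task claimed a job and is running it
    NoJobAvailable,        // the queue is empty
    FailedToStart(String), // e.g. no DB connection, or the job couldn't be loaded
}

pub fn run_all_pending_jobs(
    available_threads: impl Fn() -> usize,
    spawn_worker: impl Fn(Sender<Event>),
    timeout: Duration,
) -> Result<(), String> {
    let (tx, rx) = channel();
    let mut pending_messages = 0usize;

    loop {
        // Try to fully saturate the thread pool on every iteration.
        let available = available_threads();
        // If nothing is free and we aren't waiting on any messages, spawn one
        // task anyway so we don't loop forever waiting for a free thread and
        // so the timeout below is respected.
        let to_spawn = if available == 0 && pending_messages == 0 {
            1
        } else {
            available
        };
        for _ in 0..to_spawn {
            spawn_worker(tx.clone());
            pending_messages += 1;
        }

        match rx.recv_timeout(timeout) {
            // A worker claimed a job; loop and keep saturating the pool.
            Ok(Event::Working) => pending_messages -= 1,
            // Nothing left in the queue; we're done for this call.
            Ok(Event::NoJobAvailable) => return Ok(()),
            Ok(Event::FailedToStart(e)) => return Err(e),
            // This is the branch crates.io kept hitting: the only thread was
            // busy running `update_downloads`, so the extra task we spawned
            // never got a chance to report back before the timeout.
            Err(RecvTimeoutError::Timeout) => {
                return Err("timed out waiting to hear back from a worker".to_string());
            }
            Err(RecvTimeoutError::Disconnected) => {
                return Err("all workers hung up".to_string());
            }
        }
    }
}
```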
That's exactly what happened in this case. We would see 1 available
thread, spawn 1 task, and have 1 pending message. The worker would
communicate back that it got a job. We'd loop. There are 0 available
threads. We are expecting 0 messages, so we spawn 1 task anyway. We are
now expecting 1 pending message. We block waiting for it. The only way
we will receive a message is for the job we started in the first
iteration to complete before the timeout. It doesn't, so
`run_all_pending_jobs` returns an error. Our runner was calling
`.expect` on that, so the process crashes.
This shows several issues, both in the configuration crates.io was using
and in Swirl itself. I discussed the configuration
issues above, but there are also questions WRT Swirl's design. The first
issue is whether this case should be separated from not getting a
response from the worker at all. The latter should *never* happen under
reasonable circumstances, so my gut is that we can assume if it does
happen it was due to this case...
The second issue is that this put us in an unrecoverable state rather
than just causing one class of jobs to fail to run. This could be
prevented by increasing the retry counter outside of a transaction
before running the job. This has issues though, which are out of scope
for this commit, but basically boil down to introducing non-atomic
pieces to an otherwise atomic operation.
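
For a concrete picture of that idea (and of the non-atomicity it
introduces), here is a hypothetical sketch. It assumes a diesel 1.x-style
connection and a `background_jobs` table with a `retries` column; both the
schema names and the approach are illustrative, not something this commit
actually does.

```rust
use diesel::prelude::*;
use diesel::sql_query;
use diesel::sql_types::BigInt;
use diesel::{PgConnection, QueryResult};

// Hypothetical: bump the retry counter in its own statement, *outside* the
// transaction that locks and runs the job. If the process dies mid-job the
// attempt is still recorded, but the counter update and the job run are no
// longer a single atomic unit.
pub fn record_attempt(conn: &PgConnection, job_id: i64) -> QueryResult<usize> {
    sql_query("UPDATE background_jobs SET retries = retries + 1 WHERE id = $1")
        .bind::<BigInt, _>(job_id)
        .execute(conn)
}
```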