Celery: use on_retry to handle BuildMaxConcurrencyError #8917
Conversation
Define Django settings for the max-concurrency retry count and the delay in seconds, to re-use them across the whole codebase.
Instead of handling the retry of a task manually when the project hits the max concurrency error, we rely on Celery handlers to do it for us. By raising any of the exceptions defined in `autoretry_for`, Celery will call `on_retry` automatically. There, inside `on_retry`, we handle the particular case of `BuildMaxConcurrencyError` by setting the error on the build object and preventing the task from continuing to run. See https://docs.celeryproject.org/en/master/userguide/tasks.html#automatic-retry-for-known-exceptions and https://docs.celeryproject.org/en/master/userguide/tasks.html#Task.autoretry_for
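To make the flow concrete, here is a minimal, self-contained sketch of the mechanism described above. Only `autoretry_for`, `on_retry`, `BuildMaxConcurrencyError`, and the two settings names come from this PR; the task function, the helper, and the concrete values are made up for the example.

from celery import Celery, Task

app = Celery('sketch')


class BuildMaxConcurrencyError(Exception):
    # Format string shown to the user when the build is delayed.
    message = 'Concurrency limit reached ({limit}), retrying in 5 minutes.'


class BuildTask(Task):
    # Raising any of these exceptions inside the task body makes Celery
    # retry the task automatically, calling ``on_retry`` on each attempt.
    autoretry_for = (BuildMaxConcurrencyError,)
    max_retries = 25              # i.e. settings.RTD_BUILDS_MAX_RETRIES
    default_retry_delay = 5 * 60  # i.e. settings.RTD_BUILDS_RETRY_DELAY (seconds)

    def on_retry(self, exc, task_id, args, kwargs, einfo):
        if isinstance(exc, BuildMaxConcurrencyError):
            # Here the real code records the error on the build object
            # (via the API) so the user knows why the build is delayed.
            pass


def concurrency_limit_reached(version_pk):
    # Hypothetical stand-in for the real concurrency check.
    return True


@app.task(base=BuildTask, bind=True)
def update_docs(self, version_pk):
    if concurrency_limit_reached(version_pk):
        # No manual ``self.retry()`` needed: ``autoretry_for`` handles it.
        raise BuildMaxConcurrencyError(
            BuildMaxConcurrencyError.message.format(limit=4),
        )
    # ... run the actual build ...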
This seems reasonable, but I don't really follow what is happening here as it's currently written. Definitely needs some comments or docstrings.
@@ -186,8 +186,8 @@ def prepare_build(
    project_slug=project.slug,
    version_slug=version.slug,
)
options['countdown'] = 5 * 60
How does this code actually cancel the build? It seems it's just setting options on the task, and an error, but still triggering the build?
Also seems like these aren't required, since it should take this from the task?
This PR does not change the flow of this chunk of code. I just moved these numbers into Django settings.
That said, this code runs on the web servers when the build is triggered, and it immediately sets a 5-minute delay because we already know the project is over its concurrency limit at that moment. If we deleted this code, the builders would grab each of these triggered tasks and retry them immediately. So we are just short-circuiting the process here.
Note that it does not cancel the build; it just adds a countdown so the task is delayed 5 minutes, and the error message communicates this delay to the user (see the sketch below).
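For illustration, reusing the `update_docs` sketch above: a Celery `countdown` only postpones when a worker starts the task; it does not cancel it.

# Triggered now, but a worker won't start it for another 5 minutes.
update_docs.apply_async(
    kwargs={'version_pk': 1234},
    countdown=5 * 60,  # i.e. settings.RTD_BUILDS_RETRY_DELAY
)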
Ah, interesting. That makes the comment above make a lot more sense. 👍
@@ -482,7 +471,16 @@ def on_success(self, retval, task_id, args, kwargs):
    self.data.build['success'] = True

def on_retry(self, exc, task_id, args, kwargs, einfo):
This task could use a docstring. When in the build flow is this triggered?
I added a small docstring explaining when this Celery handler is called, mentioning both flows: raising a known exception to auto-retry, and calling self.retry.
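Roughly what such a docstring could look like (a sketch of the idea, not the exact text added in the PR):

def on_retry(self, exc, task_id, args, kwargs, einfo):
    """
    Celery handler called when the task is going to be retried.

    This happens in two flows:

    * an exception listed in ``autoretry_for`` is raised from the task
      body, so Celery retries it automatically, or
    * the task calls ``self.retry()`` explicitly.

    For ``BuildMaxConcurrencyError`` we also store the error message on
    the build object, so the user knows why the build is being delayed.
    """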
    exc=BuildMaxConcurrencyError,
    # We want to retry this build more times
    max_retries=25,
)
Ooooh, I see what you were talking about now with autoretry_for. I wasn't familiar with this attribute when we were last talking about this. I like this syntax, it makes more sense 👍
@@ -186,8 +186,8 @@ def prepare_build(
     project_slug=project.slug,
     version_slug=version.slug,
 )
-options['countdown'] = 5 * 60
-options['max_retries'] = 25
+options['countdown'] = settings.RTD_BUILDS_RETRY_DELAY
Suggested change (add a comment above the countdown assignment):
# Delay the start of the build for the build retry delay.
# We're still triggering the task, but it won't run immediately,
# and the user will be alerted in the UI from the Error below.
options['countdown'] = settings.RTD_BUILDS_RETRY_DELAY
api_v2.build(self.data.build['id']).patch({
    'error': BuildMaxConcurrencyError.message.format(
# By raising this exception and using ``autoretry_for``, Celery
# will handle this automatically calling ``on_retry``
This is a great comment!
@@ -125,6 +125,8 @@ def SESSION_COOKIE_SAMESITE(self):
RTD_STABLE_VERBOSE_NAME = 'stable'
RTD_CLEAN_AFTER_BUILD = False
RTD_MAX_CONCURRENT_BUILDS = 4
RTD_BUILDS_MAX_RETRIES = 25
Is this a reasonable default? This probably isn't the place to change it, but it feels like a lot.
I thought about this as well. I think you are right that 25 as a default for all retry cases is not perfect, and we should have a lower number for other cases. However, right now, the only situation where we perform a retry is when the concurrency limit is reached, as far as I can tell.
But we should expand this to be:
RTD_BUILDS_MAX_RETRIES=5
RTD_BUILDS_CONCURRENT_LIMIT_MAX_RETRIES=25
or similar in case we need the default value for other cases.
This comment was suggested in #8917 (comment) but the PR was already merged.
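For reference, the proposed split would read roughly like this in the Django settings (names taken from the comment above, values illustrative):

RTD_BUILDS_MAX_RETRIES = 5                    # default for build task retries
RTD_BUILDS_CONCURRENT_LIMIT_MAX_RETRIES = 25  # retries when the concurrency limit is hit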
Continuation of #8905