
Celery: use on_retry to handle BuildMaxConcurrencyError #8917


Merged
ericholscher merged 3 commits into master from humitos/celery-max-concurrency on Feb 14, 2022

Conversation

Member

@humitos humitos commented Feb 14, 2022

Instead of handling the retry of a task manually when the project hits the max
concurrency error, we rely on Celery handlers to do it for us.

By raising any of the exceptions defined in `autoretry_for`, Celery will
call `on_retry` automatically. There, inside `on_retry`, we handle the particular
case of `BuildMaxConcurrencyError` by setting the error on the build object
and letting the task be retried.

See
https://docs.celeryproject.org/en/master/userguide/tasks.html#automatic-retry-for-known-exceptions
and https://docs.celeryproject.org/en/master/userguide/tasks.html#Task.autoretry_for

Continuation of #8905
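For readers less familiar with these Celery hooks, here's a minimal sketch of the pattern described above; the app, task, helper, and setting names are illustrative only, not the actual Read the Docs code:

```python
from celery import Celery, Task

app = Celery("builds")  # illustrative app, not the real RTD setup


class BuildMaxConcurrencyError(Exception):
    message = "Concurrency limit reached ({limit}), retrying in 5 minutes."


def concurrency_limit_reached(build_pk):
    """Hypothetical check for how many builds the project is already running."""
    return False


def mark_build_delayed(build_pk, message):
    """Hypothetical helper that stores the error message on the build object."""


class BuildTask(Task):  # hypothetical base class name
    def on_retry(self, exc, task_id, args, kwargs, einfo):
        # Celery calls this hook before every retry, whether it was triggered
        # automatically via ``autoretry_for`` or explicitly via ``self.retry()``.
        if isinstance(exc, BuildMaxConcurrencyError):
            # Record the error on the build so the user sees why it's delayed.
            mark_build_delayed(args[0], exc.message)


@app.task(
    base=BuildTask,
    # Raising any of these exceptions makes Celery retry the task for us.
    autoretry_for=(BuildMaxConcurrencyError,),
    max_retries=25,              # e.g. settings.RTD_BUILDS_MAX_RETRIES
    default_retry_delay=5 * 60,  # e.g. settings.RTD_BUILDS_RETRY_DELAY
)
def update_docs_task(build_pk):
    if concurrency_limit_reached(build_pk):
        # Celery catches this, calls ``on_retry``, and re-schedules the task.
        raise BuildMaxConcurrencyError
    ...  # run the build
```

The key point is that the task body only raises the exception; recording the error and re-scheduling happen in `on_retry` and in Celery's retry machinery.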

Define Django settings for the max concurrency retries and the seconds of delay, to re-use
them across the whole codebase.
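A rough sketch of the two settings being referenced (only `RTD_BUILDS_MAX_RETRIES` shows up in the settings diff further down; the delay value here is an assumption matching the `5 * 60` countdown it replaces):

```python
# Sketch of the new settings (assumed values).
RTD_BUILDS_MAX_RETRIES = 25      # shown in the settings diff below
RTD_BUILDS_RETRY_DELAY = 5 * 60  # seconds; assumed to match the old hard-coded countdown
```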
@humitos humitos requested a review from a team as a code owner February 14, 2022 21:20
Member

@ericholscher ericholscher left a comment

This seems reasonable, but I don't really follow what is happening here as it's currently written. Definitely needs some comments or docstrings.

@@ -186,8 +186,8 @@ def prepare_build(
project_slug=project.slug,
version_slug=version.slug,
)
options['countdown'] = 5 * 60
Member

How does this code actually cancel the build? It seems it's just setting options on the task, and an error, but still triggering the build?

Also seems like these aren't required, since it should take this from the task?

Member Author

@humitos humitos Feb 14, 2022

This PR does not change the flow of this chunk of code. I just moved these numbers into Django settings.

That said, this code runs on the web servers when the build is triggered, and it immediately sets a 5-minute delay because we already know this project is concurrency-limited at the moment. If we deleted this code, builders would grab each of these triggered tasks and retry them immediately. So, we are just short-circuiting the process here.

Note that it does not cancel the build; it just adds a countdown so the build is delayed 5 minutes. The error message communicates this delay to the user.
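To make that flow concrete, here's a sketch of the web-side short-circuit described above; the function and helper names are placeholders rather than the actual `prepare_build` code:

```python
from django.conf import settings


def project_concurrency_limit_reached(project):
    """Hypothetical check for whether the project already has too many running builds."""
    return True


def build_task_options(project, build, api_client):
    """Sketch: decide the Celery options for a newly triggered build."""
    options = {}
    if project_concurrency_limit_reached(project):
        # Don't cancel the build: just delay its start so the builders
        # don't grab the task and retry it immediately.
        options["countdown"] = settings.RTD_BUILDS_RETRY_DELAY
        options["max_retries"] = settings.RTD_BUILDS_MAX_RETRIES
        # Surface the delay to the user by storing the error on the build.
        message = "Concurrency limit reached ({limit}), retrying in 5 minutes."
        api_client.build(build["id"]).patch({
            "error": message.format(limit=settings.RTD_MAX_CONCURRENT_BUILDS),
        })
    return options
```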

Member

Ah, interesting. That makes the comment above make a lot more sense. 👍

@@ -482,7 +471,16 @@ def on_success(self, retval, task_id, args, kwargs):
self.data.build['success'] = True

def on_retry(self, exc, task_id, args, kwargs, einfo):
Member

This task could use a docstring. When in the build flow is this triggered?

Member Author

I added a small docstring explaining when this Celery handler is called, mentioning both flows: raising a known exception to auto-retry, and calling `self.retry` explicitly.
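For reference, a docstring along these lines (the wording here is illustrative, not the exact one added in the PR):

```python
def on_retry(self, exc, task_id, args, kwargs, einfo):
    """
    Celery handler called when the task is about to be retried.

    It's reached in two ways:

    - an exception listed in ``autoretry_for`` was raised from the task
      and Celery retries it automatically, or
    - the task called ``self.retry()`` explicitly.
    """
```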

exc=BuildMaxConcurrencyError,
# We want to retry this build more times
max_retries=25,
)
Contributor

Ooooh, I see what you were talking about now with autoretry_for. I wasn't familiar with this attribute when we were last talking about this. I like this syntax, it makes more sense 👍


@@ -186,8 +186,8 @@ def prepare_build(
project_slug=project.slug,
version_slug=version.slug,
)
options['countdown'] = 5 * 60
options['max_retries'] = 25
options['countdown'] = settings.RTD_BUILDS_RETRY_DELAY
Member

Suggested change
options['countdown'] = settings.RTD_BUILDS_RETRY_DELAY
# Delay the start of the build for the build retry delay.
# We're still triggering the task, but it won't run immediately,
# and the user will be alerted in the UI from the Error below.
options['countdown'] = settings.RTD_BUILDS_RETRY_DELAY

api_v2.build(self.data.build['id']).patch({
'error': BuildMaxConcurrencyError.message.format(
# By raising this exception and using ``autoretry_for``, Celery
# will handle this automatically calling ``on_retry``
Member

This is a great comment!

@@ -125,6 +125,8 @@ def SESSION_COOKIE_SAMESITE(self):
RTD_STABLE_VERBOSE_NAME = 'stable'
RTD_CLEAN_AFTER_BUILD = False
RTD_MAX_CONCURRENT_BUILDS = 4
RTD_BUILDS_MAX_RETRIES = 25
Member

Is this a reasonable default? This probably isn't the place to change it, but it feels like a lot.

Member Author

I thought about this as well. I think you are right that 25 as a default for all retry cases is not ideal, and we should have a lower number for other cases. However, right now, the only situation where we perform a retry is when the concurrency limit is reached, as far as I can tell.

But we should expand this to be:

  • RTD_BUILDS_MAX_RETRIES=5
  • RTD_BUILDS_CONCURRENT_LIMIT_MAX_RETRIES=25

or something similar, in case we need a different default value for other cases.
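A sketch of that split, using the setting names proposed above (neither exists yet):

```python
# Proposed, not yet existing, settings:
RTD_BUILDS_MAX_RETRIES = 5                    # generic retries for build tasks
RTD_BUILDS_CONCURRENT_LIMIT_MAX_RETRIES = 25  # retries while waiting for a free concurrency slot
```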

@ericholscher ericholscher merged commit bfe9b05 into master Feb 14, 2022
@ericholscher ericholscher deleted the humitos/celery-max-concurrency branch February 14, 2022 23:44
humitos added a commit that referenced this pull request Feb 15, 2022
This comment was suggested in #8917 (comment) but the PR was already merged.