Celery: use on_retry to handle BuildMaxConcurrencyError #8917

Merged 3 commits on Feb 14, 2022

4 changes: 2 additions & 2 deletions readthedocs/core/utils/__init__.py
@@ -186,8 +186,8 @@ def prepare_build(
project_slug=project.slug,
version_slug=version.slug,
)
options['countdown'] = 5 * 60
Member

How does this code actually cancel the build? It seems it's just setting options on the task, and an error, but still triggering the build?

Also, it seems these aren't required, since it should take them from the task?

Member Author (@humitos, Feb 14, 2022)

This PR does not change the flow of this chunk of code; I just moved these numbers into Django settings.

That said, this code runs in the webs when the build is triggered, and it immediately sets a 5-minute delay because we already know this project is concurrency-limited at the moment. If we deleted this code, builders would grab each of these triggered tasks and retry them immediately. So we are just short-circuiting the process here.

Note that it does not cancel the build; it just adds a countdown so the build is delayed by 5 minutes. The error message communicates this delay to the user.

Member

Ah, interesting. That makes the comment above make a lot more sense. 👍

options['max_retries'] = 25
options['countdown'] = settings.RTD_BUILDS_RETRY_DELAY
Member

Suggested change
options['countdown'] = settings.RTD_BUILDS_RETRY_DELAY
# Delay the start of the build for the build retry delay.
# We're still triggering the task, but it won't run immediately,
# and the user will be alerted in the UI from the Error below.
options['countdown'] = settings.RTD_BUILDS_RETRY_DELAY

options['max_retries'] = settings.RTD_BUILDS_MAX_RETRIES
build.error = BuildMaxConcurrencyError.message.format(
limit=max_concurrent_builds,
)
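As the thread above explains, these options only delay the build, they don't cancel it. Here is a minimal, self-contained sketch of what `countdown` and `max_retries` do in Celery; the task name, argument, and broker are illustrative assumptions, not the project's actual code.

```python
# Sketch only, not Read the Docs code: `countdown` delays when a worker starts
# the task, and `max_retries` bounds how many times the task may be retried.
from celery import Celery

app = Celery("example", broker="memory://")

RTD_BUILDS_RETRY_DELAY = 5 * 60   # seconds, mirroring the setting added in this PR
RTD_BUILDS_MAX_RETRIES = 25       # mirroring the setting added in this PR


@app.task(bind=True, max_retries=RTD_BUILDS_MAX_RETRIES)
def update_docs_task(self, version_pk):
    # ... build the docs for `version_pk` ...
    return version_pk


# The task is queued immediately, but a worker will not execute it until
# RTD_BUILDS_RETRY_DELAY seconds have passed. The build is not cancelled,
# only delayed, which is why the error message is shown to the user meanwhile.
update_docs_task.apply_async(args=[1234], countdown=RTD_BUILDS_RETRY_DELAY)
```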
40 changes: 19 additions & 21 deletions readthedocs/projects/tasks/builds.py
@@ -31,6 +31,7 @@
BUILD_STATE_CLONING,
BUILD_STATE_FINISHED,
BUILD_STATE_INSTALLING,
BUILD_STATE_TRIGGERED,
BUILD_STATE_UPLOADING,
BUILD_STATUS_FAILURE,
BUILD_STATUS_SUCCESS,
@@ -217,8 +218,9 @@ class UpdateDocsTask(SyncRepositoryMixin, Task):
autoretry_for = (
BuildMaxConcurrencyError,
)
max_retries = 5 # 5 per normal builds, 25 per concurrency limited
default_retry_delay = 7 * 60
max_retries = settings.RTD_BUILDS_MAX_RETRIES
default_retry_delay = settings.RTD_BUILDS_RETRY_DELAY
retry_backoff = False

# Expected exceptions that will be logged as info only and not retried
throws = (
@@ -263,25 +265,12 @@ def _check_concurrency_limit(self):
max_concurrent_builds = settings.RTD_MAX_CONCURRENT_BUILDS

if concurrency_limit_reached:
# TODO: this could be handled in `on_retry` probably
log.warning(
'Delaying tasks due to concurrency limit.',
project_slug=self.data.project.slug,
version_slug=self.data.version.slug,
)

# This is done automatically on the environment context, but
# we are executing this code before creating one
api_v2.build(self.data.build['id']).patch({
'error': BuildMaxConcurrencyError.message.format(
# By raising this exception and using ``autoretry_for``, Celery
# will handle this automatically calling ``on_retry``
Member

This is a great comment!

raise BuildMaxConcurrencyError(
BuildMaxConcurrencyError.message.format(
limit=max_concurrent_builds,
),
'builder': socket.gethostname(),
})
self.retry(
exc=BuildMaxConcurrencyError,
# We want to retry this build more times
max_retries=25,
)
Contributor

Ooooh, I see what you were talking about now with autoretry_for. I wasn't familiar with this attribute when we last talked about this. I like this syntax; it makes more sense 👍

)

def _check_duplicated_build(self):
@@ -482,7 +471,16 @@ def on_success(self, retval, task_id, args, kwargs):
self.data.build['success'] = True

def on_retry(self, exc, task_id, args, kwargs, einfo):
Member

This task could use a docstring. When in the build flow is this triggered?

Member Author

I added a small docstring explaining when this Celery handler is called, mentioning both flows: raising a known exception to auto-retry, and calling self.retry explicitly.

log.warning('Retrying this task.')
log.info('Retrying this task.')

if isinstance(exc, BuildMaxConcurrencyError):
log.warning(
'Delaying tasks due to concurrency limit.',
project_slug=self.data.project.slug,
version_slug=self.data.version.slug,
)
self.data.build['error'] = exc.message
self.update_build(state=BUILD_STATE_TRIGGERED)

def after_return(self, status, retval, task_id, args, kwargs, einfo):
# Update build object
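For readers unfamiliar with these Celery hooks, here is a minimal, self-contained sketch of the `autoretry_for` / `on_retry` pattern the diff above relies on. The decorator-plus-base-class wiring, the exception class, and the message are simplified assumptions; the PR itself defines a `Task` subclass instead.

```python
# Sketch only, not the PR's code: raising an exception listed in
# `autoretry_for` makes Celery schedule a retry and call `on_retry`,
# exactly as an explicit `self.retry()` would.
from celery import Celery, Task

app = Celery("example", broker="memory://")

RTD_BUILDS_MAX_RETRIES = 25
RTD_BUILDS_RETRY_DELAY = 5 * 60  # seconds


class BuildMaxConcurrencyError(Exception):
    message = "Concurrency limit reached ({limit}), retrying later."


class BuildTaskBase(Task):

    def on_retry(self, exc, task_id, args, kwargs, einfo):
        """
        Celery handler called when the task is about to be retried,
        either because an `autoretry_for` exception was raised or
        because the task called `self.retry()` explicitly.
        """
        if isinstance(exc, BuildMaxConcurrencyError):
            # In the real task, this is where the build is put back into the
            # "triggered" state and the error message is shown to the user.
            print("Delaying task due to concurrency limit:", exc)


@app.task(
    bind=True,
    base=BuildTaskBase,
    autoretry_for=(BuildMaxConcurrencyError,),
    max_retries=RTD_BUILDS_MAX_RETRIES,
    default_retry_delay=RTD_BUILDS_RETRY_DELAY,
    retry_backoff=False,
)
def update_docs_task(self, version_pk):
    concurrency_limit_reached = True  # pretend the project hit its limit
    if concurrency_limit_reached:
        # No explicit self.retry() needed: raising is enough, Celery retries
        # after `default_retry_delay` and invokes `on_retry` above.
        raise BuildMaxConcurrencyError(
            BuildMaxConcurrencyError.message.format(limit=4),
        )
    # ... normal build flow ...
```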
2 changes: 2 additions & 0 deletions readthedocs/settings/base.py
@@ -125,6 +125,8 @@ def SESSION_COOKIE_SAMESITE(self):
RTD_STABLE_VERBOSE_NAME = 'stable'
RTD_CLEAN_AFTER_BUILD = False
RTD_MAX_CONCURRENT_BUILDS = 4
RTD_BUILDS_MAX_RETRIES = 25
Member

Is this a reasonable default? This probably isn't the place to change it, but it feels like a lot.

Member Author

I thought about this as well. I think you are right that 25 as a default for all retry cases is not perfect, and we should have a lower number for other cases. However, right now the only situation where we perform a retry is when the concurrency limit is reached, as far as I can tell.

But we should expand this to be:

  • RTD_BUILDS_MAX_RETRIES=5
  • RTD_BUILDS_CONCURRENT_LIMIT_MAX_RETRIES=25

or similar in case we need the default value for other cases.
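A hedged sketch of what that split could look like in the settings module; the RTD_BUILDS_CONCURRENT_LIMIT_MAX_RETRIES name comes from the suggestion above and is not something this PR adds.

```python
# Hypothetical follow-up, not part of this PR: keep a lower generic default
# and a dedicated, higher limit for the concurrency-limited case.
RTD_BUILDS_MAX_RETRIES = 5                    # generic retries for build tasks
RTD_BUILDS_CONCURRENT_LIMIT_MAX_RETRIES = 25  # retries while the concurrency limit is hit
RTD_BUILDS_RETRY_DELAY = 5 * 60               # seconds between retries
```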

RTD_BUILDS_RETRY_DELAY = 5 * 60 # seconds
RTD_BUILD_STATUS_API_NAME = 'docs/readthedocs'
RTD_ANALYTICS_DEFAULT_RETENTION_DAYS = 30 * 3
RTD_AUDITLOGS_DEFAULT_RETENTION_DAYS = 30 * 3