Builds fail trying to create_container and/or remove_container #7583

Closed
minrk opened this issue Oct 20, 2020 · 13 comments
Labels: Accepted (Accepted issue on our roadmap), Bug (A bug)

Comments


minrk commented Oct 20, 2020

Details:

Expected Result

Build succeeds, docs are published

Actual Result

Build succeeds, but docs are not published. Or an error somewhere is not reported. The build status is:

Error
There was a problem with Read the Docs while building your documentation. Please try again later.

but no errors are reported in the logs.

The build log ends with:

The HTML pages are in _build/html.
Updating searchtools for Read the Docs search... /home/docs/checkouts/readthedocs.org/user_builds/jupyterlab/checkouts/latest/docs/source/conf.py:281: RemovedInSphinx40Warning: The app.add_stylesheet() is deprecated. Please use app.add_css_file() instead.
  app.add_stylesheet('custom.css')  # may also be an URL
Command time: 739s Return: 0

This started failing after an update to how the project builds docs:

  1. typedoc-built HTML files are built and staged to the output directory (this adds an expensive npm build that could create memory pressure, but it succeeds before the sphinx build continues)
  2. conda is used for installation, which included bumping the nodejs runtime to 14.x on $PATH and sphinx from 1.8 to 3.2.1 compared to the last successful build (12139664)
@Daltz333

But that's just a deprecation, not sure why it's failing on it.
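
For reference, the replacement API is a one-line change in conf.py; a minimal sketch, assuming custom.css lives under html_static_path (this is not the actual jupyterlab configuration):

```python
# docs/source/conf.py (sketch)

html_static_path = ["_static"]  # assumes custom.css is placed in _static/

def setup(app):
    # Sphinx >= 1.8: add_css_file() is the replacement for the
    # deprecated add_stylesheet() flagged in the build log above.
    app.add_css_file("custom.css")  # may also be a URL
```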


humitos commented Oct 20, 2020

It seems it was a temporary error. It failed when trying to clean up the Docker container used to build the documentation.

I just triggered a new build for latest: https://readthedocs.org/projects/jupyterlab/builds/12146392/ (this may fail because it's using a smaller builder)

Also, it could be that you are hitting the max time allowed to build (your build reports 1311 seconds). So, increasing this time may help here as well. Let's see.

@minrk I think I know what happened. Were you using pip to create the virtualenv before, and now you are using conda? I guess that was the case. I'm moving your project to a bigger builder so it doesn't time out or fail because of memory usage.

I triggered another build https://readthedocs.org/projects/jupyterlab/builds/12146492/ (this should succeed because it's running on a bigger server).

Please let me know if the following builds are OK.

humitos added the "Needed: more information" (a reply from issue author is required) and "Support" labels Oct 20, 2020

humitos commented Oct 21, 2020

Hrm... The new build on the bigger server didn't succeed either. I will need to take a deeper look at this because I'm not sure what's happening here and the logs do not tell me much about this.


humitos commented Oct 26, 2020

OK. I triggered a new build for latest and jumped into the server that started the build. I think I found something there. CPU consumption is very high while node is running; however, there is not much memory usage at that point.

Besides, it seems that it's not one specific process being killed, but the whole VM instance. My SSH session was closed, I don't have access to the same VM anymore, and it does not appear under the Azure instances anymore. So, it seems it's being killed by Azure for some reason? 🤷‍♂️


humitos commented Oct 26, 2020

I locked the VM instance so it couldn't be touched by the autoscale rules, just in case, and it wasn't killed by Azure. While doing the build, I checked the memory used by node and found that it was about ~40MB, with 100% CPU for ~650 seconds (for each node step --there are 3).

Then, the logs show that the Docker container used for building can't be removed (which our app handles), then ~5 minutes of a celery process at 100% CPU (I don't understand what it's doing), and finally we get a disconnection from Redis that causes the build to be marked as failed with that generic error:

Oct 26 13:56:46 buildlarge000B1E supervisord: build celery.worker.consumer.consumer:338[1617]: WARNING consumer: Connection to broker lost. Trying to re-establish the connection...
Oct 26 13:56:46 buildlarge000B1E supervisord: build Traceback (most recent call last):
Oct 26 13:56:46 buildlarge000B1E supervisord: build   File "/home/docs/lib/python3.6/site-packages/redis/connection.py", line 177, in _read_from_socket
Oct 26 13:56:46 buildlarge000B1E supervisord: build     raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
Oct 26 13:56:46 buildlarge000B1E supervisord: build OSError: Connection closed by server.


humitos commented Oct 26, 2020

I was able to reproduce this issue in my local RTD development instance. I do see different calls to the API that fail for this project:

  • /api/v2/version/<pk>/
  • /api/v2/build/<pk>/
  • /api/v2/version/<pk>/ (again)

They retry over and over again and finally fail with Max retries exceeded, and the build is marked as failed with an unhandled exception.

These URLs are hit after the sphinx-build command has finished and the artifacts were uploaded to the Blob storage (Azurite).

This issue is not present when building any other project locally.
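
For context, "Max retries exceeded" is the standard requests/urllib3 failure once every retry is exhausted; a small illustrative sketch (the retry policy and version pk are made up, not the build worker's actual settings):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Illustrative retry policy only; the real worker configures its own.
session = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

version_pk = 12345  # hypothetical pk, just for the example
# When every attempt fails, requests raises an exception whose message
# contains "Max retries exceeded", and the build gets marked as failed.
response = session.get(
    f"https://readthedocs.org/api/v2/version/{version_pk}/", timeout=30
)
```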


humitos commented Oct 26, 2020

OK, thanks to @stsewd, who realized this. I think we found the issue and we have a workaround for your case.

Your git repository has ~15k tags and ~30 branches. Read the Docs keeps those tags/branches in sync and creates a Version object for each of them. When RTD tries to sync all of those tags using the API, it results in a timeout (we have a PR that will make this call async for this reason: #7548) and the build hitting the API to update them fails.
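
As a rough illustration of why the tag count matters (the endpoint and payload shape below are invented for the example, not the actual RTD internals): a single synchronous request has to carry, and trigger upserts for, one entry per tag, so ~15k tags can easily push it past the API timeout.

```python
import subprocess
import requests

def sync_versions(api_url, token):
    # One entry per tag; with ~15k tags this payload, and the server-side
    # Version upserts it triggers, can take longer than the request timeout.
    tags = subprocess.run(
        ["git", "tag", "--list"], capture_output=True, text=True, check=True
    ).stdout.split()
    payload = {"tags": [{"identifier": tag, "verbose_name": tag} for tag in tags]}
    response = requests.post(
        f"{api_url}/sync_versions/",  # illustrative endpoint, not the real URL
        json=payload,
        headers={"Authorization": f"Token {token}"},
        timeout=60,
    )
    response.raise_for_status()
```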

I checked your documentation and you have 2.3.x and 1.2.x versions enabled. Both are built from branches. So, I disabled the "sync of tags" for your repository. Let's see if that helps here.

Take into account that new tags won't appear in Read the Docs as versions now and you won't be able to build them for now (at least until we deploy the async PR). If you find yourselves needing these tags, please ping us back and we will try re-enabling the sync of tags once the async PR is merged.

All of this does not explain why sometimes it failed with OOM, though.

@blink1073

It looks like the "sync of tags" workaround didn't fully solve the problem; we're still seeing failing builds that aren't due to OOM, e.g. https://readthedocs.org/projects/jupyterlab/builds/12211330/


humitos commented Oct 29, 2020

Yeah... There is something weird happening with Docker, but I'm not sure what it is yet. Our code was already catching some exceptions for this case (remove_container when the build was finished) --which wasn't super important and the build could continue, but now we are getting ReadTimeout and failing with an uncaught exception (I opened a PR to add this exception as well). However, I'm seeing similar issues in other projects when calling create_container, which we can't simply catch and ignore.
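
As a sketch of the kind of defensive handling described above, using docker-py's low-level APIClient (the function itself is hypothetical, not the actual RTD code):

```python
import docker
import requests

client = docker.APIClient(base_url="unix://var/run/docker.sock")

def remove_build_container(container_id):
    """Best-effort cleanup: failing to remove the container shouldn't fail the build."""
    try:
        client.remove_container(container_id, force=True)
    except docker.errors.NotFound:
        # Container is already gone; nothing to do.
        pass
    except (docker.errors.APIError, requests.exceptions.ReadTimeout) as exc:
        # Daemon error, or no answer in time: log and continue, since the
        # documentation itself has already been built at this point.
        print(f"Could not remove container {container_id}: {exc}")
```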

Let's see if the PR I opened helps for now as a workaround, but the root cause of the problem is still unknown to me. I guess that after an OOM issue on the servers the Docker daemon gets dumb --maybe the OS is killing it? We will need to debug this more deeply, I'm sure.

humitos changed the title from "Build error with build id #12139704" to "Builds fail trying to create_container and/or remove_container" Oct 29, 2020
humitos added the "Accepted" and "Bug" labels and removed the "Needed: more information" and "Support" labels Oct 29, 2020

humitos commented Nov 10, 2020

We just deployed #7618 and I triggered a build for jupyterlab. It passed. You can see the output at https://readthedocs.org/projects/jupyterlab/builds/12307459/. I'd be interested in how things keep going with your project.

@blink1073

I think we're good, thanks! The last four PR builds also passed.


humitos commented Jan 27, 2022

We haven't seen this in a long time now and we did a massive refactor of the build process in #8815. So, I'm closing it. We can come back to it if we find it again and it becomes a problem.

humitos closed this as completed Jan 27, 2022