Builds fail trying to create_container and/or remove_container #7583

Closed
minrk opened this issue Oct 20, 2020 · 13 comments
Labels: Accepted (Accepted issue on our roadmap), Bug (A bug)

Comments


minrk commented Oct 20, 2020

Details:

Expected Result

Build succeeds, docs are published

Actual Result

Build succeeds, but docs are not published. Or an error somewhere is not reported. The build status is:

Error
There was a problem with Read the Docs while building your documentation. Please try again later.

but no errors are reported in the logs.

The build log ends with:

The HTML pages are in _build/html.
Updating searchtools for Read the Docs search... /home/docs/checkouts/readthedocs.org/user_builds/jupyterlab/checkouts/latest/docs/source/conf.py:281: RemovedInSphinx40Warning: The app.add_stylesheet() is deprecated. Please use app.add_css_file() instead.
  app.add_stylesheet('custom.css')  # may also be an URL
Command time: 739s Return: 0

This started failing after an update to how the project builds docs:

  1. typedoc-built HTML files are built and staged to the output directory (this adds an expensive npm build that could create memory pressure, but it succeeds before the sphinx build continues)
  2. conda is used for installation, which included bumping the nodejs runtime to 14.x on $PATH and sphinx from 1.8 to 3.2.1 compared to the last successful build (12139664)
@Daltz333

But that's just a deprecation, not sure why it's failing on it.
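
For reference, the replacement API is a one-line change in conf.py; a minimal sketch, assuming custom.css lives under html_static_path (this is not the actual jupyterlab configuration):

```python
# docs/source/conf.py (sketch)

html_static_path = ["_static"]  # assumes custom.css is placed in _static/

def setup(app):
    # Sphinx >= 1.8: add_css_file() is the replacement for the
    # deprecated add_stylesheet() flagged in the build log above.
    app.add_css_file("custom.css")  # may also be a URL
```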


humitos commented Oct 20, 2020

It seems it was a temporary error. It failed when trying to clean up the Docker container used to build the documentation.

I just triggered a new build for latest: https://readthedocs.org/projects/jupyterlab/builds/12146392/ (this may fail because it's using a smaller builder)

Also, it could be that you are hitting the max time allowed to build (your build reports 1311 seconds). So, increasing this time may help here as well. Let's see.

@minrk I think I know what happened. Were you using pip to create the virtualenv before, and now you are using conda? I guess that was the case. I'm moving your project to a bigger builder so it doesn't time out or fail because of memory usage.

I triggered another build https://readthedocs.org/projects/jupyterlab/builds/12146492/ (this should succeed because it's running on a bigger server).

Please let me know if the following builds are OK.

humitos added the "Needed: more information" (a reply from issue author is required) and "Support" labels Oct 20, 2020

humitos commented Oct 21, 2020

Hrm... The new build on the bigger server didn't succeed either. I will need to take a deeper look at this because I'm not sure what's happening here and the logs do not tell me much about this.


humitos commented Oct 26, 2020

OK. I triggered a new build for latest and jumped into the server that started the build. I think I found something there. CPU consumption is very high while node is running; however, there is not much memory usage at that point.

Besides, it seems that it's not one specific process being killed, but the whole VM instance. My SSH session was closed, I don't have access to the same VM anymore, and it does not appear under the Azure instances anymore. So, it seems it's being killed by Azure for some reason? 🤷‍♂️


humitos commented Oct 26, 2020

I locked the VM instance so it couldn't be touched by the autoscale rules, just in case, and it wasn't killed by Azure. While doing the build, I checked the memory used by node and found that it was about ~40MB, with 100% CPU for ~650 seconds (for each node step --there are 3).

Then, the logs show that the Docker container used for building can't be removed (which our app handles), then ~5 minutes of a celery process at 100% CPU (I don't understand what it's doing), and finally we get a disconnection from Redis that causes the build to be marked as failed with that generic error:

Oct 26 13:56:46 buildlarge000B1E supervisord: build celery.worker.consumer.consumer:338[1617]: WARNING consumer: Connection to broker lost. Trying to re-establish the connection...
Oct 26 13:56:46 buildlarge000B1E supervisord: build Traceback (most recent call last):
Oct 26 13:56:46 buildlarge000B1E supervisord: build   File "/home/docs/lib/python3.6/site-packages/redis/connection.py", line 177, in _read_from_socket
Oct 26 13:56:46 buildlarge000B1E supervisord: build     raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
Oct 26 13:56:46 buildlarge000B1E supervisord: build OSError: Connection closed by server.


humitos commented Oct 26, 2020

I was able to reproduce this issue in my local RTD development instance. I do see different calls to the API that fail for this project:

  • /api/v2/version/<pk>/
  • /api/v2/build/<pk>/
  • /api/v2/version/<pk>/ (again)

They retry over and over again and finally fail with Max retries exceeded, and the build is marked as failed with an unhandled exception.

These URLs are hit after the sphinx-build command has finished and the artifacts were uploaded to the Blob storage (Azurite).

This issue is not present when building any other project locally.
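
For context, "Max retries exceeded" is the standard requests/urllib3 failure once every retry is exhausted; a small illustrative sketch (the retry policy and version pk are made up, not the build worker's actual settings):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Illustrative retry policy only; the real worker configures its own.
session = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

version_pk = 12345  # hypothetical pk, just for the example
# When every attempt fails, requests raises an exception whose message
# contains "Max retries exceeded", and the build gets marked as failed.
response = session.get(
    f"https://readthedocs.org/api/v2/version/{version_pk}/", timeout=30
)
```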


humitos commented Oct 26, 2020

OK, thanks to @stsewd, who realized this. I think we found the issue and we have a workaround for your case.

Your git repository has ~15k tags and ~30 branches. Read the Docs keeps those tags/branches in sync and creates a Version object for each of them. When RTD tries to sync all of those tags using the API, it results in a timeout (we have a PR that will make this call async for this reason: #7548) and the build hitting the API to update them fails.
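
As a rough illustration of why the tag count matters (the endpoint and payload shape below are invented for the example, not the actual RTD internals): a single synchronous request has to carry, and trigger upserts for, one entry per tag, so ~15k tags can easily push it past the API timeout.

```python
import subprocess
import requests

def sync_versions(api_url, token):
    # One entry per tag; with ~15k tags this payload, and the server-side
    # Version upserts it triggers, can take longer than the request timeout.
    tags = subprocess.run(
        ["git", "tag", "--list"], capture_output=True, text=True, check=True
    ).stdout.split()
    payload = {"tags": [{"identifier": tag, "verbose_name": tag} for tag in tags]}
    response = requests.post(
        f"{api_url}/sync_versions/",  # illustrative endpoint, not the real URL
        json=payload,
        headers={"Authorization": f"Token {token}"},
        timeout=60,
    )
    response.raise_for_status()
```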

I checked your documentation and you have 2.3.x and 1.2.x versions enabled. Both are built from branches. So, I disabled the "sync of tags" for your repository. Let's see if that helps here.

Take into account that new tags won't appear in Read the Docs as versions now and you won't be able to build them for now (at least until we deploy the async PR). If you find yourselves needing these tags, please ping us back and we will try re-enabling the sync of tags once the async PR is merged.

All of this does not explain why sometimes it failed with OOM, though.

@blink1073

It looks like the "sync of tags" workaround didn't fully solve the problem; we're still seeing failing builds that aren't due to OOM, e.g. https://readthedocs.org/projects/jupyterlab/builds/12211330/


humitos commented Oct 29, 2020

Yeah... There is something weird happening with Docker, but I'm not sure what it is yet. Our code was already catching some exceptions for this case (remove_container when the build was finished) --which wasn't super important and the build could continue, but now we are getting ReadTimeout and failing with an uncaught exception (I opened a PR to add this exception as well). However, I'm seeing similar issues in other projects when calling create_container, which we can't simply catch and ignore.
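
As a sketch of the kind of defensive handling described above, using docker-py's low-level APIClient (the function itself is hypothetical, not the actual RTD code):

```python
import docker
import requests

client = docker.APIClient(base_url="unix://var/run/docker.sock")

def remove_build_container(container_id):
    """Best-effort cleanup: failing to remove the container shouldn't fail the build."""
    try:
        client.remove_container(container_id, force=True)
    except docker.errors.NotFound:
        # Container is already gone; nothing to do.
        pass
    except (docker.errors.APIError, requests.exceptions.ReadTimeout) as exc:
        # Daemon error, or no answer in time: log and continue, since the
        # documentation itself has already been built at this point.
        print(f"Could not remove container {container_id}: {exc}")
```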

Let's see if the PR I opened helps for now as a workaround, but the root cause of the problem is still unknown to me. I guess that after an OOM issue on the servers the Docker daemon gets dumb --maybe the OS is killing it? We will need to debug this more deeply, I'm sure.

humitos changed the title from "Build error with build id #12139704" to "Builds fail trying to create_container and/or remove_container" Oct 29, 2020
humitos added the "Accepted" and "Bug" labels and removed the "Needed: more information" and "Support" labels Oct 29, 2020

humitos commented Nov 10, 2020

We just deployed #7618 and I triggered a build for jupyterlab. It passed. You can see the output at https://readthedocs.org/projects/jupyterlab/builds/12307459/. I'd be interested in how things keep going with your project.

@blink1073

I think we're good, thanks! The last four PR builds also passed.


humitos commented Jan 27, 2022

We haven't seen this in a long time now and we did a massive refactor of the build process in #8815. So, I'm closing it. We can come back to it if we find it again and it becomes a problem.

humitos closed this as completed Jan 27, 2022