Skip to content

Build: cloning multibranch repository takes too long #9736

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
humitos opened this issue Nov 16, 2022 · 14 comments · Fixed by #10430
Closed

Build: cloning multibranch repository takes too long #9736

humitos opened this issue Nov 16, 2022 · 14 comments · Fixed by #10430
Assignees
Labels
Accepted Accepted issue on our roadmap Improvement Minor improvement to code

Comments

@humitos
Copy link
Member

humitos commented Nov 16, 2022

I just noticed that there are some projects where it takes too long to clone the repository, ~500 seconds, but the build time is just a few of seconds. I found this scenario in ccxt project: https://readthedocs.org/projects/ccxt/builds/18648371/ 1

Screenshot_2022-11-16_12-36-02

We've done some improvements in the past by adding --depth 50 so we don't bring the whole history. However, we are using --no-single-branch, which will bring the 50 last commits of all the branches in the repository. That's what I understand from the manpage:

--depth
Create a shallow clone with a history truncated to the specified number of commits. Implies
--single-branch unless --no-single-branch is given to fetch the histories near the tips of all
branches. If you want to clone submodules shallowly, also pass --shallow-submodules.
bring the history near the tips of all the branches

Since we are building always from one and only one branch, we could probably remove the --no-single-branch and only fetch the latest 50 commits for the branch we are actually going to build.

Example

Cloning with depth 1 and single branch, it takes 5 seconds

$ time git clone --depth 1 https://github.com/ccxt/ccxt
Cloning into 'ccxt'...
remote: Enumerating objects: 1677, done.
remote: Counting objects: 100% (1677/1677), done.
remote: Compressing objects: 100% (1248/1248), done.
remote: Total 1677 (delta 786), reused 644 (delta 397), pack-reused 0
Receiving objects: 100% (1677/1677), 8.54 MiB | 5.38 MiB/s, done.
Resolving deltas: 100% (786/786), done.
git clone --depth 1 https://github.com/ccxt/ccxt  1.01s user 0.21s system 24% cpu 4.912 total
Time: 0h:00m:05s

Cloning with depth 1 but using --no-single-branch, it takes 1 minutes 37 seconds. Which is almost 20x slower.

$ time git clone --no-single-branch --depth 1 https://github.com/ccxt/ccxt
Cloning into 'ccxt'...
remote: Enumerating objects: 57663, done.
remote: Counting objects: 100% (57663/57663), done.
remote: Compressing objects: 100% (23836/23836), done.
remote: Total 57663 (delta 52438), reused 36909 (delta 33131), pack-reused 0
Receiving objects: 100% (57663/57663), 107.62 MiB | 5.27 MiB/s, done.
Resolving deltas: 100% (52438/52438), done.
git clone --no-single-branch --depth 1 https://github.com/ccxt/ccxt  35.18s user 2.10s system 38% cpu 1:36.33 total
Time: 0h:01m:37s

Note that this increases to the 5 minutes saw in the Read the Docs build because it uses --depth 50

Proposal

  • Remove the usage of --no-single-branch
  • Keep using --depth 50
  • Add --branch <branch>
  • Remove the step git checkout --force <branch>

This way, we get the best outcome by downloading only the exact data needed to perform the build using one branch. Since this is the most general use case 2, we will be saving lot of bandwidth here and also improving the UX for our users:

$ time git clone --depth 1 --branch parallel-build https://github.com/ccxt/ccxt
Cloning into 'ccxt'...
remote: Enumerating objects: 1686, done.
remote: Counting objects: 100% (1686/1686), done.
remote: Compressing objects: 100% (1236/1236), done.
remote: Total 1686 (delta 804), reused 653 (delta 418), pack-reused 0
Receiving objects: 100% (1686/1686), 8.76 MiB | 4.69 MiB/s, done.
Resolving deltas: 100% (804/804), done.
git clone --depth 1 --branch parallel-build https://github.com/ccxt/ccxt  0.93s user 0.20s system 21% cpu 5.347 total
Time: 0h:00m:05s

After executing this command, the repository will be in the parallel-build branch already. It will be ready to build the documentation.

Footnotes

  1. noticed this on Metabase because this project showed a high number of build time compared to others

  2. users wanting to do something different can always make usage of build.jobs or build.commands in case they need more branches in their repository

@humitos humitos moved this to Planned in 📍Roadmap Nov 16, 2022
@humitos humitos added the Needed: design decision A core team decision is required label Nov 16, 2022
@ericholscher
Copy link
Member

@humitos I'm guessing we're doing the latest branches because we're reusing this code to sync versions. Does this change impact that functionality at all?

@humitos
Copy link
Member Author

humitos commented Nov 16, 2022

@ericholscher it should not affect the sync versions code since we use ls-remote for that:

use_lsremote = (
vcs_repository.supports_lsremote
and not vcs_repository.repo_exists()
and self.data.project.has_feature(Feature.VCS_REMOTE_LISTING)
)

It's currently under a feature flag, but it's enabled for all the projects 1 (and it's targeted to be deleted 😄 and make it the default behavior )

Footnotes

  1. noticed it's enabled only on community but it's not enabled on commercial. We should enable it there as well.

@ericholscher
Copy link
Member

Yea, that seems like step 1 then, then I'm 👍 on this approach. Probably deploy it behind a feature flag as well tho, just in case it breaks stuff for a small subset of projects somehow.

@humitos
Copy link
Member Author

humitos commented Nov 24, 2022

I enabled the feature flag 7 days ago in .com and things keep looking great. So, I think we are good to move forward with this implementation when we have the time.

@humitos humitos added Improvement Minor improvement to code Accepted Accepted issue on our roadmap and removed Needed: design decision A core team decision is required labels Nov 24, 2022
@humitos
Copy link
Member Author

humitos commented Jan 24, 2023

When working on this issue, consider the conversation we have in #9424 (comment) regarding how to get the default_branch from a repository.

@ericholscher ericholscher changed the title Build: clonning multibranch repository takes too long Build: cloning multibranch repository takes too long Jun 7, 2023
@benjaoming
Copy link
Contributor

benjaoming commented Jun 12, 2023

Here's a quick sketch of what I'm gonna try to do...

  • Add a new feature flag for the behavior. I'll call it GIT_CLONE_SINGLE_BRANCH unless other ideas.

  • Add branch=None to the VCS backend clone() method and try to update calls that it receives (this might just be 1 place that's relevant). Edit: Found a way to avoid this, but more build data has to be added to the VCS Backend so it's aware of what to do.

  • We can always clone without specifying --branch. This also makes sense to support so we don't have to implement the behavior in hg and bzr.

@benjaoming
Copy link
Contributor

This is slightly more complicated, once we start looking at PR branches and tags.

PR branches

These branches are named by GitHub for instance refs/pull/1234, but the naming convention doesn't apply to other Git providers (maybe some, maybe none).

So we skip these when cloning. There's a funny GITHUB_PR_PULL_PATTERN that we use later to fetch the remote specifier.

PR branches have a different history from the repository's default HEAD, so that's why AFAICT we always run fetch after the clone on external builds! We can clone an absolute minimal amount of data here.

(we're not talking about the source branch of a PR, which can exist in an entirely different repo)

Tags

If we want to reference a tag from the Git history, then we need a Git history that contains that tag. From my experiments, --no-single-branch does affect tag histories. I can't tell if it's because it fetches top 50 commits from all branches or because it's ALSO an instruction to fetch all tags. The git manpages are unclear here.

Current implementation seems to rely on tag references available after git clone.

There are 3 good options here:

  1. Use the tag name for git clone --branch <tag-name>, this will leave the checkout in "detached head" (ouch). I don't think we need to worry about that since we only do fetch and checkout operations afterwards.

  2. Start treating tags like external builds and call git fetch --tags (the .fetch() method already handles this) after cloning in order to fetch all tags.

  3. Continue using --no-single-branch as before. So we don't risk changes.

Here's an example of what needs to be compensated for if we remove --no-single-branch while using shallow clones:

git clone https://github.com/readthedocs/readthedocs.org --depth 50 test_single_branch --branch main --single-branch
cd test_single_branch

# 273 is the current number of tags
git tag |wc -l
3

With --no-single-branch

git clone https://github.com/readthedocs/readthedocs.org --depth 50 test_single_branch --branch main --no-single-branch
cd test_single_branch

# 273 is the current number of tags
git tag |wc -l
273

Simplifying things

We can use a more generic clone + fetch + checkout approach.

Circle CI has a different approach:

  1. Firstly, they do a git clone --no-checkout and then fetch afterwards. This pattern makes things a bit more generic, so it doesn't matter if it's the main branch or some other branch.
  2. Afterwards, there's a fetching of exactly the remote reference that they need git fetch --force origin +refs/pull/1234/head:refs/remotes/origin/pull/1234. Clearly, Circle CI has some special catering to GitHub here.
  3. Then they do a git checkout specifying the commit ID. They always have a commit ID for their builds, hence they can build things out of order and not have to rely on the latest commit in a branch.

This seems like a simpler way of doing things, so we don't have different clone commands issued depending on what kind of Build it is.

Proposal

  1. I'm guessing we want to go for a simple optimization here. Removing --no-single-branch is a great improvement for regular branch builds. So I'll continue to do that...

  2. ...regarding the issue with tags, we can keep --no-single-branch for tag builds so we don't need to care about changing this behavior for now.

  3. I think we should open an issue to generalize the clone+fetch+checkout pattern with some inspiration from the Circle CI method mentioned above.

  4. Another improvement to note: We're also removing --no-single-branch from pull request builds. But we are not adding --branch <name>, we don't need to because there's a git fetch operation afterwards.

@benjaoming
Copy link
Contributor

@humitos had this comment that I think is important to align on:

It's important to have the same behavior (git clone command) on PRs than on tags/branches. I want to avoid building successfully on PRs but failing on regular versions, or similar.

We can have almost the same clone command for branches, tags and PRs. Branches and tags can rely on the --branch option but PRs cannot use --branch since the PR branch is a complex part of the current .fetch() method. If we move that into clone, then I don't know what's gonna happen... for some Git providers, we can guess the PR branch reference from its ID, but for other Git providers we never use a branch name, we use only the commit ID which isn't a valid reference for --branch. With this approach, I worry that the scope/risk of this change becomes so large that we might as well start over with a new Git backend behind a feature flag and use git clone --no-checkout as the base for everything.

@ericholscher
Copy link
Member

@benjaoming I think the approach that CircleCI uses seems promising. I don't fully understand it, but I think it makes sense to me to do something like:

  • Get the repo locally
  • Run the commands to get the data we need for that type of build (eg. fetch tags for the tags)
  • Check out the thing we need

It seems that will keep the builds the same for PR branches & merged branches, but I think tags being special seems like the main issue here, and I'm OK with special-casing that logic I think.

@humitos
Copy link
Member Author

humitos commented Jun 13, 2023

Note that --no-checkout does nothing special regarding the data downloaded. It just doesn't perform a git checkout <hash/branch/tag> once the clone is performed. So, it's not relevant for our use case.

After reading the comments here, I think we should:

  • remove --no-single-branch and use --single-branch
  • clone the "default branch" of the repository with --depth 1 (download as minimal upfront data as possible)
  • fetch all the data required for the particular pr/branch/tag we want to build with git fetch --force --depth 50 origin <pr/tag/branch> (note that I'm using --depth 50 here to keep the same behavior as current production)
  • checkout the pr/branch/tag we just fetched

The previous approach in Git commands would be:

git clone --depth 1 --single-branch <repository URL>
git fetch --depth 50 --force origin <pr/branch/tag>
git checkout <pr/branch/tag>

A real example using test-builds and checking out a PR:

/tmp
▶ git clone --depth 1 --single-branch https://github.com/readthedocs/test-builds
Cloning into 'test-builds'...
remote: Enumerating objects: 11, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 11 (delta 0), reused 5 (delta 0), pack-reused 0
Receiving objects: 100% (11/11), done.

/tmp
▶ cd test-builds 

/tmp/test-builds  main ✔
▶ git fetch --depth 50 --force origin pull/2106/head:external-2106
remote: Enumerating objects: 95, done.
remote: Counting objects: 100% (95/95), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 87 (delta 43), reused 69 (delta 33), pack-reused 0
Unpacking objects: 100% (87/87), 14.76 KiB | 2.95 MiB/s, done.
From https://github.com/readthedocs/test-builds
 * [new ref]         refs/pull/2106/head -> external-2106
remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 1 (delta 0), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (1/1), 148 bytes | 148.00 KiB/s, done.
 * [new tag]         1.0                 -> 1.0
 * [new tag]         2.0                 -> 2.0
 * [new tag]         3.0                 -> 3.0
 * [new tag]         4.0                 -> 4.0
 * [new tag]         5.0                 -> 5.0
 * [new tag]         5.1                 -> 5.1
 * [new tag]         5.2                 -> 5.2
 * [new tag]         5.3                 -> 5.3
 * [new tag]         5.4                 -> 5.4
 * [new tag]         annotated-tag       -> annotated-tag
 * [new tag]         be-default          -> be-default
 * [new tag]         deleteme-1          -> deleteme-1
 * [new tag]         deleteme-2          -> deleteme-2
 * [new tag]         deleteme-3          -> deleteme-3
 * [new tag]         deleteme-4          -> deleteme-4
 * [new tag]         deleteme-5          -> deleteme-5
 * [new tag]         deleteme-6          -> deleteme-6
 * [new tag]         deleteme-7          -> deleteme-7
 * [new tag]         deleteme-8          -> deleteme-8
 * [new tag]         deleteme-9          -> deleteme-9
 * [new tag]         dont-be-default-    -> dont-be-default-
 * [new tag]         match-me            -> match-me
 * [new tag]         matche-me-again     -> matche-me-again
 * [new tag]         new-tag             -> new-tag
 * [new tag]         new-tag-activated   -> new-tag-activated
 * [new tag]         release-ünîø∂é -> release-ünîø∂é
 * [new tag]         tag-v1              -> tag-v1
 * [new tag]         tat-not-semver      -> tat-not-semver
 * [new tag]         v1.2.4              -> v1.2.4

/tmp/test-builds  main ✔
▶ git checkout external-2106
Switched to branch 'external-2106'

/tmp/test-builds  external-2106 ✔
▶

We could use git fetch --no-tags, but that won't create the tags locally and we won't be able to check out them after fetching the data. I would avoid to have a special case for this (at least for the first implementation), but we could if we want to reduce more data. If we go in that direction in the future, when building a tag we will need to remove the --no-tags from that command. That's all.

@benjaoming
Copy link
Contributor

benjaoming commented Jun 13, 2023

@humitos Awesome! That's very similar to what I've been wanting, too.

Re: the --no-checkout part just makes it explicit that we're cloning but we need to checkout something after, nothing more. Adding --depth 1 is enough to optimize the clone operation to be as sparse as possible AFAICT.. I wasn't actually sure if we were changing the depth, but I really have been wanting to set it to 1 while looking at the code and wonder "why 50?" :)

The --single-branch is actually the default behavior - but before, when we were using --no-single-branch, we were getting a lot of extra branch references (possibly the combination with --depth 50 matters too). Isn't that also why this issue only exists for self-hosted GitLab? Would expect also Gitea (codeberg.org) and Bitbucket to be affected here: #9464 (comment)

So if we go in this direction, we have to make sure that we are doing git fetch <remote pr reference> on the right reference patterns for each Git provider. Not a big issue, and we might end up fixing #9464 at the same time 👍

I would avoid to have a special case for this (at least for the first implementation), but we could if we want to reduce more data.

Having a simple approach where branches, tags and PRs are all cloned, fetched and checked out with the same commands will make it a lot easier to tweak more things 💯

Changes proposed here are still very contained: AFAICT, everything runs through .update() so by implementing this the right way, we only need to hide changes behind a feature flag once.

@benjaoming
Copy link
Contributor

Test with cpython:

image

@benjaoming
Copy link
Contributor

Improvements in the cctx project:

image

@benjaoming benjaoming moved this from Planned to Needs review in 📍Roadmap Jun 21, 2023
@benjaoming
Copy link
Contributor

cpython-previews before: 85 seconds
cpython-previews after: 9 seconds

ccxt before: 506 seconds
ccxt after: 141 seconds

@humitos humitos self-assigned this Jul 3, 2023
@github-project-automation github-project-automation bot moved this from Needs review to Done in 📍Roadmap Jul 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted Accepted issue on our roadmap Improvement Minor improvement to code
Projects
Archived in project
3 participants