
Execute checkout step inside docker #6253


Closed

Conversation

stsewd
Member

@stsewd stsewd commented Oct 8, 2019

This is a POC from #5673 (comment)

I'm making a PR so it's easier to discuss some problems. Or so I hope.

Design decisions

How to install/copy the script?

Currently, I'm just mounting a volume inside the container, but we could make it an installable package.
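
For illustration, a rough sketch of the volume approach with docker-py (host path, image name, and script name are assumptions; /rtd_scripts matches the RTD_SCRIPTS_PATH setting added in this PR):

```python
import docker

client = docker.from_env()

# Mount the scripts directory read-only into the build container and run
# the checkout script from there. Paths and image name are hypothetical.
client.containers.run(
    "readthedocs/build:latest",
    "python /rtd_scripts/checkout.py",
    volumes={"/srv/rtd_scripts": {"bind": "/rtd_scripts", "mode": "ro"}},
)
```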

How to install the deps?

Currently, I'm installing gitpython in the global python environment of the container, but we can create a virtualenv just for this package (we need to refactor our class that creates virtualenvs for this).
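
If we did go the virtualenv route, a minimal sketch with docker-py could look like this (container name and paths are hypothetical):

```python
import docker

client = docker.from_env()
container = client.containers.get("build-12345")  # hypothetical container name

# Create an isolated virtualenv for the checkout tooling and install
# gitpython into it, leaving the image's global environment untouched.
container.exec_run("python3 -m venv /opt/checkout-env")
container.exec_run("/opt/checkout-env/bin/pip install gitpython")
```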

Other problems

We have a chicken-and-egg problem: we don't know which build image to use before the checkout step, so we'll be using two build environments, one for the checkout process (we can use any docker image we want) and another for the normal build process. We can keep it this way and initialize the ssh agent on each image, or use the providers' API to query the config file before cloning. The second option introduces more changes and removes support for repos that are not hosted on those providers, so I think we are just fine with the first option for now (this may be a problem when we move to lambda functions, I think?)
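
To illustrate the two-environment flow (image names, volume name, build command, and the config helper are all hypothetical):

```python
import docker

client = docker.from_env()

# A shared named volume carries the checked-out sources between both containers.
volumes = {"build-12345": {"bind": "/project", "mode": "rw"}}

# Step 1: checkout in a minimal image -- any image with git would do.
client.containers.run(
    "alpine/git",
    "clone https://github.com/org/repo.git /project",
    volumes=volumes,
    remove=True,
)

# Step 2: only now can we read the config file and pick the real build image.
# build_image = parse_config("/project/.readthedocs.yml")  # hypothetical helper
client.containers.run(
    "readthedocs/build:latest",
    "make -C /project html",  # placeholder build command
    volumes=volumes,
    remove=True,
)
```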

@stsewd stsewd changed the title Stsewd/get git info from container Execute checkout step inside docker Oct 8, 2019
@stsewd stsewd added the Needed: design decision label Oct 8, 2019
Member

@humitos humitos left a comment


The code looks good. Although, I'm not sold on the idea behind this. I'm still in doubt about what's a good way to implement this --I'm not seeing clearly what path to follow :(

I'm just mounting a volume inside the container, but we can make it an installable package.

I think mounting is easier. It will allow us to make changes to the script without adding the overhead of having another package in the release/deploy process.

I'm installing gitpython in the global python environment of the container, but we can create a virtualenv just for this package (we need to refactor our class that creates virtualenvs for this).

Using the global site-packages seems enough to me at this point.

We have a chicken-and-egg problem: we don't know which build image to use before the checkout step, so we'll be using two build environments, one for the checkout process (we can use any docker image we want) and another for the normal build process

This is my main concern. It feels too complicated to spin up two images. How are we going to share the cloned code between these two images?

It's probably slow as well, since we are going to spin up an 8 GB image just to run git. We may want a small image that contains only the packages needed for VCS...

Another thing to consider here is that with this PR we are only solving the git case, but we need to do the same for the others as well --where we are using some custom code that we wrote for the parsing.

Finally, I think that I lost the main goal of all of this. What's the problem we are solving by running git inside the container? I'd like to have it 100% isolated, but it seems that we are adding some issues (and I'm not sure what we are solving)

There are some changes that look unrelated to this PR (i.e. docker= argument)

@@ -218,6 +218,7 @@ def USE_PROMOS(self): # noqa
LOGS_ROOT = os.path.join(SITE_ROOT, 'logs')
PRODUCTION_ROOT = os.path.join(SITE_ROOT, 'prod_artifacts')
PRODUCTION_MEDIA_ARTIFACTS = os.path.join(PRODUCTION_ROOT, 'media')
RTD_SCRIPTS_PATH = os.path.join(SITE_ROOT, 'rtd_scripts')

I would put this inside the doc_environments/ directory or similar, but not at the top of the repository.

@@ -54,6 +60,39 @@ def _get_clone_url(self):
# clone_url = 'git://%s' % (hacked_url)
return self.repo_url

def _get_repo_info(self):

This can probably be a @property called _repo.

@@ -0,0 +1,145 @@
#!/usr/bin/env python

This file should mention git in its name.

@stsewd
Member Author

stsewd commented Nov 4, 2019

How are we going to share the cloned code between these two images?

We have a shared volume; it's the same as how the checkout step is executed now.

Another thing to consider here is that with this PR we are only solving the git case, but we need to do the same for the others as well --where we are using some custom code that we wrote for the parsing.

Those commands are executed inside the build environment (docker or local); we only need to do this for git because we are using a library (gitpython)
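
To make the distinction concrete, here's roughly what the in-process usage looks like (path is illustrative):

```python
from git import Repo

# Other VCS steps are plain shell commands the build environment can proxy
# into docker, but gitpython runs inside the builder's own Python process:
repo = Repo("/path/to/checkout")
default_branch = repo.active_branch.name
branches = [ref.name for ref in repo.branches]
```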

Finally, I think that I lost the main goal of all of this. What's the problem we are solving by running git inside the container? I'd like to have it 100% isolated, but it seems that we are adding some issues (and I'm not sure what we are solving)

We want to be able to run the ssh agent inside docker for .com.

@humitos
Member

humitos commented Nov 4, 2019

Those commands are executed inside the build environment (docker or local); we only need to do this for git because we are using a library (gitpython)

So, considering that we already hit some gitpython bugs and that all this code/PR is needed only for git, isn't it better to go back to our own implementation that parses the git command output and forget about this?

@stsewd
Member Author

stsewd commented Nov 4, 2019

So, considering that we already hit some gitpython bugs and that all this code/PR is needed only for git, isn't it better to go back to our own implementation that parses the git command output and forget about this?

That was an option: #5673.
But I remember that Anthony was in favor of the second option too.

@agjohnson
Contributor

I think we need a call on this one to discuss options and this implementation.

On my side: regarding an option that uses GitPython vs parsing the output, I suppose I'd rather rely on GitPython to do this. Mostly for the reason that we'll have to maintain a connection to git/hg no matter what -- GitPython is at least a stronger choice for this job. I'm partially unsure about the connection mechanism between the application and the container. I do feel like there is maybe something more we could do here. The biggest hurdle is how code is organized, it does feel a little lost in a one-off script now. We could explore making this a separate package if it would be cleaner.

@agjohnson
Contributor

We didn't come to a conclusion here, I think because we all have reservations about the solutions we've noted so far -- and because we don't really have any better opinions.

This change does address a security issue around exploiting git, which would be good to address eventually.

The other part of this work is to unblock sharing an ssh-agent between git checkout and docker operations. This is perhaps the more immediate change we can introduce, and directly makes the user's experience better.

Two options here are (see the sketch after this list):

  • Keep the ssh-agent out of docker and pass in the socket so steps can have access to the agent. The ssh-agent on the host system dies after a timeout.
  • Put the ssh-agent in docker and expose the socket to the host system for git checkout steps. The ssh-agent's lifecycle is directly tied to the docker container -- it lives the entire time the container is in operation and dies as soon as the container is killed.
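
A minimal sketch of the first option with docker-py, assuming the host agent's socket is available in SSH_AUTH_SOCK (image name and clone URL are placeholders):

```python
import os

import docker

client = docker.from_env()
sock = os.environ["SSH_AUTH_SOCK"]  # socket of the host's ssh-agent

# Bind-mount the agent socket into the container and point SSH_AUTH_SOCK at
# it, so git inside the container can authenticate through the host agent.
client.containers.run(
    "readthedocs/build:latest",
    "git clone git@github.com:org/repo.git /project",
    volumes={sock: {"bind": "/tmp/ssh-agent.sock", "mode": "rw"}},
    environment={"SSH_AUTH_SOCK": "/tmp/ssh-agent.sock"},
)
```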

@ericholscher
Member

ericholscher commented Dec 4, 2019

I guess I want to start from first principles a bit -- is there a reason we can't run the code we currently have, just inside Docker? That seems to me to be the most obvious first step, but I don't know why that approach doesn't work.

Thinking on this more -- it's because we'd need the entire RTD codebase running inside docker, because gitpython is running in process. I think going back to the previous approach where we just run the git commands directly and pass them into docker like all the other commands feels like a better outcome now that I think about it.

If I remember correctly, we needed gitpython because of unicode branch names. I'm wondering if upgrading to Python 3 has solved this for us, and we can just go back to calling git directly.
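
For what it's worth, on Python 3 a direct call can decode unicode branch names cleanly, e.g.:

```python
import subprocess

# `git for-each-ref` emits stable, machine-readable output, so there's no
# fragile parsing of `git branch`'s human-oriented formatting, and decoding
# as UTF-8 handles unicode branch names.
out = subprocess.run(
    ["git", "for-each-ref", "--format=%(refname:short)", "refs/remotes/origin"],
    capture_output=True,
    check=True,
)
branches = out.stdout.decode("utf-8").splitlines()
```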

@agjohnson
Contributor

It wasn't just unicode branches, it was the actual parsing where we weren't consistently parsing the output correctly, and gitpython does this all for us. It does seem like a lot of duplication of effort for a jankier solution.

We can also look into gitpython/hg internals and see if it's possible to replace the execution method, or to parse a different i/o stream than the system i/o (i/o from docker exec in our case).
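
As an untested sketch of that idea, GitPython exposes a wrapper-class hook on Repo that could route every git invocation through docker exec (the container name is hypothetical):

```python
import git


class DockerGit(git.Git):
    """Run each git command through `docker exec` instead of on the host."""

    def execute(self, command, **kwargs):
        if isinstance(command, (list, tuple)):
            # 'build-container' is a hypothetical container name.
            command = ["docker", "exec", "build-container", *command]
        return super().execute(command, **kwargs)


# GitPython instantiates its command wrapper from this class attribute.
git.Repo.GitCommandWrapperType = DockerGit
```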

@stsewd
Member Author

stsewd commented Dec 4, 2019

Closing in favor of #6436

There is still one decision left, related to using a docker image with just git versus using our build images.

@stsewd stsewd closed this Dec 4, 2019
@stsewd stsewd deleted the stsewd/get-git-info-from-container branch December 4, 2019 23:36