
feature: Git integration for CodeCommit #927

Merged
merged 2 commits into from
Jul 12, 2019
34 changes: 27 additions & 7 deletions doc/overview.rst
@@ -185,7 +185,7 @@ Here is an example:

Use Scripts Stored in a Git Repository
--------------------------------------
When you create an estimator, you can specify a training script that is stored in a GitHub or other Git repository as the entry point for the estimator, so that you don't have to download the scripts locally.
When you create an estimator, you can specify a training script that is stored in a GitHub (or other Git) or CodeCommit repository as the entry point for the estimator, so that you don't have to download the scripts locally.
If you do so, source directory and dependencies should be in the same repo if they are needed. Git support can be enabled simply by providing ``git_config`` parameter
when creating an ``Estimator`` object. If Git support is enabled, then ``entry_point``, ``source_dir`` and ``dependencies``
should be relative paths in the Git repo if provided.
@@ -195,19 +195,26 @@ The ``git_config`` parameter includes fields ``repo``, ``branch``, ``commit``,
repository where your training script is stored. If you don't provide ``branch``, the default value 'master' is used.
If you don't provide ``commit``, the latest commit in the specified branch is used.
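
The defaulting rules described above can be sketched in a few lines. This is only an illustration, not the SDK's code: ``resolve_git_config`` is a hypothetical helper, and using ``None`` to stand for "latest commit on the branch" is an assumption made for this sketch.

```python
def resolve_git_config(git_config):
    """Return a copy of git_config with the documented defaults made explicit.

    Hypothetical helper for illustration; it is not part of the SageMaker SDK.
    """
    if "repo" not in git_config:
        # 'repo' is the only required field.
        raise ValueError("'repo' is required in git_config.")
    resolved = dict(git_config)  # copy so the caller's dict is left untouched
    resolved.setdefault("branch", "master")  # documented default branch
    # Assumption for this sketch: None means "use the latest commit on the branch".
    resolved.setdefault("commit", None)
    return resolved


config = resolve_git_config({"repo": "https://github.com/user/repo.git"})
print(config["branch"], config["commit"])
```

Returning a copy keeps the user-supplied dictionary unchanged, so the same ``git_config`` can be reused across several estimators.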

``2FA_enabled``, ``username``, ``password`` and ``token`` are used for authentication. Set ``2FA_enabled`` to 'True' if
two-factor authentication is enabled for the GitHub (or other Git) account, otherwise set it to 'False'.
If you do not provide a value for ``2FA_enabled``, a default value of 'False' is used.
``2FA_enabled``, ``username``, ``password`` and ``token`` are used for authentication. For GitHub
(or other Git) accounts, set ``2FA_enabled`` to 'True' if two-factor authentication is enabled for the
account, otherwise set it to 'False'. If you do not provide a value for ``2FA_enabled``, a default
value of 'False' is used. CodeCommit does not support two-factor authentication, so do not provide
``2FA_enabled`` with CodeCommit repositories.

For GitHub or other Git repositories,
if ``repo`` is an SSH URL, you should either have no passphrase for the SSH key pairs, or have the ``ssh-agent`` configured
so that you are not prompted for the SSH passphrase when you run a ``git clone`` command with SSH URLs. For SSH URLs, it
does not matter whether two-factor authentication is enabled.

If ``repo`` is an https URL, 2FA matters. When 2FA is disabled, either ``token`` or ``username``+``password`` will be
does not matter whether two-factor authentication is enabled. If ``repo`` is an HTTPS URL, 2FA matters. When 2FA is disabled, either ``token`` or ``username``+``password`` will be
used for authentication if provided (``token`` prioritized). When 2FA is enabled, only token will be used for
authentication if provided. If the required authentication info is not provided, the Python SDK will try to use local
credential storage to authenticate. If that also fails, an error is raised.
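
As a rough illustration of the HTTPS rules above for GitHub-like repositories, the decision can be written as a small function. This is a sketch under assumptions: ``choose_https_auth`` and its return labels are hypothetical names, and the ``is True`` check mirrors how the ``2FA_enabled`` flag is tested in ``git_utils.py``.

```python
def choose_https_auth(git_config):
    """Pick the authentication method for an HTTPS URL on a GitHub-like host.

    Hypothetical helper for illustration; the labels it returns are made up.
    """
    # A missing '2FA_enabled' key is treated as 2FA being disabled.
    two_fa_enabled = git_config.get("2FA_enabled", False) is True
    if "token" in git_config:
        # A token is preferred whether or not 2FA is enabled.
        return "token"
    if not two_fa_enabled and "username" in git_config and "password" in git_config:
        # Username+password is only usable when 2FA is disabled.
        return "username_password"
    # Otherwise fall back to whatever credentials Git has stored locally.
    return "local_credentials"


print(choose_https_auth({"username": "username", "password": "passw0rd!"}))
print(choose_https_auth({"2FA_enabled": True, "username": "username", "password": "passw0rd!"}))
```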

For CodeCommit repos, please make sure you have completed the authentication setup: https://docs.aws.amazon.com/codecommit/latest/userguide/setting-up.html.
2FA is not supported by CodeCommit, so ``2FA_enabled`` should not be provided. There is no token in CodeCommit, so
``token`` should not be provided either. If ``repo`` is an SSH URL, the requirements are the same as for GitHub repos.
If ``repo`` is an HTTPS URL, ``username``+``password`` will be used for authentication if both are provided; otherwise, the
Python SDK will try to use either the CodeCommit credential helper or local credential storage for authentication.

Here are some examples of creating estimators with Git support:

.. code:: python
@@ -276,6 +283,19 @@ Here are some examples of creating estimators with Git support:
train_instance_count=1,
train_instance_type='local')

.. code:: python

# This example specifies a CodeCommit repository and tries to authenticate with the provided username+password
git_config = {'repo': 'https://git-codecommit.us-west-2.amazonaws.com/v1/repos/your_repo_name',
'username': 'username',
'password': 'passw0rd!'}

mx_estimator = MXNet(entry_point='mxnet/mnist.py',
role='SageMakerRole',
git_config=git_config,
train_instance_count=1,
train_instance_type='ml.c4.xlarge')

Git support can be used not only for training jobs, but also for hosting models. The usage is the same as the above,
and ``git_config`` should be provided when creating model objects, e.g. ``TensorFlowModel``, ``MXNetModel``, ``PyTorchModel``.

37 changes: 23 additions & 14 deletions src/sagemaker/estimator.py
@@ -976,11 +976,10 @@ def __init__(

You can assign entry_point='src/train.py'.
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``,
``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token`` (default: None). The fields are
optional except ``repo``. If ``branch`` is not specified, master branch will be used. If ``commit``
is not specified, the latest commit in the required branch will be used. 'branch' and 'commit' are
optional. If 'branch' is not specified, 'master' branch will be used. If 'commit' is not specified,
the latest commit in the required branch will be used.
``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token``. The ``repo`` field is required.
All other fields are optional. ``repo`` specifies the Git repository where your training script is
stored. If you don't provide ``branch``, the default value 'master' is used. If you don't provide
``commit``, the latest commit in the specified branch is used.
Example:

The following config:
@@ -991,15 +990,25 @@ def __init__(

results in cloning the repo specified in 'repo', then checking out the 'master' branch, and then checking out
the specified commit.
``2FA_enabled``, ``username``, ``password`` and ``token`` are for authentication purpose.
``2FA_enabled`` must be ``True`` or ``False`` if it is provided. If ``2FA_enabled`` is not provided,
we consider 2FA as disabled. For GitHub and other Git repos, when ssh urls are provided, it does not
make a difference whether 2FA is enabled or disabled; an ssh passphrase should be in local storage.
When https urls are provided: if 2FA is disabled, then either token or username+password will
be used for authentication if provided (token prioritized); if 2FA is enabled, only token will
be used for authentication if provided. If required authentication info is not provided, python SDK
will try to use local credentials storage to authenticate. If that fails either, an error message will
be thrown.
``2FA_enabled``, ``username``, ``password`` and ``token`` are used for authentication. For GitHub
(or other Git) accounts, set ``2FA_enabled`` to 'True' if two-factor authentication is enabled for the
account, otherwise set it to 'False'. If you do not provide a value for ``2FA_enabled``, a default
value of 'False' is used. CodeCommit does not support two-factor authentication, so do not provide
"2FA_enabled" with CodeCommit repositories.

For GitHub and other Git repos, when SSH URLs are provided, it doesn't matter whether 2FA is
enabled or disabled; you should either have no passphrase for the SSH key pairs, or have the ssh-agent
configured so that you will not be prompted for SSH passphrase when you do 'git clone' command with SSH
URLs. When HTTPS URLs are provided: if 2FA is disabled, then either token or username+password will be
used for authentication if provided (token prioritized); if 2FA is enabled, only token will be used for
authentication if provided. If the required authentication info is not provided, the Python SDK will try to use
local credential storage to authenticate. If that also fails, an error is raised.

For CodeCommit repos, 2FA is not supported, so '2FA_enabled' should not be provided. There is no token
in CodeCommit, so 'token' should not be provided either. When 'repo' is an SSH URL, the requirements are
the same as for GitHub-like repos. When 'repo' is an HTTPS URL, username+password will be used for
authentication if both are provided; otherwise, the Python SDK will try to use either the CodeCommit
credential helper or local credential storage for authentication.
source_dir (str): Path (absolute or relative) to a directory with any other training
source code dependencies aside from the entry point file (default: None). Structure within this
directory is preserved when training on Amazon SageMaker. If 'git_config' is provided,
77 changes: 64 additions & 13 deletions src/sagemaker/git_utils.py
@@ -25,18 +25,26 @@ def git_clone_repo(git_config, entry_point, source_dir=None, dependencies=None):
and set ``entry_point``, ``source_dir`` and ``dependencies`` to the right file or directory in the repo cloned.

Args:
git_config (dict[str, object]): Git configurations used for cloning files, including ``repo``, ``branch``,
``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token``. The fields are optional except
``repo``. If ``branch`` is not specified, master branch will be used. If ``commit`` is not specified,
the latest commit in the required branch will be used. ``2FA_enabled``, ``username``, ``password`` and
``token`` are for authentication purpose.
``2FA_enabled`` must be ``True`` or ``False`` if it is provided. If ``2FA_enabled`` is not provided, we
consider 2FA as disabled. For GitHub and other Git repos, when ssh urls are provided, it does not make a
difference whether 2FA is enabled or disabled; an ssh passphrase should be in local storage. When
https urls are provided: if 2FA is disabled, then either token or username+password will be used for
authentication if provided (token prioritized); if 2FA is enabled, only token will be used for
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``,
``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token``. The ``repo`` field is required.
All other fields are optional. ``repo`` specifies the Git repository where your training script is stored.
If you don't provide ``branch``, the default value 'master' is used. If you don't provide ``commit``,
the latest commit in the specified branch is used. ``2FA_enabled``, ``username``, ``password`` and
``token`` are used for authentication. If ``2FA_enabled`` is not provided, 2FA is treated as disabled.

For GitHub and GitHub-like repos, when SSH URLs are provided, it doesn't matter whether 2FA is
enabled or disabled; you should either have no passphrase for the SSH key pairs, or have the ssh-agent
configured so that you will not be prompted for SSH passphrase when you do 'git clone' command with SSH
URLs. When https URLs are provided: if 2FA is disabled, then either token or username+password will be
used for authentication if provided (token prioritized); if 2FA is enabled, only token will be used for
authentication if provided. If the required authentication info is not provided, the Python SDK will try to use
local credential storage to authenticate. If that also fails, an error is raised.

For CodeCommit repos, 2FA is not supported, so '2FA_enabled' should not be provided. There is no token in
CodeCommit, so 'token' should not be provided either. When 'repo' is an SSH URL, the requirements are the
same as for GitHub-like repos. When 'repo' is an HTTPS URL, username+password will be used for
authentication if both are provided; otherwise, the Python SDK will try to use either the CodeCommit
credential helper or local credential storage for authentication.
entry_point (str): A relative location to the Python source file which should be executed as the entry point
to training or model hosting in the Git repo.
source_dir (str): A relative location to a directory with other training or model hosting source code
@@ -115,7 +123,12 @@ def _generate_and_run_clone_command(git_config, dest_dir):
Raises:
CalledProcessError: If failed to clone git repo.
"""
_clone_command_for_github_like(git_config, dest_dir)
if git_config["repo"].startswith("https://git-codecommit") or git_config["repo"].startswith(
"ssh://git-codecommit"
):
_clone_command_for_codecommit(git_config, dest_dir)
else:
_clone_command_for_github_like(git_config, dest_dir)
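
The dispatch above depends only on the prefix of ``git_config["repo"]``. A minimal standalone sketch of that check follows. The names ``CODECOMMIT_PREFIXES``, ``is_codecommit_repo`` and ``pick_clone_handler`` are hypothetical, and the handler is returned as a label rather than called, so that no ``git clone`` is run.

```python
# The two prefixes tested in the hunk above, collected in one tuple.
CODECOMMIT_PREFIXES = ("https://git-codecommit", "ssh://git-codecommit")


def is_codecommit_repo(repo_url):
    """Return True if the URL has one of the CodeCommit prefixes."""
    # str.startswith accepts a tuple and matches if any prefix matches.
    return repo_url.startswith(CODECOMMIT_PREFIXES)


def pick_clone_handler(git_config):
    """Name the clone path that would be taken for this config (illustrative labels)."""
    return "codecommit" if is_codecommit_repo(git_config["repo"]) else "github_like"


print(pick_clone_handler(
    {"repo": "https://git-codecommit.us-west-2.amazonaws.com/v1/repos/your_repo_name"}
))
print(pick_clone_handler({"repo": "https://github.com/user/repo.git"}))
```

Passing a tuple to ``str.startswith`` is one way to keep both prefixes in a single place, which may help if the same check is needed again inside the CodeCommit handler.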


def _clone_command_for_github_like(git_config, dest_dir):
@@ -136,14 +149,14 @@ def _clone_command_for_github_like(git_config, dest_dir):
if not is_https and not is_ssh:
raise ValueError("Invalid Git url provided.")
if is_ssh:
_clone_command_for_github_like_ssh(git_config, dest_dir)
_clone_command_for_ssh(git_config, dest_dir)
elif "2FA_enabled" in git_config and git_config["2FA_enabled"] is True:
_clone_command_for_github_like_https_2fa_enabled(git_config, dest_dir)
else:
_clone_command_for_github_like_https_2fa_disabled(git_config, dest_dir)


def _clone_command_for_github_like_ssh(git_config, dest_dir):
def _clone_command_for_ssh(git_config, dest_dir):
if "username" in git_config or "password" in git_config or "token" in git_config:
warnings.warn("SSH cloning, authentication information in git config will be ignored.")
_run_clone_command(git_config["repo"], dest_dir)
@@ -173,6 +186,44 @@ def _clone_command_for_github_like_https_2fa_enabled(git_config, dest_dir):
_run_clone_command(updated_url, dest_dir)


def _clone_command_for_codecommit(git_config, dest_dir):
"""Check whether a git_config param representing a CodeCommit repo is valid. If it is, create the command to
git clone the repo and run it.

Args:
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``
and ``commit``.
dest_dir (str): The local directory to clone the Git repo into.

Raises:
ValueError: If git_config['repo'] is in the wrong format.
CalledProcessError: If failed to clone git repo.
"""
is_https = git_config["repo"].startswith("https://git-codecommit")
is_ssh = git_config["repo"].startswith("ssh://git-codecommit")
if not is_https and not is_ssh:
raise ValueError("Invalid Git url provided.")
if "2FA_enabled" in git_config:
warnings.warn("CodeCommit does not support 2FA, '2FA_enabled' will be ignored.")
if "token" in git_config:
warnings.warn("There are no tokens in CodeCommit, the token provided will be ignored.")
if is_ssh:
_clone_command_for_ssh(git_config, dest_dir)
else:
_clone_command_for_codecommit_https(git_config, dest_dir)


def _clone_command_for_codecommit_https(git_config, dest_dir):
updated_url = git_config["repo"]
if "username" in git_config and "password" in git_config:
updated_url = _insert_username_and_password_to_repo_url(
url=git_config["repo"], username=git_config["username"], password=git_config["password"]
)
elif "username" in git_config or "password" in git_config:
warnings.warn("Credentials provided in git config will be ignored.")
_run_clone_command(updated_url, dest_dir)
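
The body of ``_insert_username_and_password_to_repo_url`` is not shown in this diff. The sketch below is one plausible way to build the clone URL, not the SDK's implementation: ``insert_credentials`` and ``codecommit_https_clone_url`` are hypothetical names, and percent-encoding the credentials with ``urllib.parse.quote_plus`` is an assumption made here so that characters such as ``!`` or ``@`` do not break the URL.

```python
import warnings
from urllib.parse import quote_plus


def insert_credentials(url, username, password):
    """Embed percent-encoded credentials into an HTTPS URL (illustrative only)."""
    prefix = "https://"
    if not url.startswith(prefix):
        raise ValueError("Expected an HTTPS URL.")
    # Assumption: encode both parts so special characters stay URL-safe.
    credentials = "{}:{}".format(quote_plus(username), quote_plus(password))
    return "{}{}@{}".format(prefix, credentials, url[len(prefix):])


def codecommit_https_clone_url(git_config):
    """Return the URL that would be cloned, following the branches in the hunk above."""
    url = git_config["repo"]
    if "username" in git_config and "password" in git_config:
        return insert_credentials(url, git_config["username"], git_config["password"])
    if "username" in git_config or "password" in git_config:
        # Only one of the two was given, so it cannot be used on its own.
        warnings.warn("Credentials provided in git config will be ignored.")
    # Unchanged URL: Git falls back to the credential helper or local storage.
    return url


print(codecommit_https_clone_url({
    "repo": "https://git-codecommit.us-west-2.amazonaws.com/v1/repos/your_repo_name",
    "username": "username",
    "password": "passw0rd!",
}))
```

Credentials embedded in a URL can end up in process listings and logs, so the credential helper path may be preferable where it is available.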


def _run_clone_command(repo_url, dest_dir):
"""Run the 'git clone' command with the repo url and the directory to clone the repo into.
