Skip to content

feature: deal with credentials for Git support for GitHub #914

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jul 11, 2019
45 changes: 38 additions & 7 deletions doc/overview.rst
Original file line number Diff line number Diff line change
Expand Up @@ -185,16 +185,37 @@ Here is an example:

Git Support
-----------
If you have your training scripts in your GitHub repository, you can use them directly without the trouble to download
them to local machine. Git support can be enabled simply by providing ``git_config`` parameter when initializing an
estimator. If Git support is enabled, then ``entry_point``, ``source_dir`` and ``dependencies`` should all be relative
paths in the Git repo. Note that if you decided to use Git support, then everything you need for ``entry_point``,
``source_dir`` and ``dependencies`` should be in a single Git repo.
If you have your training scripts in your GitHub (or GitHub-like) repository, you can use them directly without the
trouble to download them to local machine. Git support can be enabled simply by providing ``git_config`` parameter
when initializing an estimator. If Git support is enabled, then ``entry_point``, ``source_dir`` and ``dependencies``
should all be relative paths in the Git repo. Note that if you decided to use Git support, then everything you need
for ``entry_point``, ``source_dir`` and ``dependencies`` should be in a single Git repo.

Here are ways to specify ``git_config``:
The ``git_config`` parameter includes arguments ``repo``, ``branch``, ``commit``, ``2FA_enabled``, ``username``,
``password`` and ``token``. Except for ``repo``, the other arguments are optional. ``repo`` specifies the Git repository
that you want to use. If ``branch`` is not provided, master branch will be used. If ``commit`` is not provided,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repo specifies the Git repository where your training script is stored. If you don't provide branch, the default value ''master'' is used. If you don't provide commit, the latest commit in the specified branch is used.

the latest commit in the required branch will be used.

``2FA_enabled``, ``username``, ``password`` and ``token`` are for authentication purpose. ``2FA_enabled`` should
be ``True`` or ``False``, provides the information whether two-factor authentication is enabled for the GitHub (or GitHub-like) account.
If ``2FA_enabled`` is not provided, we consider 2FA as disabled.

If ``repo`` is an ssh url, you should either have no passphrase for the ssh key pairs, or have the ssh-agent configured
so that you will not be prompted for ssh passphrase when you do 'git clone' command with ssh urls. For ssh urls, it
makes no difference whether the 2FA is enabled or disabled.

If ``repo`` is an https url, 2FA matters. When 2FA is disabled, either ``token`` or ``username``+``password`` will be
used for authentication if provided (``token`` prioritized). When 2FA is enabled, only token will be used for
authentication if provided. If required authentication info is not provided, python SDK will try to use local
credentials storage to authenticate. If that fails either, an error message will be thrown.

Here are some ways to specify ``git_config``:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ask @eslesar-aws to review the writing.


.. code:: python

# The following three examples do not provide Git credentials, so python SDK will try to use
# local credential storage.

# Specifies the git_config parameter
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
'branch': 'branch1',
Expand All @@ -209,6 +230,17 @@ Here are ways to specify ``git_config``:
# 'master' branch will be used.
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git'}

# This example does not provide '2FA_enabled', so 2FA is treated as disabled by default. 'username' and
# 'password' are provided for authentication
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
'username': 'username',
'password': 'passw0rd!'}

# This example specifies that 2FA is enabled, and token is provided for authentication
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
'2FA_enabled': True,
'token': 'your-token'}

The following are some examples to define estimators with Git support:

.. code:: python
Expand Down Expand Up @@ -240,7 +272,6 @@ The following are some examples to define estimators with Git support:
train_instance_count=1,
train_instance_type='ml.c4.xlarge')

When Git support is enabled, users can still use local mode in the same way.

Training Metrics
----------------
Expand Down
20 changes: 15 additions & 5 deletions src/sagemaker/estimator.py
Original file line number Diff line number Diff line change
Expand Up @@ -20,10 +20,10 @@
from abc import abstractmethod
from six import with_metaclass
from six import string_types

import sagemaker
from sagemaker import git_utils
from sagemaker.analytics import TrainingJobAnalytics

from sagemaker.fw_utils import (
create_image_uri,
tar_and_upload_dir,
Expand Down Expand Up @@ -976,10 +976,12 @@ def __init__(
>>> |----- test.py

You can assign entry_point='src/train.py'.
git_config (dict[str, str]): Git configurations used for cloning files, including 'repo', 'branch'
and 'commit' (default: None).
'branch' and 'commit' are optional. If 'branch' is not specified, 'master' branch will be used. If
'commit' is not specified, the latest commit in the required branch will be used.
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``,
``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token`` (default: None). The fields are
optional except ``repo``. If ``branch`` is not specified, master branch will be used. If ``commit``
is not specified, the latest commit in the required branch will be used. 'branch' and 'commit' are
optional. If 'branch' is not specified, 'master' branch will be used. If 'commit' is not specified,
the latest commit in the required branch will be used.
Example:

The following config:
Expand All @@ -990,6 +992,14 @@ def __init__(

results in cloning the repo specified in 'repo', then checkout the 'master' branch, and checkout
the specified commit.
``2FA_enabled``, ``username``, ``password`` and ``token`` are for authentication purpose. If
``2FA_enabled`` is not provided, we consider 2FA as disabled. For GitHub and GitHub-like repos, when
ssh urls are provided, it does not make a difference whether 2FA is enabled or disabled; an ssh
passphrase should be in local storage. When https urls are provided: if 2FA is disabled, then either
token or username+password will be used for authentication if provided (token prioritized); if 2FA is
enabled, only token will be used for authentication if provided. If required authentication info is
not provided, python SDK will try to use local credentials storage to authenticate. If that fails
either, an error message will be thrown.
source_dir (str): Path (absolute or relative) to a directory with any other training
source code dependencies aside from the entry point file (default: None). Structure within this
directory are preserved when training on Amazon SageMaker. If 'git_config' is provided,
Expand Down
205 changes: 176 additions & 29 deletions src/sagemaker/git_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,16 +16,27 @@
import six
import subprocess
import tempfile
import warnings
from six.moves import urllib


def git_clone_repo(git_config, entry_point, source_dir=None, dependencies=None):
"""Git clone repo containing the training code and serving code. This method also validate ``git_config``,
and set ``entry_point``, ``source_dir`` and ``dependencies`` to the right file or directory in the repo cloned.

Args:
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``
and ``commit``. ``branch`` and ``commit`` are optional. If ``branch`` is not specified, master branch
will be used. If ``commit`` is not specified, the latest commit in the required branch will be used.
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``,
``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token``. The fields are optional except
``repo``. If ``branch`` is not specified, master branch will be used. If ``commit`` is not specified,
the latest commit in the required branch will be used. ``2FA_enabled``, ``username``, ``password`` and
``token`` are for authentication purpose.
If ``2FA_enabled`` is not provided, we consider 2FA as disabled. For GitHub and GitHub-like repos, when
ssh urls are provided, it does not make a difference whether 2FA is enabled or disabled; an ssh passphrase
should be in local storage. When https urls are provided: if 2FA is disabled, then either token or
username+password will be used for authentication if provided (token prioritized); if 2FA is enabled,
only token will be used for authentication if provided. If required authentication info is not provided,
python SDK will try to use local credentials storage to authenticate. If that fails either, an error message
will be thrown.
entry_point (str): A relative location to the Python source file which should be executed as the entry point
to training or model hosting in the Git repo.
source_dir (str): A relative location to a directory with other training or model hosting source code
Expand All @@ -41,16 +52,14 @@ def git_clone_repo(git_config, entry_point, source_dir=None, dependencies=None):
ValueError: If 1. entry point specified does not exist in the repo
2. source dir specified does not exist in the repo
3. dependencies specified do not exist in the repo
4. git_config is in bad format
4. wrong format is provided for git_config

Returns:
dict: A dict that contains the updated values of entry_point, source_dir and dependencies
dict: A dict that contains the updated values of entry_point, source_dir and dependencies.
"""
if entry_point is None:
raise ValueError("Please provide an entry point.")
_validate_git_config(git_config)
repo_dir = tempfile.mkdtemp()
subprocess.check_call(["git", "clone", git_config["repo"], repo_dir])
_generate_and_run_clone_command(git_config, repo_dir)

_checkout_branch_and_commit(git_config, repo_dir)

Expand All @@ -72,44 +81,182 @@ def git_clone_repo(git_config, entry_point, source_dir=None, dependencies=None):
updated_paths["entry_point"] = os.path.join(repo_dir, entry_point)
else:
raise ValueError("Entry point does not exist in the repo.")

updated_paths["dependencies"] = []
for path in dependencies:
if os.path.exists(os.path.join(repo_dir, path)):
updated_paths["dependencies"].append(os.path.join(repo_dir, path))
else:
raise ValueError("Dependency {} does not exist in the repo.".format(path))
if dependencies is None:
updated_paths["dependencies"] = None
else:
updated_paths["dependencies"] = []
for path in dependencies:
if os.path.exists(os.path.join(repo_dir, path)):
updated_paths["dependencies"].append(os.path.join(repo_dir, path))
else:
raise ValueError("Dependency {} does not exist in the repo.".format(path))
return updated_paths


def _validate_git_config(git_config):
"""check if a git_config param is valid
if "repo" not in git_config:
raise ValueError("Please provide a repo for git_config.")
string_args = ["repo", "branch", "commit", "username", "password", "token"]
for key in string_args:
if key in git_config and not isinstance(git_config[key], six.string_types):
raise ValueError("'{}' must be a string.".format(key))
if "2FA_enabled" in git_config and not isinstance(git_config["2FA_enabled"], bool):
raise ValueError("'2FA_enabled' must be a bool value.")
allowed_keys = ["repo", "branch", "commit", "2FA_enabled", "username", "password", "token"]
for k in list(git_config):
if k not in allowed_keys:
raise ValueError("Unexpected git_config argument(s) provided!")


def _generate_and_run_clone_command(git_config, repo_dir):
"""check if a git_config param is valid, if it is, create the command to git clone the repo, and run it.

Args:
git_config ((dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``
and ``commit``.
repo_dir (str): The local directory to clone the Git repo into.

Raises:
ValueError: If:
1. git_config has no key 'repo'
2. git_config['repo'] is in the wrong format.
CalledProcessError: If failed to clone git repo.
"""
if "repo" not in git_config:
raise ValueError("Please provide a repo for git_config.")
allowed_keys = ["repo", "branch", "commit"]
for key in allowed_keys:
if key in git_config and not isinstance(git_config[key], six.string_types):
raise ValueError("'{}' should be a string".format(key))
for key in git_config:
if key not in allowed_keys:
raise ValueError("Unexpected argument(s) provided for git_config!")
exists = {
"2FA_enabled": "2FA_enabled" in git_config and git_config["2FA_enabled"] is True,
"username": "username" in git_config,
"password": "password" in git_config,
"token": "token" in git_config,
}
_clone_command_for_github_like(git_config, repo_dir, exists)


def _clone_command_for_github_like(git_config, repo_dir, exists):
"""check if a git_config param representing a GitHub (or like) repo is valid, if it is, create the command to
git clone the repo, and run it.

Args:
git_config ((dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``
and ``commit``.
repo_dir (str): The local directory to clone the Git repo into.

Raises:
ValueError: If git_config['repo'] is in the wrong format.
CalledProcessError: If failed to clone git repo.
"""
is_https = git_config["repo"].startswith("https://")
is_ssh = git_config["repo"].startswith("git@")
if not is_https and not is_ssh:
raise ValueError("Invalid Git url provided.")
if is_ssh:
_clone_command_for_github_like_ssh(git_config, repo_dir, exists)
elif exists["2FA_enabled"]:
_clone_command_for_github_like_https_2fa_enabled(git_config, repo_dir, exists)
else:
_clone_command_for_github_like_https_2fa_disabled(git_config, repo_dir, exists)


def _clone_command_for_github_like_ssh(git_config, repo_dir, exists):
if exists["username"] or exists["password"] or exists["token"]:
warnings.warn("Unnecessary credential argument(s) provided.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more specific about which credentials are not used in the warning message?

Copy link
Contributor

@chuyang-deng chuyang-deng Jul 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"SSH cloning, authentication information in git config will be ignored."

_run_clone_command(git_config["repo"], repo_dir)


def _clone_command_for_github_like_https_2fa_disabled(git_config, repo_dir, exists):
updated_url = git_config["repo"]
if exists["token"]:
if exists["username"] or exists["password"]:
warnings.warn(
"Using token for authentication, "
"but unnecessary credential argument(s) provided."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

)
updated_url = _insert_token_to_repo_url(url=git_config["repo"], token=git_config["token"])
elif exists["username"] and exists["password"]:
updated_url = _insert_username_and_password_to_repo_url(
url=git_config["repo"], username=git_config["username"], password=git_config["password"]
)
elif exists["username"] or exists["password"]:
warnings.warn("Unnecessary credential argument(s) provided.")
_run_clone_command(updated_url, repo_dir)


def _clone_command_for_github_like_https_2fa_enabled(git_config, repo_dir, exists):
updated_url = git_config["repo"]
if exists["token"]:
if exists["username"] or exists["password"]:
warnings.warn(
"Using token for authentication, "
"but unnecessary credential argument(s) provided."
)
updated_url = _insert_token_to_repo_url(url=git_config["repo"], token=git_config["token"])
elif exists["username"] or exists["password"] or exists["token"]:
warnings.warn(
"Unnecessary credential argument(s) provided."
"Hint: since two factor authentication is enabled, you have to provide token."
)
_run_clone_command(updated_url, repo_dir)


def _run_clone_command(repo_url, repo_dir):
"""Run the 'git clone' command with the repo url and the directory to clone the repo into.

Args:
repo_url (str): Git repo url to be cloned.
repo_dir: (str): Local path where the repo should be cloned into.

Raises:
CalledProcessError: If failed to clone git repo.
"""
my_env = os.environ.copy()
if repo_url.startswith("https://"):
my_env["GIT_TERMINAL_PROMPT"] = "0"
elif repo_url.startswith("git@"):
f = tempfile.NamedTemporaryFile()
w = open(f.name, "w")
w.write("ssh -oBatchMode=yes $@")
w.close()
# 511 in decimal is same as 777 in octal
os.chmod(f.name, 511)
my_env["GIT_SSH"] = f.name
subprocess.check_call(["git", "clone", repo_url, repo_dir], env=my_env)


def _insert_token_to_repo_url(url, token):
"""Insert the token to the Git repo url, to make a component of the git clone command. This method can
only be called when repo_url is an https url.

Args:
url (str): Git repo url where the token should be inserted into.
token (str): Token to be inserted.

Returns:
str: the component needed fot the git clone command.
"""
index = len("https://")
return url[:index] + token + "@" + url[index:]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tip: use replace

url.replace('https://', 'https://' + token + '@')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to handle the case which the token and @ are already included in the url.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have dealt with that situation. Do we have the same concern for username and password? I am thinking maybe not because people usually won't first encode the password and then insert it into url.



def _insert_username_and_password_to_repo_url(url, username, password):
"""Insert the username and the password to the Git repo url, to make a component of the git clone command.
This method can only be called when repo_url is an https url.

Args:
url (str): Git repo url where the token should be inserted into.
username (str): Username to be inserted.
password (str): Password to be inserted.

Returns:
str: the component needed fot the git clone command.
"""
password = urllib.parse.quote_plus(password)
# urllib parses ' ' as '+', but what we need is '%20' here
password = password.replace("+", "%20")
index = len("https://")
return url[:index] + username + ":" + password + "@" + url[index:]


def _checkout_branch_and_commit(git_config, repo_dir):
"""Checkout the required branch and commit.

Args:
git_config: (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``
and ``commit``.
repo_dir (str): the directory where the repo is cloned

Expand Down
Loading