feature: deal with credentials for Git support for GitHub #914

Merged
merged 13 commits into from
Jul 11, 2019
85 changes: 61 additions & 24 deletions doc/overview.rst
@@ -183,45 +183,57 @@ Here is an example:
# When you are done using your endpoint
algo.delete_endpoint()

Git Support
-----------
If you have your training scripts in your GitHub repository, you can use them directly without the trouble to download
them to local machine. Git support can be enabled simply by providing ``git_config`` parameter when initializing an
estimator. If Git support is enabled, then ``entry_point``, ``source_dir`` and ``dependencies`` should all be relative
paths in the Git repo. Note that if you decided to use Git support, then everything you need for ``entry_point``,
``source_dir`` and ``dependencies`` should be in a single Git repo.
Use Scripts Stored in a Git Repository
--------------------------------------
When you create an estimator, you can specify a training script that is stored in a GitHub or other Git repository as the entry point for the estimator, so that you don't have to download the scripts locally.
If you do so, the source directory and dependencies, if needed, should be in the same repo. To enable Git support, provide the ``git_config`` parameter
when creating an ``Estimator`` object. When Git support is enabled, ``entry_point``, ``source_dir`` and ``dependencies``,
if provided, should be relative paths in the Git repo.

Here are ways to specify ``git_config``:
The ``git_config`` parameter includes fields ``repo``, ``branch``, ``commit``, ``2FA_enabled``, ``username``,
``password`` and ``token``. The ``repo`` field is required. All other fields are optional. ``repo`` specifies the Git
repository where your training script is stored. If you don't provide ``branch``, the default value 'master' is used.
If you don't provide ``commit``, the latest commit in the specified branch is used.
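
As a rough illustration of the defaults described above, the following sketch fills in the missing fields of a ``git_config`` dictionary. The helper ``with_defaults`` is hypothetical and is not part of the SageMaker Python SDK; it only mirrors the rules stated in this section (``repo`` is required, ``branch`` falls back to 'master', and a missing ``commit`` means the latest commit in the branch, represented here as ``None``).

```python
def with_defaults(git_config):
    """Return a copy of git_config with the documented defaults applied.

    Hypothetical helper for illustration only; not an SDK function.
    """
    if "repo" not in git_config:
        raise ValueError("git_config must contain the 'repo' field")
    resolved = dict(git_config)
    # 'master' is the documented default branch.
    resolved.setdefault("branch", "master")
    # None stands in for "use the latest commit in the branch" (an assumption
    # of this sketch, not necessarily how the SDK represents it).
    resolved.setdefault("commit", None)
    return resolved


example = with_defaults(
    {"repo": "https://github.com/username/repo-with-training-scripts.git"}
)
print(example["branch"], example["commit"])
```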

.. code:: python
``2FA_enabled``, ``username``, ``password`` and ``token`` are used for authentication. Set ``2FA_enabled`` to ``True`` if
two-factor authentication is enabled for the GitHub (or other Git) account, otherwise set it to ``False``.
If you do not provide a value for ``2FA_enabled``, a default value of ``False`` is used.

# Specifies the git_config parameter
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
'branch': 'branch1',
'commit': '4893e528afa4a790331e1b5286954f073b0f14a2'}

# Alternatively, you can also specify git_config by providing only 'repo' and 'branch'.
# If this is the case, the latest commit in the branch will be used.
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
'branch': 'branch1'}
If ``repo`` is an SSH URL, you should either have no passphrase for the SSH key pairs, or have the ``ssh-agent`` configured
so that you are not prompted for the SSH passphrase when you run a ``git clone`` command with SSH URLs. For SSH URLs, it
does not matter whether two-factor authentication is enabled.

# Only providing 'repo' is also allowed. If this is the case, latest commit in
# 'master' branch will be used.
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git'}
If ``repo`` is an HTTPS URL, 2FA matters. When 2FA is disabled, either ``token`` or ``username`` and ``password`` are
used for authentication if provided (``token`` takes priority). When 2FA is enabled, only ``token`` is used for
authentication if provided. If the required authentication information is not provided, the Python SDK tries to use local
credential storage to authenticate. If that also fails, an error is raised.
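
The selection rules for HTTPS URLs can be sketched as a small decision function. This is a minimal sketch of the behavior as described in this section, not the SDK's actual implementation; the function name ``choose_https_auth`` and its return strings are assumptions made for illustration.

```python
def choose_https_auth(git_config):
    """Return which credential source the documented rules would select.

    Hypothetical helper for illustration only; not an SDK function.
    """
    # 2FA is treated as disabled when '2FA_enabled' is not provided.
    two_factor = git_config.get("2FA_enabled", False)
    # A token is usable with or without 2FA and takes priority.
    if "token" in git_config:
        return "token"
    # Username and password are only considered when 2FA is disabled.
    if not two_factor and "username" in git_config and "password" in git_config:
        return "username+password"
    # Otherwise fall back to whatever Git has stored locally.
    return "local credential storage"


print(choose_https_auth({"repo": "https://github.com/username/repo.git",
                         "2FA_enabled": True,
                         "token": "your-token"}))
```

With 2FA enabled and only ``username`` and ``password`` supplied, the sketch falls through to local credential storage, which matches the statement that only the token is used in that case.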

The following are some examples to define estimators with Git support:
Here are some examples of creating estimators with Git support:

.. code:: python

# Specifies the git_config parameter. This example does not provide Git credentials, so python SDK will try
# to use local credential storage.
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
'branch': 'branch1',
'commit': '4893e528afa4a790331e1b5286954f073b0f14a2'}

# In this example, the source directory 'pytorch' contains the entry point 'mnist.py' and other source code.
# and it is relative path inside the Git repo.
pytorch_estimator = PyTorch(entry_point='mnist.py',
role='SageMakerRole',
source_dir='pytorch',
git_config=git_config,
train_instance_count=1,
train_instance_type='ml.c4.xlarge')

.. code:: python

# You can also specify git_config by providing only 'repo' and 'branch'.
# If this is the case, the latest commit in that branch will be used.
git_config = {'repo': 'git@github.com:username/repo-with-training-scripts.git',
'branch': 'branch1'}

# In this example, the entry point 'mnist.py' is all we need for source code.
# We need to specify the path to it in the Git repo.
mx_estimator = MXNet(entry_point='mxnet/mnist.py',
@@ -230,6 +242,15 @@ The following are some examples to define estimators with Git support:
train_instance_count=1,
train_instance_type='ml.c4.xlarge')

.. code:: python

# Only providing 'repo' is also allowed. If this is the case, latest commit in 'master' branch will be used.
# This example does not provide '2FA_enabled', so 2FA is treated as disabled by default. 'username' and
# 'password' are provided for authentication
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
'username': 'username',
'password': 'passw0rd!'}

# In this example, besides entry point and other source code in source directory, we still need some
# dependencies for the training job. Dependencies should also be paths inside the Git repo.
pytorch_estimator = PyTorch(entry_point='mnist.py',
@@ -240,7 +261,23 @@ The following are some examples to define estimators with Git support:
train_instance_count=1,
train_instance_type='ml.c4.xlarge')

When Git support is enabled, users can still use local mode in the same way.
.. code:: python

# This example specifies that 2FA is enabled, and token is provided for authentication
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
'2FA_enabled': True,
'token': 'your-token'}

# In this example, besides entry point, we also need some dependencies for the training job.
pytorch_estimator = PyTorch(entry_point='pytorch/mnist.py',
role='SageMakerRole',
dependencies=['dep.py'],
git_config=git_config,
train_instance_count=1,
train_instance_type='local')

Git support can be used not only for training jobs, but also for hosting models. The usage is the same as the above,
and ``git_config`` should be provided when creating model objects, e.g. ``TensorFlowModel``, ``MXNetModel``, ``PyTorchModel``.

Training Metrics
----------------
21 changes: 16 additions & 5 deletions src/sagemaker/estimator.py
@@ -20,10 +20,10 @@
from abc import abstractmethod
from six import with_metaclass
from six import string_types

import sagemaker
from sagemaker import git_utils
from sagemaker.analytics import TrainingJobAnalytics

from sagemaker.fw_utils import (
create_image_uri,
tar_and_upload_dir,
@@ -976,10 +976,12 @@ def __init__(
>>> |----- test.py

You can assign entry_point='src/train.py'.
git_config (dict[str, str]): Git configurations used for cloning files, including 'repo', 'branch'
and 'commit' (default: None).
'branch' and 'commit' are optional. If 'branch' is not specified, 'master' branch will be used. If
'commit' is not specified, the latest commit in the required branch will be used.
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``,
``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token`` (default: None). The fields are
optional except ``repo``. If ``branch`` is not specified, master branch will be used. If ``commit``
is not specified, the latest commit in the required branch will be used. 'branch' and 'commit' are
optional. If 'branch' is not specified, 'master' branch will be used. If 'commit' is not specified,
the latest commit in the required branch will be used.
Example:

The following config:
@@ -990,6 +992,15 @@ def __init__(

results in cloning the repo specified in 'repo', then checkout the 'master' branch, and checkout
the specified commit.
``2FA_enabled``, ``username``, ``password`` and ``token`` are for authentication purpose.
``2FA_enabled`` must be ``True`` or ``False`` if it is provided. If ``2FA_enabled`` is not provided,
we consider 2FA as disabled. For GitHub and other Git repos, when ssh urls are provided, it does not
make a difference whether 2FA is enabled or disabled; an ssh passphrase should be in local storage.
When https urls are provided: if 2FA is disabled, then either token or username+password will
be used for authentication if provided (token prioritized); if 2FA is enabled, only token will
be used for authentication if provided. If required authentication info is not provided, python SDK
will try to use local credentials storage to authenticate. If that fails either, an error message will
be thrown.
source_dir (str): Path (absolute or relative) to a directory with any other training
source code dependencies aside from the entry point file (default: None). Structure within this
directory are preserved when training on Amazon SageMaker. If 'git_config' is provided,