feature: deal with credentials for Git support for GitHub #914

Merged
merged 13 commits into from
Jul 11, 2019
47 changes: 40 additions & 7 deletions doc/overview.rst
@@ -185,16 +185,37 @@ Here is an example:

Git Support
-----------
If you have your training scripts in your GitHub repository, you can use them directly without the trouble to download
them to local machine. Git support can be enabled simply by providing ``git_config`` parameter when initializing an
estimator. If Git support is enabled, then ``entry_point``, ``source_dir`` and ``dependencies`` should all be relative
paths in the Git repo. Note that if you decided to use Git support, then everything you need for ``entry_point``,
``source_dir`` and ``dependencies`` should be in a single Git repo.
If your training scripts are in a GitHub (or other Git) repository, you can use them directly without
downloading them locally. Git support is enabled by providing the ``git_config`` parameter
when creating an ``Estimator`` object. If Git support is enabled, then ``entry_point``, ``source_dir`` and ``dependencies``,
if provided, should all be relative paths in the Git repo. Note that if you use Git support, then all your
training scripts should be in a single Git repo.
Contributor

What does this mean? "All" across what set? Do you mean if I've specified more than one training script for an estimator?

Contributor Author

There is only one entry point for the estimator, but the script could import other modules. If so, the other modules should be in a directory specified by 'source_dir'. I have slightly changed this expression and moved it to the second sentence in the paragraph.


Here are ways to specify ``git_config``:
The ``git_config`` parameter accepts the arguments ``repo``, ``branch``, ``commit``, ``2FA_enabled``, ``username``,
``password`` and ``token``. All arguments except ``repo`` are optional. ``repo`` specifies the Git repository
that you want to use. If ``branch`` is not provided, the ``master`` branch is used. If ``commit`` is not provided,
Contributor

repo specifies the Git repository where your training script is stored. If you don't provide branch, the default value ''master'' is used. If you don't provide commit, the latest commit in the specified branch is used.

the latest commit in the specified branch is used.
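
The defaults described above can be sketched as follows. This is only an illustration of the documented behavior, not the SDK's implementation; ``resolve_checkout`` is a hypothetical helper name, and returning ``None`` for the commit to mean "latest commit on the branch" is an assumption made for this sketch.

```python
def resolve_checkout(git_config):
    """Return (repo, branch, commit) after applying the documented defaults.

    Hypothetical helper for illustration; not part of the SageMaker Python SDK.
    """
    if "repo" not in git_config:
        # 'repo' is the only required argument.
        raise ValueError("git_config must contain 'repo'")

    # A missing 'branch' falls back to 'master'.
    branch = git_config.get("branch", "master")

    # A missing 'commit' is represented here as None, meaning
    # "use the latest commit on the chosen branch".
    commit = git_config.get("commit")

    return git_config["repo"], branch, commit


repo, branch, commit = resolve_checkout(
    {"repo": "https://github.com/username/repo-with-training-scripts.git"}
)
print(branch, commit)
```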

``2FA_enabled``, ``username``, ``password`` and ``token`` are used for authentication. ``2FA_enabled`` should
be ``True`` or ``False``, indicating whether two-factor authentication is enabled for the GitHub (or other Git) account.
If ``2FA_enabled`` is not provided, 2FA is treated as disabled.

If ``repo`` is an SSH URL, the SSH key pair should either have no passphrase, or ssh-agent should be configured
so that you are not prompted for a passphrase when you run ``git clone`` with SSH URLs. For SSH URLs, it
makes no difference whether 2FA is enabled or disabled.
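
Because the authentication rules depend on whether ``repo`` is an SSH or an HTTPS URL, a small check like the one below may help when building a ``git_config``. It is a sketch under stated assumptions: ``repo_url_kind`` is a hypothetical helper, and treating URLs that start with ``git@`` or ``ssh://`` as SSH is an assumption about common URL shapes, not something this document specifies.

```python
def repo_url_kind(repo):
    """Classify a repo URL as 'ssh', 'https' or 'other'.

    Hypothetical helper for illustration; the prefix checks are assumptions.
    """
    if repo.startswith("git@") or repo.startswith("ssh://"):
        return "ssh"
    if repo.startswith("https://"):
        return "https"
    return "other"


print(repo_url_kind("git@github.com:username/repo-with-training-scripts.git"))
print(repo_url_kind("https://github.com/username/repo-with-training-scripts.git"))
```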

If ``repo`` is an HTTPS URL, 2FA matters. When 2FA is disabled, either ``token`` or ``username``+``password`` will be
used for authentication if provided (``token`` takes priority). When 2FA is enabled, only ``token`` will be used for
authentication if provided. If the required authentication info is not provided, the Python SDK will try to use local
credential storage to authenticate. If that also fails, an error is raised.
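
The precedence rules for HTTPS URLs can be summarized in a short sketch. This is not the SDK's code; ``choose_https_auth`` is a hypothetical helper that only reports which credential the rules above would select, and requiring both ``username`` and ``password`` to be present is an assumption of this sketch.

```python
def choose_https_auth(git_config):
    """Report which credential the documented rules would pick for an HTTPS repo.

    Hypothetical helper for illustration; not part of the SageMaker Python SDK.
    """
    # A missing '2FA_enabled' is treated as 2FA being disabled.
    two_fa_enabled = git_config.get("2FA_enabled", False)
    token = git_config.get("token")
    username = git_config.get("username")
    password = git_config.get("password")

    # A token is used whenever it is provided, with or without 2FA.
    if token:
        return "token"

    # Username and password are only considered when 2FA is disabled.
    if not two_fa_enabled and username and password:
        return "username+password"

    # Otherwise fall back to whatever credentials are stored locally.
    return "local credential storage"


print(choose_https_auth({"repo": "https://github.com/username/repo-with-training-scripts.git",
                         "2FA_enabled": True,
                         "token": "your-token"}))
```

Note that with 2FA enabled, a ``username`` and ``password`` in ``git_config`` are ignored in this sketch, which matches the statement that only the token is used in that case.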

Here are some ways to specify ``git_config``:
Contributor

Please ask @eslesar-aws to review the writing.


.. code:: python

# The following three examples do not provide Git credentials, so python SDK will try to use
# local credential storage.

# Specifies the git_config parameter
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
'branch': 'branch1',
@@ -209,6 +230,17 @@ Here are ways to specify ``git_config``:
# 'master' branch will be used.
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git'}

# This example does not provide '2FA_enabled', so 2FA is treated as disabled by default. 'username' and
# 'password' are provided for authentication
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
'username': 'username',
'password': 'passw0rd!'}

# This example specifies that 2FA is enabled, and token is provided for authentication
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
'2FA_enabled': True,
'token': 'your-token'}

The following are some examples to define estimators with Git support:

.. code:: python
@@ -240,7 +272,8 @@ The following are some examples to define estimators with Git support:
train_instance_count=1,
train_instance_type='ml.c4.xlarge')

When Git support is enabled, users can still use local mode in the same way.
Git support can be used not only for training jobs, but also for hosting models. The usage is the same as above,
and ``git_config`` should be provided when creating the ``FrameworkModel`` object.

Training Metrics
----------------
21 changes: 16 additions & 5 deletions src/sagemaker/estimator.py
@@ -20,10 +20,10 @@
from abc import abstractmethod
from six import with_metaclass
from six import string_types

import sagemaker
from sagemaker import git_utils
from sagemaker.analytics import TrainingJobAnalytics

from sagemaker.fw_utils import (
create_image_uri,
tar_and_upload_dir,
@@ -976,10 +976,12 @@ def __init__(
>>> |----- test.py

You can assign entry_point='src/train.py'.
git_config (dict[str, str]): Git configurations used for cloning files, including 'repo', 'branch'
and 'commit' (default: None).
'branch' and 'commit' are optional. If 'branch' is not specified, 'master' branch will be used. If
'commit' is not specified, the latest commit in the required branch will be used.
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``,
``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token`` (default: None). All fields
except ``repo`` are optional. If ``branch`` is not specified, the ``master`` branch is used. If ``commit``
is not specified, the latest commit in the specified branch is used.
Example:

The following config:
@@ -990,6 +992,15 @@ def __init__(

results in cloning the repo specified in 'repo', then checking out the 'master' branch, and then
checking out the specified commit.
``2FA_enabled``, ``username``, ``password`` and ``token`` are used for authentication.
``2FA_enabled`` must be ``True`` or ``False`` if it is provided. If ``2FA_enabled`` is not provided,
2FA is treated as disabled. For GitHub and other Git repos, when SSH URLs are provided, it does not
make a difference whether 2FA is enabled or disabled; an SSH passphrase should be in local storage.
When HTTPS URLs are provided: if 2FA is disabled, then either token or username+password will
be used for authentication if provided (token takes priority); if 2FA is enabled, only token will
be used for authentication if provided. If the required authentication info is not provided, the Python SDK
will try to use local credential storage to authenticate. If that also fails, an error is
raised.
source_dir (str): Path (absolute or relative) to a directory with any other training
source code dependencies aside from the entry point file (default: None). Structure within this
directory are preserved when training on Amazon SageMaker. If 'git_config' is provided,