Commit 1d5ace6

author
Yue Tu
committed
add functions, tests and docs for github creds
1 parent 5d8f2b7 commit 1d5ace6

8 files changed

+610
-106
lines changed

doc/overview.rst

Lines changed: 38 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -185,16 +185,37 @@ Here is an example:
185185
186186
Git Support
187187
-----------
188-
If you have your training scripts in your GitHub repository, you can use them directly without the trouble to download
189-
them to local machine. Git support can be enabled simply by providing ``git_config`` parameter when initializing an
190-
estimator. If Git support is enabled, then ``entry_point``, ``source_dir`` and ``dependencies`` should all be relative
191-
paths in the Git repo. Note that if you decided to use Git support, then everything you need for ``entry_point``,
192-
``source_dir`` and ``dependencies`` should be in a single Git repo.
188+
If your training scripts are in a GitHub (or GitHub-like) repository, you can use them directly without first
189+
downloading them to your local machine. To enable Git support, provide the ``git_config`` parameter when
190+
initializing an estimator. When Git support is enabled, ``entry_point``, ``source_dir`` and ``dependencies``
191+
must all be relative paths in the Git repo. Everything you need for ``entry_point``, ``source_dir`` and
192+
``dependencies`` must therefore be in a single Git repo.
193193

194-
Here are ways to specify ``git_config``:
194+
The ``git_config`` parameter includes arguments ``repo``, ``branch``, ``commit``, ``2FA_enabled``, ``username``,
195+
``password`` and ``token``. Except for ``repo``, the other arguments are optional. ``repo`` specifies the Git repository
196+
that you want to use. If ``branch`` is not provided, master branch will be used. If ``commit`` is not provided,
197+
the latest commit in the required branch will be used.
198+
199+
``2FA_enabled``, ``username``, ``password`` and ``token`` are used for authentication. ``2FA_enabled`` must be
200+
``True`` or ``False`` and indicates whether two-factor authentication is enabled for the GitHub (or GitHub-like) account.
201+
If ``2FA_enabled`` is not provided, 2FA is treated as disabled.
202+
203+
If ``repo`` is an SSH URL, the SSH key pair must either have no passphrase, or ssh-agent must be configured so that
204+
``git clone`` does not prompt for a passphrase. For SSH URLs, it makes no difference whether 2FA is enabled or
205+
disabled.
206+
207+
If ``repo`` is an HTTPS URL, 2FA matters. When 2FA is disabled, either ``token`` or ``username`` plus ``password`` is
208+
used for authentication if provided (``token`` takes priority). When 2FA is enabled, only ``token`` is used for
209+
authentication if provided. If the required authentication info is not provided, the Python SDK tries to use local
210+
credential storage to authenticate. If that also fails, an error is raised.
211+
212+
Here are some ways to specify ``git_config``:
195213

196214
.. code:: python
197215
216+
# The following three examples do not provide Git credentials, so the Python SDK will try to use
217+
# local credential storage.
218+
198219
# Specifies the git_config parameter
199220
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
200221
'branch': 'branch1',
@@ -209,6 +230,17 @@ Here are ways to specify ``git_config``:
209230
# 'master' branch will be used.
210231
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git'}
211232
233+
# This example does not provide '2FA_enabled', so 2FA is treated as disabled by default. 'username' and
234+
# 'password' are provided for authentication
235+
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
236+
'username': 'username',
237+
'password': 'passw0rd!'}
238+
239+
# This example specifies that 2FA is enabled, and token is provided for authentication
240+
git_config = {'repo': 'https://github.com/username/repo-with-training-scripts.git',
241+
'2FA_enabled': True,
242+
'token': 'your-token'}
243+
212244
The following are some examples to define estimators with Git support:
213245

214246
.. code:: python
@@ -240,7 +272,6 @@ The following are some examples to define estimators with Git support:
240272
train_instance_count=1,
241273
train_instance_type='ml.c4.xlarge')
242274
243-
When Git support is enabled, users can still use local mode in the same way.
244275
245276
Training Metrics
246277
----------------
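
The credential rules described in the Git Support text above can be summarized as a small decision function. This is only a sketch for illustration: `select_auth_method` and its return labels are hypothetical names, not part of the SDK, and the function only reports which credential source the documented rules would pick without cloning anything.

```python
def select_auth_method(git_config):
    """Return which credential source the documented rules would pick.

    Hypothetical helper for illustration only; it is not part of the SDK.
    """
    repo = git_config["repo"]
    if repo.startswith("git@"):
        # SSH URLs rely on the local key or ssh-agent; 2FA is irrelevant.
        return "ssh"
    if not repo.startswith("https://"):
        raise ValueError("Invalid Git url provided.")
    if "token" in git_config:
        # A token takes priority whether or not 2FA is enabled.
        return "token"
    two_factor = git_config.get("2FA_enabled", False) is True
    if not two_factor and "username" in git_config and "password" in git_config:
        # Username and password are only usable when 2FA is disabled.
        return "username_password"
    # Nothing usable was supplied: fall back to local credential storage.
    return "local_credentials"
```

Under these assumptions, a config with `'2FA_enabled': True` and only `username` and `password` falls through to local credential storage, which matches the documented statement that only a token is used when 2FA is enabled.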

src/sagemaker/estimator.py

Lines changed: 15 additions & 5 deletions
@@ -20,10 +20,10 @@
2020
from abc import abstractmethod
2121
from six import with_metaclass
2222
from six import string_types
23-
2423
import sagemaker
2524
from sagemaker import git_utils
2625
from sagemaker.analytics import TrainingJobAnalytics
26+
2727
from sagemaker.fw_utils import (
2828
create_image_uri,
2929
tar_and_upload_dir,
@@ -976,10 +976,12 @@ def __init__(
976976
>>> |----- test.py
977977
978978
You can assign entry_point='src/train.py'.
979-
git_config (dict[str, str]): Git configurations used for cloning files, including 'repo', 'branch'
980-
and 'commit' (default: None).
981-
'branch' and 'commit' are optional. If 'branch' is not specified, 'master' branch will be used. If
982-
'commit' is not specified, the latest commit in the required branch will be used.
979+
git_config (dict): Git configurations used for cloning files, including ``repo``,
980+
``branch``, ``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token``
981+
(default: None). All fields are optional except ``repo``. If ``branch`` is not
982+
specified, the ``master`` branch is used. If ``commit`` is not specified, the latest
983+
commit in the required branch is used. All values are strings except
984+
``2FA_enabled``, which must be a bool.
983985
Example:
984986
985987
The following config:
@@ -990,6 +992,14 @@ def __init__(
990992
991993
results in cloning the repo specified in 'repo', then checkout the 'master' branch, and checkout
992994
the specified commit.
995+
``2FA_enabled``, ``username``, ``password`` and ``token`` are used for authentication. If
996+
``2FA_enabled`` is not provided, 2FA is treated as disabled. For GitHub and GitHub-like repos, when
997+
an SSH URL is provided, it makes no difference whether 2FA is enabled or disabled; the SSH key must
998+
be usable without a passphrase prompt. When an HTTPS URL is provided: if 2FA is disabled, either
999+
token or username+password is used for authentication if provided (token takes priority); if 2FA is
1000+
enabled, only token is used for authentication if provided. If the required authentication info is
1001+
not provided, the Python SDK tries to use local credential storage to authenticate. If that also
1002+
fails, an error is raised.
9931003
source_dir (str): Path (absolute or relative) to a directory with any other training
9941004
source code dependencies aside from the entry point file (default: None). Structure within this
9951005
directory are preserved when training on Amazon SageMaker. If 'git_config' is provided,
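
The `git_config` contract described in this docstring (only `repo` is required, `2FA_enabled` is a bool, everything else is a string, no other keys are accepted) can be checked up front. The following is a minimal sketch of such a check, assuming Python 3 (plain `str` rather than `six.string_types`); `validate_git_config` and the two constants are illustrative names, and unlike the commit's `_validate_git_config` this version lists the offending keys in the error message.

```python
STRING_KEYS = ("repo", "branch", "commit", "username", "password", "token")
ALLOWED_KEYS = set(STRING_KEYS) | {"2FA_enabled"}


def validate_git_config(git_config):
    """Raise ValueError if git_config breaks the documented contract.

    Illustrative sketch only; names and messages are assumptions.
    """
    if "repo" not in git_config:
        raise ValueError("Please provide a repo for git_config.")
    unexpected = sorted(set(git_config) - ALLOWED_KEYS)
    if unexpected:
        raise ValueError("Unexpected git_config argument(s): {}".format(unexpected))
    for key in STRING_KEYS:
        if key in git_config and not isinstance(git_config[key], str):
            raise ValueError("'{}' must be a string.".format(key))
    if "2FA_enabled" in git_config and not isinstance(git_config["2FA_enabled"], bool):
        raise ValueError("'2FA_enabled' must be a bool value.")
```

Running the check before any `git clone` call means a malformed config fails quickly, without touching the network or creating a temporary directory.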

src/sagemaker/git_utils.py

Lines changed: 176 additions & 29 deletions
@@ -16,16 +16,27 @@
1616
import six
1717
import subprocess
1818
import tempfile
19+
import warnings
20+
from six.moves import urllib
1921

2022

2123
def git_clone_repo(git_config, entry_point, source_dir=None, dependencies=None):
2224
"""Git clone repo containing the training code and serving code. This method also validate ``git_config``,
2325
and set ``entry_point``, ``source_dir`` and ``dependencies`` to the right file or directory in the repo cloned.
2426
2527
Args:
26-
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``
27-
and ``commit``. ``branch`` and ``commit`` are optional. If ``branch`` is not specified, master branch
28-
will be used. If ``commit`` is not specified, the latest commit in the required branch will be used.
28+
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``,
29+
``commit``, ``2FA_enabled``, ``username``, ``password`` and ``token``. The fields are optional except
30+
``repo``. If ``branch`` is not specified, master branch will be used. If ``commit`` is not specified,
31+
the latest commit in the required branch will be used. ``2FA_enabled``, ``username``, ``password`` and
32+
``token`` are for authentication purpose.
33+
If ``2FA_enabled`` is not provided, 2FA is treated as disabled. For GitHub and GitHub-like repos, when
34+
an SSH URL is provided, it makes no difference whether 2FA is enabled or disabled; the SSH key must be
35+
usable without a passphrase prompt. When an HTTPS URL is provided: if 2FA is disabled, either token or
36+
username+password is used for authentication if provided (token takes priority); if 2FA is enabled,
37+
only token is used for authentication if provided. If the required authentication info is not provided,
38+
the Python SDK tries to use local credential storage to authenticate. If that also fails, an error
39+
is raised.
2940
entry_point (str): A relative location to the Python source file which should be executed as the entry point
3041
to training or model hosting in the Git repo.
3142
source_dir (str): A relative location to a directory with other training or model hosting source code
@@ -41,16 +52,14 @@ def git_clone_repo(git_config, entry_point, source_dir=None, dependencies=None):
4152
ValueError: If 1. entry point specified does not exist in the repo
4253
2. source dir specified does not exist in the repo
4354
3. dependencies specified do not exist in the repo
44-
4. git_config is in bad format
55+
4. wrong format is provided for git_config
4556
4657
Returns:
47-
dict: A dict that contains the updated values of entry_point, source_dir and dependencies
58+
dict: A dict that contains the updated values of entry_point, source_dir and dependencies.
4859
"""
49-
if entry_point is None:
50-
raise ValueError("Please provide an entry point.")
5160
_validate_git_config(git_config)
5261
repo_dir = tempfile.mkdtemp()
53-
subprocess.check_call(["git", "clone", git_config["repo"], repo_dir])
62+
_generate_and_run_clone_command(git_config, repo_dir)
5463

5564
_checkout_branch_and_commit(git_config, repo_dir)
5665

@@ -72,44 +81,182 @@ def git_clone_repo(git_config, entry_point, source_dir=None, dependencies=None):
7281
updated_paths["entry_point"] = os.path.join(repo_dir, entry_point)
7382
else:
7483
raise ValueError("Entry point does not exist in the repo.")
75-
76-
updated_paths["dependencies"] = []
77-
for path in dependencies:
78-
if os.path.exists(os.path.join(repo_dir, path)):
79-
updated_paths["dependencies"].append(os.path.join(repo_dir, path))
80-
else:
81-
raise ValueError("Dependency {} does not exist in the repo.".format(path))
84+
if dependencies is None:
85+
updated_paths["dependencies"] = None
86+
else:
87+
updated_paths["dependencies"] = []
88+
for path in dependencies:
89+
if os.path.exists(os.path.join(repo_dir, path)):
90+
updated_paths["dependencies"].append(os.path.join(repo_dir, path))
91+
else:
92+
raise ValueError("Dependency {} does not exist in the repo.".format(path))
8293
return updated_paths
8394

8495

8596
def _validate_git_config(git_config):
86-
"""check if a git_config param is valid
97+
if "repo" not in git_config:
98+
raise ValueError("Please provide a repo for git_config.")
99+
string_args = ["repo", "branch", "commit", "username", "password", "token"]
100+
for key in string_args:
101+
if key in git_config and not isinstance(git_config[key], six.string_types):
102+
raise ValueError("'{}' must be a string.".format(key))
103+
if "2FA_enabled" in git_config and not isinstance(git_config["2FA_enabled"], bool):
104+
raise ValueError("'2FA_enabled' must be a bool value.")
105+
allowed_keys = ["repo", "branch", "commit", "2FA_enabled", "username", "password", "token"]
106+
for k in list(git_config):
107+
if k not in allowed_keys:
108+
raise ValueError("Unexpected git_config argument(s) provided!")
109+
110+
111+
def _generate_and_run_clone_command(git_config, repo_dir):
112+
"""check if a git_config param is valid, if it is, create the command to git clone the repo, and run it.
87113
88114
Args:
89115
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``
90116
and ``commit``.
117+
repo_dir (str): The local directory to clone the Git repo into.
91118
92119
Raises:
93-
ValueError: If:
94-
1. git_config has no key 'repo'
95-
2. git_config['repo'] is in the wrong format.
120+
CalledProcessError: If failed to clone git repo.
96121
"""
97-
if "repo" not in git_config:
98-
raise ValueError("Please provide a repo for git_config.")
99-
allowed_keys = ["repo", "branch", "commit"]
100-
for key in allowed_keys:
101-
if key in git_config and not isinstance(git_config[key], six.string_types):
102-
raise ValueError("'{}' should be a string".format(key))
103-
for key in git_config:
104-
if key not in allowed_keys:
105-
raise ValueError("Unexpected argument(s) provided for git_config!")
122+
exists = {
123+
"2FA_enabled": "2FA_enabled" in git_config and git_config["2FA_enabled"] is True,
124+
"username": "username" in git_config,
125+
"password": "password" in git_config,
126+
"token": "token" in git_config,
127+
}
128+
_clone_command_for_github_like(git_config, repo_dir, exists)
129+
130+
131+
def _clone_command_for_github_like(git_config, repo_dir, exists):
132+
"""check if a git_config param representing a GitHub (or like) repo is valid, if it is, create the command to
133+
git clone the repo, and run it.
134+
135+
Args:
136+
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``
137+
and ``commit``.
138+
repo_dir (str): The local directory to clone the Git repo into.
139+
140+
Raises:
141+
ValueError: If git_config['repo'] is in the wrong format.
142+
CalledProcessError: If failed to clone git repo.
143+
"""
144+
is_https = git_config["repo"].startswith("https://")
145+
is_ssh = git_config["repo"].startswith("git@")
146+
if not is_https and not is_ssh:
147+
raise ValueError("Invalid Git url provided.")
148+
if is_ssh:
149+
_clone_command_for_github_like_ssh(git_config, repo_dir, exists)
150+
elif exists["2FA_enabled"]:
151+
_clone_command_for_github_like_https_2fa_enabled(git_config, repo_dir, exists)
152+
else:
153+
_clone_command_for_github_like_https_2fa_disabled(git_config, repo_dir, exists)
154+
155+
156+
def _clone_command_for_github_like_ssh(git_config, repo_dir, exists):
157+
if exists["username"] or exists["password"] or exists["token"]:
158+
warnings.warn("Unnecessary credential argument(s) provided.")
159+
_run_clone_command(git_config["repo"], repo_dir)
160+
161+
162+
def _clone_command_for_github_like_https_2fa_disabled(git_config, repo_dir, exists):
163+
updated_url = git_config["repo"]
164+
if exists["token"]:
165+
if exists["username"] or exists["password"]:
166+
warnings.warn(
167+
"Using token for authentication, "
168+
"but unnecessary credential argument(s) provided."
169+
)
170+
updated_url = _insert_token_to_repo_url(url=git_config["repo"], token=git_config["token"])
171+
elif exists["username"] and exists["password"]:
172+
updated_url = _insert_username_and_password_to_repo_url(
173+
url=git_config["repo"], username=git_config["username"], password=git_config["password"]
174+
)
175+
elif exists["username"] or exists["password"]:
176+
warnings.warn("Unnecessary credential argument(s) provided.")
177+
_run_clone_command(updated_url, repo_dir)
178+
179+
180+
def _clone_command_for_github_like_https_2fa_enabled(git_config, repo_dir, exists):
181+
updated_url = git_config["repo"]
182+
if exists["token"]:
183+
if exists["username"] or exists["password"]:
184+
warnings.warn(
185+
"Using token for authentication, "
186+
"but unnecessary credential argument(s) provided."
187+
)
188+
updated_url = _insert_token_to_repo_url(url=git_config["repo"], token=git_config["token"])
189+
elif exists["username"] or exists["password"] or exists["token"]:
190+
warnings.warn(
191+
"Unnecessary credential argument(s) provided. "
192+
"Hint: since two factor authentication is enabled, you have to provide token."
193+
)
194+
_run_clone_command(updated_url, repo_dir)
195+
196+
197+
def _run_clone_command(repo_url, repo_dir):
198+
"""Run the 'git clone' command with the repo url and the directory to clone the repo into.
199+
200+
Args:
201+
repo_url (str): Git repo url to be cloned.
202+
repo_dir (str): Local path where the repo should be cloned into.
203+
204+
Raises:
205+
CalledProcessError: If failed to clone git repo.
206+
"""
207+
my_env = os.environ.copy()
208+
if repo_url.startswith("https://"):
209+
my_env["GIT_TERMINAL_PROMPT"] = "0"
210+
elif repo_url.startswith("git@"):
211+
f = tempfile.NamedTemporaryFile()
212+
w = open(f.name, "w")
213+
w.write("ssh -oBatchMode=yes $@")
214+
w.close()
215+
# 511 in decimal is same as 777 in octal
216+
os.chmod(f.name, 511)
217+
my_env["GIT_SSH"] = f.name
218+
subprocess.check_call(["git", "clone", repo_url, repo_dir], env=my_env)
219+
220+
221+
def _insert_token_to_repo_url(url, token):
222+
"""Insert the token to the Git repo url, to make a component of the git clone command. This method can
223+
only be called when repo_url is an https url.
224+
225+
Args:
226+
url (str): Git repo url where the token should be inserted into.
227+
token (str): Token to be inserted.
228+
229+
Returns:
230+
str: the component needed for the git clone command.
231+
"""
232+
index = len("https://")
233+
return url[:index] + token + "@" + url[index:]
234+
235+
236+
def _insert_username_and_password_to_repo_url(url, username, password):
237+
"""Insert the username and the password to the Git repo url, to make a component of the git clone command.
238+
This method can only be called when repo_url is an https url.
239+
240+
Args:
241+
url (str): Git repo url where the username and password should be inserted into.
242+
username (str): Username to be inserted.
243+
password (str): Password to be inserted.
244+
245+
Returns:
246+
str: the component needed for the git clone command.
247+
"""
248+
password = urllib.parse.quote_plus(password)
249+
# quote_plus encodes ' ' as '+', but what we need here is '%20'
250+
password = password.replace("+", "%20")
251+
index = len("https://")
252+
return url[:index] + username + ":" + password + "@" + url[index:]
106253

107254

108255
def _checkout_branch_and_commit(git_config, repo_dir):
109256
"""Checkout the required branch and commit.
110257
111258
Args:
112-
git_config: (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``
259+
git_config (dict[str, str]): Git configurations used for cloning files, including ``repo``, ``branch``
113260
and ``commit``.
114261
repo_dir (str): the directory where the repo is cloned
115262
