
Commit 90fd4a4

Merge branch 'dev' into remote_docker_host
2 parents 6f6f399 + d610bfb commit 90fd4a4

28 files changed (+3472, -3090 lines)

CHANGELOG.md

Lines changed: 6 additions & 0 deletions

@@ -1,5 +1,11 @@
 # Changelog
 
+## v2.77.1 (2022-02-25)
+
+### Bug Fixes and Other Changes
+
+ * jumpstart model table
+
 ## v2.77.0 (2022-02-22)
 
 ### Features

VERSION

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-2.77.1.dev0
+2.77.2.dev0

doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst

Lines changed: 37 additions & 18 deletions

@@ -243,16 +243,25 @@ TensorFlow API
 
 .. function:: smdistributed.dataparallel.tensorflow.allreduce(tensor, param_index, num_params, compression=Compression.none, op=ReduceOp.AVERAGE)
 
-   Performs an all-reduce operation on a tensor (``tf.Tensor``).
+   Performs an ``allreduce`` operation on a tensor (``tf.Tensor``).
+
+   The ``smdistributed.dataparallel`` package's AllReduce API for TensorFlow to allreduce
+   gradient tensors. By default, ``smdistributed.dataparallel`` allreduce averages the
+   gradient tensors across participating workers.
+
+   .. note::
+
+      :class:`smdistributed.dataparallel.tensorflow.allreduce()` should
+      only be used to allreduce gradient tensors.
+      For other (non-gradient) tensors, you must use
+      :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`.
+      If you use :class:`smdistributed.dataparallel.tensorflow.allreduce()`
+      for non-gradient tensors,
+      the distributed training job might stall or stop.
 
-   ``smdistributed.dataparallel`` AllReduce API can be used for all
-   reducing gradient tensors or any other tensors. By
-   default, ``smdistributed.dataparallel`` AllReduce averages the
-   tensors across the participating workers.
-
    **Inputs:**
 
-   - ``tensor (tf.Tensor)(required)``: The tensor to be all-reduced. The shape of the input must be identical across all ranks.
+   - ``tensor (tf.Tensor)(required)``: The tensor to be allreduced. The shape of the input must be identical across all ranks.
    - ``param_index (int)(required):`` 0 if you are reducing a single tensor. Index of the tensor if you are reducing a list of tensors.
    - ``num_params (int)(required):`` len(tensor).
    - ``compression (smdistributed.dataparallel.tensorflow.Compression)(optional)``: Compression algorithm used to reduce the amount of data sent and received by each worker node. Defaults to not using compression.
@@ -306,9 +315,9 @@ TensorFlow API
 
 .. function:: smdistributed.dataparallel.tensorflow.oob_allreduce(tensor, compression=Compression.none, op=ReduceOp.AVERAGE)
 
-   OutOfBand (oob) AllReduce is simplified AllReduce function for use cases
+   Out-of-band (oob) AllReduce is simplified AllReduce function for use-cases
    such as calculating total loss across all the GPUs in the training.
-   oob_allreduce average the tensors, as reduction operation, across the
+   ``oob_allreduce`` average the tensors, as reduction operation, across the
    worker nodes.
 
    **Inputs:**
@@ -326,15 +335,25 @@ TensorFlow API
 
    - ``None``
 
-   .. rubric:: Notes
-
-   ``smdistributed.dataparallel.tensorflow.oob_allreduce``, in most
-   cases, is ~2x slower
-   than ``smdistributed.dataparallel.tensorflow.allreduce``  so it is not
-   recommended to be used for performing gradient reduction during the
-   training
-   process. ``smdistributed.dataparallel.tensorflow.oob_allreduce`` internally
-   uses NCCL AllReduce with ``ncclSum`` as the reduction operation.
+   .. note::
+
+      In most cases, the :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`
+      function is ~2x slower
+      than :class:`smdistributed.dataparallel.tensorflow.allreduce()`. It is not
+      recommended to use the :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`
+      function for performing gradient
+      reduction during the training process.
+      ``smdistributed.dataparallel.tensorflow.oob_allreduce`` internally
+      uses NCCL AllReduce with ``ncclSum`` as the reduction operation.
+
+   .. note::
+
+      :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()` should
+      only be used to allreduce non-gradient tensors.
+      If you use :class:`smdistributed.dataparallel.tensorflow.allreduce()`
+      for non-gradient tensors,
+      the distributed training job might stall or stop.
+      To allreduce gradients, use :class:`smdistributed.dataparallel.tensorflow.allreduce()`.
 
 
 .. function:: smdistributed.dataparallel.tensorflow.overlap(tensor)
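The rewritten docstrings above draw a hard line between ``allreduce`` (gradient tensors only) and ``oob_allreduce`` (everything else). A minimal training-step sketch of that split, based only on the signatures documented in this diff and assuming the `smdistributed.dataparallel` library is present (it only exists inside a SageMaker data-parallel training container, so this is illustrative rather than locally runnable):

```python
# Illustrative sketch, not the library's canonical example: it assumes
# smdistributed.dataparallel is importable, as inside a SageMaker data-parallel job.
import tensorflow as tf
import smdistributed.dataparallel.tensorflow as smdp

smdp.init()


@tf.function
def training_step(model, loss_fn, optimizer, images, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # allreduce() is reserved for gradient tensors: each gradient is averaged
    # across workers before the optimizer applies it.
    grads = [
        smdp.allreduce(g, param_index=i, num_params=len(grads))
        for i, g in enumerate(grads)
    ]
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # Non-gradient tensors such as the scalar loss go through oob_allreduce(),
    # per the note above; mixing the two up can stall the training job.
    return smdp.oob_allreduce(tf.convert_to_tensor(loss))
```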

src/sagemaker/image_uri_config/neo-tensorflow.json

Lines changed: 32 additions & 1 deletion

@@ -12,7 +12,8 @@
     "1.11.0": "1.15.3",
     "1.12.0": "1.15.3",
     "1.13.0": "1.15.3",
-    "1.14.0": "1.15.3"
+    "1.14.0": "1.15.3",
+    "2.4.2": "2.4.2"
   },
   "versions": {
     "1.15.3": {
@@ -44,6 +45,36 @@
         "us-west-2": "301217895009"
       },
       "repository": "sagemaker-inference-tensorflow"
+    },
+    "2.4.2": {
+      "py_versions": ["py3"],
+      "registries": {
+        "af-south-1": "774647643957",
+        "ap-east-1": "110948597952",
+        "ap-northeast-1": "941853720454",
+        "ap-northeast-2": "151534178276",
+        "ap-northeast-3": "925152966179",
+        "ap-south-1": "763008648453",
+        "ap-southeast-1": "324986816169",
+        "ap-southeast-2": "355873309152",
+        "ca-central-1": "464438896020",
+        "cn-north-1": "472730292857",
+        "cn-northwest-1": "474822919863",
+        "eu-central-1": "746233611703",
+        "eu-north-1": "601324751636",
+        "eu-south-1": "966458181534",
+        "eu-west-1": "802834080501",
+        "eu-west-2": "205493899709",
+        "eu-west-3": "254080097072",
+        "me-south-1": "836785723513",
+        "sa-east-1": "756306329178",
+        "us-east-1": "785573368785",
+        "us-east-2": "007439368137",
+        "us-gov-west-1": "263933020539",
+        "us-west-1": "710691900526",
+        "us-west-2": "301217895009"
+      },
+      "repository": "sagemaker-inference-tensorflow"
     }
   }
 }
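Entries in this JSON are what the SDK's image URI lookup resolves against. A hedged sketch of retrieving the new 2.4.2 image is shown below; the region, instance type, and the exact `retrieve` call are illustrative assumptions, not values taken from this commit.

```python
from sagemaker import image_uris

# Hypothetical lookup: "neo-tensorflow" mirrors the config file's name, while the
# region and instance type are placeholders chosen only for illustration.
uri = image_uris.retrieve(
    framework="neo-tensorflow",
    region="us-west-2",
    version="2.4.2",
    py_version="py3",
    instance_type="ml.c5.xlarge",
)
print(uri)
```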

src/sagemaker/model.py

Lines changed: 6 additions & 3 deletions

@@ -466,7 +466,7 @@ def _upload_code(self, key_prefix: str, repack: bool = False) -> None:
         )
 
     def _script_mode_env_vars(self):
-        """Placeholder docstring"""
+        """Returns a mapping of environment variables for script mode execution"""
         script_name = None
         dir_name = None
         if self.uploaded_code:
@@ -478,8 +478,11 @@ def _script_mode_env_vars(self):
         elif self.entry_point is not None:
             script_name = self.entry_point
             if self.source_dir is not None:
-                dir_name = "file://" + self.source_dir
-
+                dir_name = (
+                    self.source_dir
+                    if self.source_dir.startswith("s3://")
+                    else "file://" + self.source_dir
+                )
         return {
             SCRIPT_PARAM_NAME.upper(): script_name or str(),
             DIR_PARAM_NAME.upper(): dir_name or str(),
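The behavioral change is easier to see in isolation: an S3 `source_dir` is now passed through untouched instead of getting a `file://` prefix. A small standalone sketch of that rule (the helper name is hypothetical, not part of the SDK):

```python
def _resolve_dir_name(source_dir):
    # Hypothetical helper mirroring the new conditional in _script_mode_env_vars:
    # S3 URIs pass through as-is, local paths get the file:// scheme.
    if source_dir is None:
        return None
    return source_dir if source_dir.startswith("s3://") else "file://" + source_dir


assert _resolve_dir_name("s3://my-bucket/code") == "s3://my-bucket/code"
assert _resolve_dir_name("/opt/ml/code") == "file:///opt/ml/code"
```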

src/sagemaker/serializers.py

Lines changed: 3 additions & 4 deletions

@@ -381,10 +381,9 @@ def serialize(self, data):
         """
         if isinstance(data, str):
             try:
-                dataFile = open(data, "rb")
-                dataFileInfo = dataFile.read()
-                dataFile.close()
-                return dataFileInfo
+                with open(data, "rb") as data_file:
+                    data_file_info = data_file.read()
+                return data_file_info
             except Exception as e:
                 raise ValueError(f"Could not open/read file: {data}. {e}")
         if isinstance(data, bytes):
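The fix swaps a manually closed file handle for a context manager, so the handle is released even if `read()` raises. The same pattern in a standalone sketch (a simplified stand-in, not the SDK class itself):

```python
def serialize_payload(data):
    # Simplified stand-in for the serializer's file-path branch.
    if isinstance(data, str):
        try:
            with open(data, "rb") as data_file:
                return data_file.read()
        except Exception as e:
            raise ValueError(f"Could not open/read file: {data}. {e}")
    if isinstance(data, bytes):
        return data
    raise ValueError("Unsupported payload type")
```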

src/sagemaker/workflow/steps.py

Lines changed: 2 additions & 0 deletions

@@ -301,6 +301,8 @@ def arguments(self) -> RequestType:
         )
         request_dict = self.estimator.sagemaker_session._get_train_request(**train_args)
         request_dict.pop("TrainingJobName")
+        if "HyperParameters" in request_dict:
+            request_dict["HyperParameters"].pop("sagemaker_job_name", None)
 
         return request_dict
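The two added lines strip the `sagemaker_job_name` hyperparameter from the training request the step generates, alongside the already-removed `TrainingJobName`, so the emitted arguments do not carry a generated one-off job name. A toy sketch of the same sanitization (the `request_dict` contents are made up for illustration):

```python
# Hypothetical request payload; only the two pop() calls mirror the change above.
request_dict = {
    "TrainingJobName": "my-job-2022-02-25",
    "HyperParameters": {"epochs": "10", "sagemaker_job_name": "my-job-2022-02-25"},
}
request_dict.pop("TrainingJobName")
if "HyperParameters" in request_dict:
    request_dict["HyperParameters"].pop("sagemaker_job_name", None)

assert "sagemaker_job_name" not in request_dict["HyperParameters"]
```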

tests/integ/__init__.py

Lines changed: 0 additions & 6 deletions

@@ -148,12 +148,6 @@
     "eu-west-2",
     "us-east-1",
 ]
-NO_SM_PIPELINE_MM_CLARIFY_CHECK_STEP_REGIONS = [
-    "ap-northeast-3",
-    "ap-south-1",
-    "eu-north-1",
-    "sa-east-1",
-]
 EDGE_PACKAGING_SUPPORTED_REGIONS = [
     "us-east-2",
     "us-west-2",

tests/integ/sagemaker/lineage/conftest.py

Lines changed: 3 additions & 1 deletion

@@ -26,7 +26,9 @@
     artifact,
 )
 from sagemaker.model import ModelPackage
-from tests.integ.test_workflow import test_end_to_end_pipeline_successful_execution
+from tests.integ.sagemaker.workflow.test_workflow import (
+    test_end_to_end_pipeline_successful_execution,
+)
 from sagemaker.workflow.pipeline import _PipelineExecution
 from sagemaker.session import get_execution_role
 from smexperiments import trial_component, trial, experiment

tests/integ/sagemaker/workflow/__init__.py

Whitespace-only changes.

Lines changed: 118 additions & 0 deletions

@@ -0,0 +1,118 @@
+# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License"). You
+# may not use this file except in compliance with the License. A copy of
+# the License is located at
+#
+#     http://aws.amazon.com/apache2.0/
+#
+# or in the "license" file accompanying this file. This file is
+# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
+# ANY KIND, either express or implied. See the License for the specific
+# language governing permissions and limitations under the License.
+from __future__ import absolute_import
+
+import re
+
+import pytest
+
+from sagemaker import get_execution_role, utils
+from sagemaker.workflow.callback_step import CallbackOutput, CallbackStep, CallbackOutputTypeEnum
+from sagemaker.workflow.parameters import ParameterInteger
+from sagemaker.workflow.pipeline import Pipeline
+
+
+@pytest.fixture
+def role(sagemaker_session):
+    return get_execution_role(sagemaker_session)
+
+
+@pytest.fixture
+def pipeline_name():
+    return utils.unique_name_from_base("my-pipeline-callback")
+
+
+@pytest.fixture
+def region_name(sagemaker_session):
+    return sagemaker_session.boto_session.region_name
+
+
+def test_one_step_callback_pipeline(sagemaker_session, role, pipeline_name, region_name):
+    instance_count = ParameterInteger(name="InstanceCount", default_value=2)
+
+    outputParam1 = CallbackOutput(output_name="output1", output_type=CallbackOutputTypeEnum.String)
+    step_callback = CallbackStep(
+        name="callback-step",
+        sqs_queue_url="https://sqs.us-east-2.amazonaws.com/123456789012/MyQueue",
+        inputs={"arg1": "foo"},
+        outputs=[outputParam1],
+    )
+
+    pipeline = Pipeline(
+        name=pipeline_name,
+        parameters=[instance_count],
+        steps=[step_callback],
+        sagemaker_session=sagemaker_session,
+    )
+
+    try:
+        response = pipeline.create(role)
+        create_arn = response["PipelineArn"]
+        assert re.match(
+            rf"arn:aws:sagemaker:{region_name}:\d{{12}}:pipeline/{pipeline_name}",
+            create_arn,
+        )
+
+        pipeline.parameters = [ParameterInteger(name="InstanceCount", default_value=1)]
+        response = pipeline.update(role)
+        update_arn = response["PipelineArn"]
+        assert re.match(
+            rf"arn:aws:sagemaker:{region_name}:\d{{12}}:pipeline/{pipeline_name}",
+            update_arn,
+        )
+    finally:
+        try:
+            pipeline.delete()
+        except Exception:
+            pass
+
+
+def test_two_step_callback_pipeline_with_output_reference(
+    sagemaker_session, role, pipeline_name, region_name
+):
+    instance_count = ParameterInteger(name="InstanceCount", default_value=2)
+
+    outputParam1 = CallbackOutput(output_name="output1", output_type=CallbackOutputTypeEnum.String)
+    step_callback1 = CallbackStep(
+        name="callback-step1",
+        sqs_queue_url="https://sqs.us-east-2.amazonaws.com/123456789012/MyQueue",
+        inputs={"arg1": "foo"},
+        outputs=[outputParam1],
+    )
+
+    step_callback2 = CallbackStep(
+        name="callback-step2",
+        sqs_queue_url="https://sqs.us-east-2.amazonaws.com/123456789012/MyQueue",
+        inputs={"arg1": outputParam1},
+        outputs=[],
+    )
+
+    pipeline = Pipeline(
+        name=pipeline_name,
+        parameters=[instance_count],
+        steps=[step_callback1, step_callback2],
+        sagemaker_session=sagemaker_session,
+    )
+
+    try:
+        response = pipeline.create(role)
+        create_arn = response["PipelineArn"]
+        assert re.match(
+            rf"arn:aws:sagemaker:{region_name}:\d{{12}}:pipeline/{pipeline_name}",
+            create_arn,
+        )
+    finally:
+        try:
+            pipeline.delete()
+        except Exception:
+            pass

tests/integ/test_workflow_with_clarify_check_steps.py renamed to tests/integ/sagemaker/workflow/test_clarify_check_steps.py

Lines changed: 0 additions & 9 deletions

@@ -19,7 +19,6 @@
 import pytest
 from botocore.exceptions import WaiterError
 
-import tests
 from sagemaker.clarify import (
     BiasConfig,
     DataConfig,
@@ -129,10 +128,6 @@ def data_bias_check_config(data_config, bias_config):
 )
 
 
-@pytest.mark.skipif(
-    tests.integ.test_region() in tests.integ.NO_SM_PIPELINE_MM_CLARIFY_CHECK_STEP_REGIONS,
-    reason=f"ClarifyCheckStep is not fully deployed in {tests.integ.test_region()}.",
-)
 def test_one_step_data_bias_pipeline_happycase(
     sagemaker_session,
     role,
@@ -220,10 +215,6 @@ def test_one_step_data_bias_pipeline_happycase(
             pass
 
 
-@pytest.mark.skipif(
-    tests.integ.test_region() in tests.integ.NO_SM_PIPELINE_MM_CLARIFY_CHECK_STEP_REGIONS,
-    reason=f"ClarifyCheckStep is not fully deployed in {tests.integ.test_region()}.",
-)
 def test_one_step_data_bias_pipeline_constraint_violation(
     sagemaker_session,
     role,
