
Commit cfe6e25
Merge branch 'master' into iss97027
2 parents: 998ae97 + 131947f

23 files changed (+405, -89 lines)

CHANGELOG.md
Lines changed: 42 additions & 0 deletions

@@ -1,5 +1,47 @@
 # Changelog
 
+## v2.24.3 (2021-02-04)
+
+### Bug Fixes and Other Changes
+
+ * Remove pytest fixture and fix test_tag/s method
+
+## v2.24.2 (2021-02-03)
+
+### Bug Fixes and Other Changes
+
+ * use 3.5 version of get-pip.py
+ * SM DDP release notes/changelog files
+
+### Documentation Changes
+
+ * adding versioning to sm distributed data parallel docs
+
+## v2.24.1 (2021-01-28)
+
+### Bug Fixes and Other Changes
+
+ * fix collect-tests tox env
+ * create profiler specific unsupported regions
+ * Update smd_model_parallel_pytorch.rst
+
+## v2.24.0 (2021-01-22)
+
+### Features
+
+ * add support for Std:Join for pipelines
+ * Map image name to image uri
+ * friendly names for short URIs
+
+### Bug Fixes and Other Changes
+
+ * increase allowed time for search to get updated
+ * refactor distribution config construction
+
+### Documentation Changes
+
+ * Add SMP 1.2.0 API docs
+
 ## v2.23.6 (2021-01-20)
 
 ### Bug Fixes and Other Changes
VERSION
Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-2.23.7.dev0
+2.24.4.dev0
Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+
+Version 1.0.0 (Latest)
+======================
+
+.. toctree::
+   :maxdepth: 1
+
+   v1.0.0/smd_data_parallel_pytorch.rst
+   v1.0.0/smd_data_parallel_tensorflow.rst

doc/api/training/smd_data_parallel.rst
Lines changed: 69 additions & 35 deletions

@@ -6,39 +6,36 @@ SageMaker's distributed data parallel library extends SageMaker’s training
 capabilities on deep learning models with near-linear scaling efficiency,
 achieving fast time-to-train with minimal code changes.
 
-- optimizes your training job for AWS network infrastructure and EC2 instance topology.
-- takes advantage of gradient update to communicate between nodes with a custom AllReduce algorithm.
-
 When training a model on a large amount of data, machine learning practitioners
 will often turn to distributed training to reduce the time to train.
 In some cases, where time is of the essence,
 the business requirement is to finish training as quickly as possible or at
 least within a constrained time period.
 Then, distributed training is scaled to use a cluster of multiple nodes,
 meaning not just multiple GPUs in a computing instance, but multiple instances
-with multiple GPUs. As the cluster size increases, so does the significant drop
-in performance. This drop in performance is primarily caused the communications
-overhead between nodes in a cluster.
+with multiple GPUs. However, as the cluster size increases, it is possible to see a significant drop
+in performance due to communications overhead between nodes in a cluster.
 
-.. important::
-   The distributed data parallel library only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
-   ``Estimator`` with ``dataparallel`` parameter ``enabled`` set to ``True``,
-   it uses CUDA 11. When you extend or customize your own training image
-   you must use a CUDA 11 base image. See
-   `SageMaker Python SDK's distributed data parallel library APIs
-   <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__
-   for more information.
+SageMaker's distributed data parallel library addresses communications overhead in two ways:
 
-.. rubric:: Customize your training script
+1. The library performs AllReduce, a key operation during distributed training that is responsible for a
+   large portion of communication overhead.
+2. The library performs optimized node-to-node communication by fully utilizing AWS’s network
+   infrastructure and Amazon EC2 instance topology.
 
-To customize your own training script, you will need the following:
+To learn more about the core features of this library, see
+`Introduction to SageMaker's Distributed Data Parallel Library
+<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html>`_
+in the SageMaker Developer Guide.
 
-.. raw:: html
+Use with the SageMaker Python SDK
+=================================
 
-   <div data-section-style="5" style="">
+To use the SageMaker distributed data parallel library with the SageMaker Python SDK, you will need the following:
 
-- You must provide TensorFlow / PyTorch training scripts that are
-  adapted to use the distributed data parallel library.
+- A TensorFlow or PyTorch training script that is
+  adapted to use the distributed data parallel library. The :ref:`sdp_api_docs` includes
+  framework specific examples of training scripts that are adapted to use this library.
 - Your input data must be in an S3 bucket or in FSx in the AWS region
   that you will use to launch your training job. If you use the Jupyter
   notebooks provided, create a SageMaker notebook instance in the same
@@ -47,26 +44,63 @@ To customize your own training script, you will need the following:
  the `SageMaker Python SDK data
  inputs <https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-inputs>`__ documentation.
 
-.. raw:: html
+When you define
+a Pytorch or TensorFlow ``Estimator`` using the SageMaker Python SDK,
+you must select ``dataparallel`` as your ``distribution`` strategy:
+
+.. code::
+
+   distribution = { "smdistributed": { "dataparallel": { "enabled": True } } }
+
+We recommend you use one of the example notebooks as your template to launch a training job. When
+you use an example notebook you’ll need to swap your training script with the one that came with the
+notebook and modify any input functions as necessary. For instructions on how to get started using a
+Jupyter Notebook example, see `Distributed Training Jupyter Notebook Examples
+<https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training-notebook-examples.html>`_.
+
+Once you have launched a training job, you can monitor it using CloudWatch. To learn more, see
+`Monitor and Analyze Training Jobs Using Metrics
+<https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html>`_.
+
 
-   </div>
+After you train a model, you can see how to deploy your trained model to an endpoint for inference by
+following one of the `example notebooks for deploying a model
+<https://sagemaker-examples.readthedocs.io/en/latest/inference/index.html>`_.
+For more information, see `Deploy Models for Inference
+<https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html>`_.
 
-Use the API guides for each framework to see
-examples of training scripts that can be used to convert your training scripts.
-Then use one of the example notebooks as your template to launch a training job.
-You’ll need to swap your training script with the one that came with the
-notebook and modify any input functions as necessary.
-Once you have launched a training job, you can monitor it using CloudWatch.
+.. _sdp_api_docs:
 
-Then you can see how to deploy your trained model to an endpoint by
-following one of the example notebooks for deploying a model. Finally,
-you can follow an example notebook to test inference on your deployed
-model.
+API Documentation
+=================
 
+This section contains the SageMaker distributed data parallel API documentation. If you are a
+new user of this library, it is recommended you use this guide alongside
+`SageMaker's Distributed Data Parallel Library
+<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html>`_.
 
+Select a version to see the API documentation for that version.
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
+
+   sdp_versions/v1_0_0.rst
+
+.. important::
+   The distributed data parallel library only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
+   ``Estimator`` with ``dataparallel`` parameter ``enabled`` set to ``True``,
+   it uses CUDA 11. When you extend or customize your own training image
+   you must use a CUDA 11 base image. See
+   `SageMaker Python SDK's distributed data parallel library APIs
+   <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`_
+   for more information.
+
+
+Release Notes
+=============
+
+New features, bug fixes, and improvements are regularly made to the SageMaker distributed data parallel library.
 
-   sdp_versions/smd_data_parallel_pytorch
-   sdp_versions/smd_data_parallel_tensorflow
+To see the latest changes made to the library, refer to the library
+`Release Notes
+<https://github.com/aws/sagemaker-python-sdk/blob/master/doc/api/training/smd_data_parallel_release_notes/>`_.
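The documentation added above shows only the ``distribution`` dictionary. As a minimal, hedged sketch of how that dictionary is passed to a framework ``Estimator`` (the entry point, role, and S3 input below are placeholders and are not part of this commit; the instance type must be one the data parallel library supports):

    # Sketch only: train.py, the role ARN, and the bucket are hypothetical placeholders.
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",            # placeholder training script adapted for the library
        role="<execution-role-arn>",       # placeholder IAM role
        instance_count=2,
        instance_type="ml.p3.16xlarge",    # a supported multi-GPU instance type
        framework_version="1.6.0",
        py_version="py36",
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )
    estimator.fit("s3://my-bucket/training-data")  # placeholder S3 input

Setting ``enabled`` to ``True`` is what selects the CUDA 11 data parallel containers described in the important-block above.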
Lines changed: 20 additions & 0 deletions

@@ -0,0 +1,20 @@
+# Sagemaker Distributed Data Parallel 1.0.0 Release Notes
+
+- First Release
+- Getting Started
+
+## First Release
+
+SageMaker's distributed data parallel library extends SageMaker’s training
+capabilities on deep learning models with near-linear scaling efficiency,
+achieving fast time-to-train with minimal code changes.
+SageMaker Distributed Data Parallel:
+
+- optimizes your training job for AWS network infrastructure and EC2 instance topology.
+- takes advantage of gradient update to communicate between nodes with a custom AllReduce algorithm.
+
+The library currently supports TensorFlow v2 and PyTorch via [AWS Deep Learning Containers](https://aws.amazon.com/machine-learning/containers/).
+
+## Getting Started
+
+For getting started, refer to [SageMaker Distributed Data Parallel Python SDK Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api).
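The release notes above describe adapting a training script to the library. As a rough, hedged sketch of what that adaptation looks like for PyTorch (assuming the ``smdistributed.dataparallel`` v1.0 import paths from the SDK's framework guides; the model here is a placeholder):

    # Sketch only: the Linear model stands in for a real training script's model.
    import torch
    import smdistributed.dataparallel.torch.distributed as dist
    from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP

    dist.init_process_group()                  # initialize the library's process group
    torch.cuda.set_device(dist.get_local_rank())  # one GPU per process

    model = torch.nn.Linear(10, 1).cuda()      # placeholder model
    model = DDP(model)                         # wrap with the library's DistributedDataParallel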

doc/api/training/smd_model_parallel.rst
Lines changed: 9 additions & 9 deletions

@@ -11,15 +11,6 @@ across multiple GPUs with minimal code changes. The library's API can be accesse
 
 Use the following sections to learn more about the model parallelism and the library.
 
-.. important::
-   The model parallel library only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
-   ``Estimator`` with ``modelparallel`` parameter ``enabled`` set to ``True``,
-   it uses CUDA 11. When you extend or customize your own training image
-   you must use a CUDA 11 base image. See
-   `Extend or Adapt A Docker Container that Contains the Model Parallel Library
-   <https://integ-docs-aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html#model-parallel-customize-container>`__
-   for more information.
-
 Use with the SageMaker Python SDK
 =================================
 
@@ -61,6 +52,15 @@ developer guide. This developer guide documentation includes:
   <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-tips-pitfalls.html>`__
 
 
+.. important::
+   The model parallel library only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
+   ``Estimator`` with ``modelparallel`` parameter ``enabled`` set to ``True``,
+   it uses CUDA 11. When you extend or customize your own training image
+   you must use a CUDA 11 base image. See
+   `Extend or Adapt A Docker Container that Contains the Model Parallel Library
+   <https://integ-docs-aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html#model-parallel-customize-container>`__
+   for more information.
+
 Release Notes
 =============
 
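The relocated important-block refers to enabling the ``modelparallel`` parameter on an ``Estimator``. A hedged sketch of that ``distribution`` dictionary follows; the partition, microbatch, and MPI values are illustrative assumptions, not prescribed by this commit:

    # Illustrative values only; tune partitions/microbatches/processes to your cluster.
    distribution = {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {"partitions": 2, "microbatches": 4},
            }
        },
        "mpi": {"enabled": True, "processes_per_host": 8},
    }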

doc/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.rst
Lines changed: 3 additions & 3 deletions

@@ -140,16 +140,16 @@ This API document assumes you use the following import statements in your traini
   computation. \ ``bucket_cap_mb``\ controls the bucket size in MegaBytes
   (MB).
 
-- ``trace_memory_usage`` (default: False): When set to True, the library attempts
+- ``trace_memory_usage`` (default: False): When set to True, the library attempts
   to measure memory usage per module during tracing. If this is disabled,
   memory usage will be estimated through the sizes of tensors returned from
   the module.
 
-- ``broadcast_buffers`` (default: True): Flag to be used with ``ddp=True``.
+- ``broadcast_buffers`` (default: True): Flag to be used with ``ddp=True``.
   This parameter is forwarded to the underlying ``DistributedDataParallel`` wrapper.
   Please see: `broadcast_buffer <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`__.
 
-- ``gradient_as_bucket_view (PyTorch 1.7 only)`` (default: False): To be
+- ``gradient_as_bucket_view (PyTorch 1.7 only)`` (default: False): To be
   used with ``ddp=True``. This parameter is forwarded to the underlying
   ``DistributedDataParallel`` wrapper. Please see `gradient_as_bucket_view <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`__.
 
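The parameters documented in this page belong to the model parallel library's ``DistributedModel`` wrapper. A hedged usage sketch, assuming the ``smdistributed.modelparallel.torch`` import path and a placeholder model, with the defaults stated above:

    # Sketch only: the Linear model is a placeholder; flags mirror the documented defaults
    # and only take effect when the library's DDP mode is enabled.
    import torch
    import smdistributed.modelparallel.torch as smp

    model = torch.nn.Linear(10, 1)
    model = smp.DistributedModel(
        model,
        trace_memory_usage=False,       # measure per-module memory during tracing when True
        broadcast_buffers=True,         # forwarded to the DistributedDataParallel wrapper
        gradient_as_bucket_view=False,  # PyTorch 1.7 only
    )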

src/sagemaker/estimator.py
Lines changed: 3 additions & 2 deletions

@@ -49,6 +49,7 @@
     UploadedCode,
     validate_source_dir,
     _region_supports_debugger,
+    _region_supports_profiler,
     get_mp_parameters,
 )
 from sagemaker.inputs import TrainingInput
@@ -494,7 +495,7 @@ def _prepare_profiler_for_training(self):
         """Set necessary values and do basic validations in profiler config and profiler rules.
 
         When user explicitly set rules to an empty list, default profiler rule won't be enabled.
-        Default profiler rule will be enabled when either:
+        Default profiler rule will be enabled in supported regions when either:
        1. user doesn't specify any rules, i.e., rules=None; or
        2. user only specify debugger rules, i.e., rules=[Rule.sagemaker(...)]
        """
@@ -503,7 +504,7 @@ def _prepare_profiler_for_training(self):
                 raise RuntimeError("profiler_config cannot be set when disable_profiler is True.")
             if self.profiler_rules:
                 raise RuntimeError("ProfilerRule cannot be set when disable_profiler is True.")
-        elif _region_supports_debugger(self.sagemaker_session.boto_region_name):
+        elif _region_supports_profiler(self.sagemaker_session.boto_region_name):
            if self.profiler_config is None:
                self.profiler_config = ProfilerConfig(s3_output_path=self.output_path)
            if self.rules is None or (self.rules and not self.profiler_rules):
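The change above makes the default profiler setup region-aware. For context, a hedged sketch of how an estimator ends up with a profiler configuration (the image URI and role are placeholders; ``ProfilerConfig`` is the class referenced in the diff):

    # Sketch: placeholders for image_uri/role; comments describe the two paths the diff distinguishes.
    from sagemaker.debugger import ProfilerConfig
    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="<training-image-uri>",   # placeholder
        role="<execution-role-arn>",        # placeholder
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        # If profiler_config is omitted and the region supports profiling, a default
        # ProfilerConfig(s3_output_path=<output_path>) is attached automatically;
        # in unsupported regions (see PROFILER_UNSUPPORTED_REGIONS below) it is not.
        profiler_config=ProfilerConfig(system_monitor_interval_millis=500),
    )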

src/sagemaker/fw_utils.py
Lines changed: 15 additions & 0 deletions

@@ -49,6 +49,8 @@
 )
 
 DEBUGGER_UNSUPPORTED_REGIONS = ("us-iso-east-1",)
+PROFILER_UNSUPPORTED_REGIONS = ("us-iso-east-1", "cn-north-1", "cn-northwest-1")
+
 SINGLE_GPU_INSTANCE_TYPES = ("ml.p2.xlarge", "ml.p3.2xlarge")
 SM_DATAPARALLEL_SUPPORTED_INSTANCE_TYPES = (
     "ml.p3.16xlarge",
@@ -550,6 +552,19 @@ def _region_supports_debugger(region_name):
     return region_name.lower() not in DEBUGGER_UNSUPPORTED_REGIONS
 
 
+def _region_supports_profiler(region_name):
+    """Returns bool indicating whether region supports Amazon SageMaker Debugger profiling feature.
+
+    Args:
+        region_name (str): Name of the region to check against.
+
+    Returns:
+        bool: Whether or not the region supports Amazon SageMaker Debugger profiling feature.
+
+    """
+    return region_name.lower() not in PROFILER_UNSUPPORTED_REGIONS
+
+
 def validate_version_or_image_args(framework_version, py_version, image_uri):
     """Checks if version or image arguments are specified.
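A small illustration of the new helper's behavior, derived directly from the tuple added above:

    # Profiling defaults are skipped in the listed regions.
    from sagemaker.fw_utils import _region_supports_profiler

    print(_region_supports_profiler("us-west-2"))   # True: not in PROFILER_UNSUPPORTED_REGIONS
    print(_region_supports_profiler("cn-north-1"))  # False: the default profiler stays off here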

src/sagemaker/processing.py
Lines changed: 5 additions & 0 deletions

@@ -31,6 +31,7 @@
 from sagemaker.session import Session
 from sagemaker.network import NetworkConfig  # noqa: F401 # pylint: disable=unused-import
 from sagemaker.workflow.properties import Properties
+from sagemaker.workflow.entities import Expression
 from sagemaker.dataset_definition.inputs import S3Input, DatasetDefinition
 from sagemaker.apiutils._base_types import ApiObject
 
@@ -338,6 +339,10 @@ def _normalize_outputs(self, outputs=None):
             # Generate a name for the ProcessingOutput if it doesn't have one.
             if output.output_name is None:
                 output.output_name = "output-{}".format(count)
+            # if the output's destination is a workflow expression, do no normalization
+            if isinstance(output.destination, Expression):
+                normalized_outputs.append(output)
+                continue
             # If the output's destination is not an s3_uri, create one.
             parse_result = urlparse(output.destination)
             if parse_result.scheme != "s3":
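The new branch leaves workflow expressions untouched instead of rewriting them into default S3 URIs. A hedged illustration, with a placeholder bucket and prefix (``Join`` is the class added in ``sagemaker/workflow/functions.py`` below):

    # Sketch: a Join-valued destination is appended as-is, not normalized to an S3 URI.
    from sagemaker.processing import ProcessingOutput
    from sagemaker.workflow.functions import Join

    output = ProcessingOutput(
        output_name="train",
        source="/opt/ml/processing/train",
        destination=Join(on="/", values=["s3://my-bucket", "processing", "train"]),  # placeholder
    )
    # Because isinstance(output.destination, Expression) is True, _normalize_outputs keeps
    # the expression so SageMaker Pipelines can resolve it at execution time.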

src/sagemaker/workflow/functions.py
Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
+# Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License"). You
+# may not use this file except in compliance with the License. A copy of
+# the License is located at
+#
+#     http://aws.amazon.com/apache2.0/
+#
+# or in the "license" file accompanying this file. This file is
+# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
+# ANY KIND, either express or implied. See the License for the specific
+# language governing permissions and limitations under the License.
+"""The step definitions for workflow."""
+from __future__ import absolute_import
+
+from typing import List
+
+import attr
+
+from sagemaker.workflow.entities import Expression
+
+
+@attr.s
+class Join(Expression):
+    """Join together properties.
+
+    Attributes:
+        values (List[Union[PrimitiveType, Parameter]]): The primitive types
+            and parameters to join.
+        on_str (str): The string to join the values on (Defaults to "").
+    """
+
+    on: str = attr.ib(factory=str)
+    values: List = attr.ib(factory=list)
+
+    @property
+    def expr(self):
+        """The expression dict for a `Join` function."""
+        return {
+            "Std:Join": {
+                "On": self.on,
+                "Values": [
+                    value.expr if hasattr(value, "expr") else value for value in self.values
+                ],
+            },
+        }
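A quick usage sketch of the new ``Join`` expression (the joined values are arbitrary placeholders); it shows the ``Std:Join`` structure the ``expr`` property builds:

    # Join renders its values into the Std:Join expression dict used by pipelines.
    from sagemaker.workflow.functions import Join

    uri = Join(on="/", values=["s3://my-bucket", "my-prefix", "output.csv"])  # placeholder values
    print(uri.expr)
    # {'Std:Join': {'On': '/', 'Values': ['s3://my-bucket', 'my-prefix', 'output.csv']}}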

tests/data/multimodel/container/Dockerfile
Lines changed: 1 addition & 1 deletion

@@ -15,7 +15,7 @@ RUN apt-get update && \
     curl \
     vim \
     && rm -rf /var/lib/apt/lists/* \
-    && curl -O https://bootstrap.pypa.io/get-pip.py \
+    && curl -O https://bootstrap.pypa.io/3.5/get-pip.py \
     && python3 get-pip.py
 
 RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 1
