Commit a9cfab4

Merge branch 'master' into accept-step-object-in-dependson-list
2 parents 5bc47bd + e25d36c

19 files changed: +2275 −312 lines

CHANGELOG.md
+7

@@ -1,5 +1,12 @@
 # Changelog

+## v2.47.2.post0 (2021-07-01)
+
+### Documentation Changes
+
+* smddp 1.2.1 release note / convert md to rst
+* add smd model parallel 1.4.0 release note / restructure doc files
+
 ## v2.47.2 (2021-06-30)

 ### Bug Fixes and Other Changes

doc/api/training/sdp_versions/latest.rst
+1 −1

@@ -1,5 +1,5 @@

-Version 1.2.0 (Latest)
+Version 1.2.x (Latest)
 ======================

 .. toctree::

doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst
+1 −1

@@ -157,7 +157,7 @@ TensorFlow API

 .. rubric:: Supported versions

-**TensorFlow 2.3.1, 2.4.1**
+**TensorFlow 2.3.1, 2.4.1, 2.5.0**

 .. function:: smdistributed.dataparallel.tensorflow.init()
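The file above documents the library's TensorFlow API. For context, a minimal TF2 training-step sketch using the documented ``smdistributed.dataparallel.tensorflow`` interface; the model, optimizer, loss, and data are illustrative placeholders, not part of this commit:

.. code:: python

   import tensorflow as tf
   import smdistributed.dataparallel.tensorflow as sdp

   sdp.init()

   # Pin each training process to a single GPU via the library's local rank.
   gpus = tf.config.experimental.list_physical_devices("GPU")
   if gpus:
       tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], "GPU")

   # Placeholder model, optimizer, and loss; LR is scaled by world size.
   model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
   optimizer = tf.keras.optimizers.SGD(learning_rate=0.01 * sdp.size())
   loss_fn = tf.keras.losses.MeanSquaredError()

   @tf.function
   def train_step(x, y, first_batch):
       with tf.GradientTape() as tape:
           loss = loss_fn(y, model(x, training=True))
       # Wrap the tape so gradients are AllReduced across workers; per the
       # 1.2.1 release notes below, sparse_as_dense is now on by default.
       tape = sdp.DistributedGradientTape(tape)
       grads = tape.gradient(loss, model.trainable_variables)
       optimizer.apply_gradients(zip(grads, model.trainable_variables))
       if first_batch:
           # Sync initial model and optimizer state from rank 0.
           sdp.broadcast_variables(model.variables, root_rank=0)
           sdp.broadcast_variables(optimizer.variables(), root_rank=0)
       return loss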

doc/api/training/smd_data_parallel.rst
+6 −4

@@ -101,8 +101,10 @@ Select a version to see the API documentation for version.
 Release Notes
 =============

-New features, bug fixes, and improvements are regularly made to the SageMaker distributed data parallel library.
+New features, bug fixes, and improvements are regularly made to the SageMaker
+distributed data parallel library.

-To see the the latest changes made to the library, refer to the library
-`Release Notes
-<https://github.com/aws/sagemaker-python-sdk/blob/master/doc/api/training/smd_data_parallel_release_notes/>`_.
+.. toctree::
+   :maxdepth: 1
+
+   smd_data_parallel_release_notes/smd_data_parallel_change_log

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.md
−91
This file was deleted.

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst
+170 (new file)

@@ -0,0 +1,170 @@
SageMaker Distributed Data Parallel 1.2.1 Release Notes
=======================================================

*Date: June 29, 2021*

**New Features:**

- Added support for TensorFlow 2.5.0.

**Improvements:**

- Improved performance on a single node and small clusters (2-4 nodes).

**Bug Fixes:**

- Enabled ``sparse_as_dense`` by default for the SageMaker distributed data
  parallel library TensorFlow APIs ``DistributedGradientTape`` and
  ``DistributedOptimizer``.

**Migration to AWS Deep Learning Containers**

This version passed benchmark testing and is migrated to the following
AWS Deep Learning Containers:

- TensorFlow 2.5.0 DLC release: `v1.0-tf-2.5.0-tr-py37
  <https://github.com/aws/deep-learning-containers/releases/tag/v1.0-tf-2.5.0-tr-py37>`__

.. code::

   763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.5.0-gpu-py37-cu112-ubuntu18.04-v1.0
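As a usage note, a hedged sketch of launching a training job on this container through the SageMaker Python SDK; the entry point, role, and instance settings are illustrative placeholders, while the ``distribution`` argument is the SDK's documented switch for the data parallel library:

.. code:: python

   import sagemaker
   from sagemaker.tensorflow import TensorFlow

   estimator = TensorFlow(
       entry_point="train.py",  # hypothetical training script
       role=sagemaker.get_execution_role(),
       # EFA-capable instance type, per the 1.2.0 notes below.
       instance_type="ml.p3dn.24xlarge",
       instance_count=2,
       framework_version="2.5.0",
       py_version="py37",
       distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
   )
   estimator.fit()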
----

Release History
===============

SageMaker Distributed Data Parallel 1.2.0 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- New Features
- Bug Fixes

**New Features:**

- Support of the `EFA network
  interface <https://aws.amazon.com/hpc/efa/>`__ for distributed
  AllReduce. For best performance, it is recommended that you use an
  instance type that supports Amazon Elastic Fabric Adapter
  (ml.p3dn.24xlarge and ml.p4d.24xlarge) when you train a model using
  SageMaker distributed data parallel.

**Bug Fixes:**

- Improved performance on a single node and small clusters.

----

SageMaker Distributed Data Parallel 1.1.2 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Bug Fixes
- Known Issues

**Bug Fixes:**

- Fixed a bug that caused some TensorFlow operations to not work with
  certain data types. Operations forwarded from C++ have been extended
  to support every dtype supported by NCCL.

**Known Issues:**

- SageMaker distributed data parallel has slower throughput than NCCL
  when run on a single node. For the best performance, use multi-node
  distributed training with ``smdistributed.dataparallel``. Use a
  single node only for experimental runs while preparing your
  training pipeline.

----

SageMaker Distributed Data Parallel 1.1.1 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- New Features
- Bug Fixes
- Known Issues

**New Features:**

- Adds support for PyTorch 1.8.1.

**Bug Fixes:**

- Fixes a bug that caused gradients from one of the worker nodes to be
  added twice, resulting in incorrect ``all_reduce`` results under
  some conditions.

**Known Issues:**

- SageMaker distributed data parallel is still not efficient when run
  on a single node. For the best performance, use multi-node
  distributed training with ``smdistributed.dataparallel``. Use a
  single node only for experimental runs while preparing your training
  pipeline.
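The 1.1.x releases above concern the PyTorch side of the library; here is a minimal initialization sketch assuming the v1.1.x public PyTorch API, with a placeholder model:

.. code:: python

   import torch
   # Import the library after torch; a crash when importing smdataparallel
   # before PyTorch was fixed in 1.1.0 (see below).
   import smdistributed.dataparallel.torch.distributed as dist
   from smdistributed.dataparallel.torch.parallel.distributed import (
       DistributedDataParallel as DDP,
   )

   dist.init_process_group()

   # Pin each process to one GPU by local rank.
   torch.cuda.set_device(dist.get_local_rank())

   # Placeholder model; DDP wrapping makes backward passes AllReduce
   # gradients across all workers.
   model = DDP(torch.nn.Linear(10, 1).cuda())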
----

SageMaker Distributed Data Parallel 1.1.0 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- New Features
- Bug Fixes
- Improvements
- Known Issues

**New Features:**

- Adds support for PyTorch 1.8.0 with CUDA 11.1 and cuDNN 8.

**Bug Fixes:**

- Fixes a crash when importing ``smdataparallel`` before PyTorch.

**Improvements:**

- Updates the ``smdataparallel`` name in Python packages, descriptions, and
  log outputs.

**Known Issues:**

- SageMaker DataParallel is not efficient when run on a single node.
  For the best performance, use multi-node distributed training with
  ``smdataparallel``. Use a single node only for experimental runs
  while preparing your training pipeline.

Getting Started
---------------

To get started, refer to the `SageMaker Distributed Data Parallel Python
SDK Guide
<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__.

----

SageMaker Distributed Data Parallel 1.0.0 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- First Release
- Getting Started

First Release
-------------

SageMaker’s distributed data parallel library extends SageMaker’s
training capabilities on deep learning models with near-linear scaling
efficiency, achieving fast time-to-train with minimal code changes.
SageMaker Distributed Data Parallel:

- optimizes your training job for AWS network infrastructure and EC2
  instance topology.
- takes advantage of gradient updates to communicate between nodes with
  a custom AllReduce algorithm.

The library currently supports TensorFlow v2 and PyTorch via `AWS Deep
Learning
Containers <https://aws.amazon.com/machine-learning/containers/>`__.

Getting Started
---------------

To get started, refer to the `SageMaker Distributed Data Parallel
Python SDK
Guide <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__.

doc/api/training/smd_model_parallel.rst
+7 −4

@@ -35,6 +35,7 @@ Select a version to see the API documentation for version. To use the library, r
    :maxdepth: 1

    smp_versions/latest.rst
+   smp_versions/v1_3_0.rst
    smp_versions/v1_2_0.rst
    smp_versions/v1_1_0.rst

@@ -65,8 +66,10 @@ developer guide. This developer guide documentation includes:
 Release Notes
 =============

-New features, bug fixes, and improvements are regularly made to the SageMaker distributed model parallel library.
-To see the the latest changes made to the library, refer to the library
-`Release Notes
-<https://github.com/aws/sagemaker-python-sdk/blob/master/doc/api/training/smd_model_parallel_release_notes/>`_.
+New features, bug fixes, and improvements are regularly made to the SageMaker
+distributed model parallel library.

+.. toctree::
+   :maxdepth: 1
+
+   smd_model_parallel_release_notes/smd_model_parallel_change_log
