
Commit 3694bef

mchoi8739 and shreyapandit authored and committed
documentation: smdmp 1.8.1 release note (aws#3085)
Co-authored-by: Shreya Pandit <[email protected]>
1 parent 4388782 commit 3694bef

File tree: 3 files changed (+74, -18 lines)


doc/api/training/smd_model_parallel.rst

Lines changed: 11 additions & 9 deletions
@@ -11,10 +11,11 @@ across multiple GPUs with minimal code changes. The library's API can be accesse

 .. tip::

-   We recommended using this API documentation with the conceptual guide at
+   We recommend that you use this API documentation along with the conceptual guide at
    `SageMaker's Distributed Model Parallel
    <http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html>`_
-   in the *Amazon SageMaker developer guide*. This developer guide documentation includes:
+   in the *Amazon SageMaker developer guide*.
+   The conceptual guide includes the following topics:

    - An overview of model parallelism, and the library's
      `core features <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html>`_,

@@ -32,10 +33,11 @@ across multiple GPUs with minimal code changes. The library's API can be accesse

 .. important::

-   The model parallel library only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
-   ``Estimator`` with ``modelparallel`` parameter ``enabled`` set to ``True``,
-   it uses CUDA 11. When you extend or customize your own training image
-   you must use a CUDA 11 base image. See
-   `Extend or Adapt A Docker Container that Contains the Model Parallel Library
-   <https://integ-docs-aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html#model-parallel-customize-container>`__
-   for more information.
+   The model parallel library only supports SageMaker training jobs using CUDA 11.
+   Make sure you use the pre-built Deep Learning Containers.
+   If you want to extend or customize your own training image,
+   you must use a CUDA 11 base image. For more information, see `Extend a Prebuilt Docker
+   Container that Contains SageMaker's Distributed Model Parallel Library
+   <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-customize-container>`_
+   and `Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library
+   <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_.
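As a point of reference for the ``Estimator`` usage described in this file (not part of this commit), the following is a minimal sketch of launching a training job with the ``modelparallel`` option enabled through the SageMaker Python SDK. The entry point, IAM role, instance settings, and parameter values are placeholder assumptions; the pre-built CUDA 11 Deep Learning Container is resolved from the framework and Python versions.

.. code:: python

    # Minimal sketch (assumed values): run a SageMaker training job with the
    # model parallel library enabled on a pre-built CUDA 11 Deep Learning Container.
    from sagemaker.pytorch import PyTorch

    smp_options = {
        "enabled": True,
        "parameters": {       # configuration forwarded to smdistributed.modelparallel
            "partitions": 2,  # number of model partitions for pipeline parallelism
            "microbatches": 4,
            "ddp": True,
        },
    }

    estimator = PyTorch(
        entry_point="train.py",                                # hypothetical training script
        role="arn:aws:iam::111122223333:role/SageMakerRole",   # placeholder IAM role
        instance_type="ml.p3.16xlarge",
        instance_count=1,
        framework_version="1.10",                              # resolves to a pre-built CUDA 11 DLC
        py_version="py38",
        distribution={
            "smdistributed": {"modelparallel": smp_options},
            "mpi": {"enabled": True, "processes_per_host": 8},
        },
    )

    estimator.fit("s3://my-bucket/training-data")  # placeholder S3 input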

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst

Lines changed: 61 additions & 7 deletions
@@ -5,9 +5,68 @@ Release Notes

 New features, bug fixes, and improvements are regularly made to the SageMaker
 distributed model parallel library.

-SageMaker Distributed Model Parallel 1.8.0 Release Notes
+SageMaker Distributed Model Parallel 1.8.1 Release Notes
 ========================================================

+*Date: April. 23. 2022*
+
+**New Features**
+
+* Added support for more configurations of the Hugging Face Transformers GPT-2 and GPT-J models
+  with tensor parallelism: ``scale_attn_weights``, ``scale_attn_by_inverse_layer_idx``,
+  ``reorder_and_upcast_attn``. To learn more about these features, please refer to
+  the following model configuration classes
+  in the *Hugging Face Transformers documentation*:
+
+  * `transformers.GPT2Config <https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2Config>`_
+  * `transformers.GPTJConfig <https://huggingface.co/docs/transformers/model_doc/gptj#transformers.GPTJConfig>`_
+
+* Added support for activation checkpointing of modules which pass keyword value arguments
+  and arbitrary structures in their forward methods. This helps support
+  activation checkpointing with Hugging Face Transformers models even
+  when tensor parallelism is not enabled.
+
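To show where the configuration options and kwargs-friendly activation checkpointing named above would appear in user code, here is a hedged sketch based on the ``smdistributed.modelparallel.torch`` and ``transformers`` APIs (not part of this commit); the model size, the module selected for checkpointing, and the call ordering are illustrative assumptions.

.. code:: python

    # Hedged sketch: build a GPT-2 model with the newly supported configuration
    # flags while tensor parallelism is enabled.
    import smdistributed.modelparallel.torch as smp
    from transformers import GPT2Config, GPT2LMHeadModel

    smp.init()

    config = GPT2Config(
        scale_attn_weights=True,
        scale_attn_by_inverse_layer_idx=True,  # now supported with tensor parallelism
        reorder_and_upcast_attn=True,          # now supported with tensor parallelism
    )

    # Create the model inside the tensor-parallelism context so that supported
    # modules are distributed across the tensor-parallel group.
    with smp.tensor_parallelism(enabled=True):
        model = GPT2LMHeadModel(config)

    model = smp.DistributedModel(model)

    # Activation checkpointing also works for modules whose forward methods take
    # keyword arguments; the module path below is an assumption for illustration.
    smp.set_activation_checkpointing(model.module.transformer.h[0])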
+**Bug Fixes**
+
+* Fixed a correctness issue with tensor parallelism for GPT-J model
+  which was due to improper scaling during gradient reduction
+  for some layer normalization modules.
+* Fixed the creation of unnecessary additional processes which take up some
+  GPU memory on GPU 0 when the :class:`smp.allgather` collective is called.
+
+**Improvements**
+
+* Improved activation offloading so that activations are preloaded on a
+  per-layer basis as opposed to all activations for a micro batch earlier.
+  This not only improves memory efficiency and performance, but also makes
+  activation offloading a useful feature for non-pipeline parallelism cases.
+
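For context on where the activation offloading improvement applies (not part of this commit), below is a hedged sketch of the ``modelparallel`` parameters that typically turn offloading on; the specific values are arbitrary assumptions, not recommendations.

.. code:: python

    # Hedged sketch: library parameters that enable activation offloading when
    # passed through the estimator's ``distribution`` configuration.
    smp_parameters = {
        "pipeline_parallel_degree": 2,
        "microbatches": 4,
        "offload_activations": True,       # keep checkpointed activations in CPU memory
        "activation_loading_horizon": 4,   # how early activations are loaded back (assumed value)
    }

    distribution = {
        "smdistributed": {"modelparallel": {"enabled": True, "parameters": smp_parameters}},
        "mpi": {"enabled": True, "processes_per_host": 8},
    }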
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:
+
+* HuggingFace 4.17.0 DLC with PyTorch 1.10.2
+
+  .. code::
+
+    763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04
+
+* The binary file of this version of the library for custom container users
+
+  .. code::
+
+    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-04-14-03-58/smdistributed_modelparallel-1.8.1-cp38-cp38-linux_x86_64.whl
+
+----
+
+Release History
+===============
+
+SageMaker Distributed Model Parallel 1.8.0 Release Notes
+--------------------------------------------------------
+
 *Date: March. 23. 2022*

 **New Features**

@@ -32,18 +91,13 @@ This version passed benchmark testing and is migrated to the following AWS Deep

     763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04

-The binary file of this version of the library for custom container users:
+* The binary file of this version of the library for custom container users

   .. code::

     https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-03-12-00-33/smdistributed_modelparallel-1.8.0-cp38-cp38-linux_x86_64.whl

-----
-
-Release History
-===============
-
 SageMaker Distributed Model Parallel 1.7.0 Release Notes
 --------------------------------------------------------
doc/api/training/smp_versions/latest.rst

Lines changed: 2 additions & 2 deletions
@@ -10,8 +10,8 @@ depending on which version of the library you need to use.

 To use the library, reference the
 **Common API** documentation alongside the framework specific API documentation.

-Version 1.7.0, 1.8.0 (Latest)
-=============================
+Version 1.7.0, 1.8.0, 1.8.1 (Latest)
+====================================

 To use the library, reference the Common API documentation alongside the framework specific API documentation.