documentation: smdmp 1.8.1 release note #3085

Merged · 7 commits · May 6, 2022
20 changes: 11 additions & 9 deletions doc/api/training/smd_model_parallel.rst
@@ -11,10 +11,11 @@

.. tip::

We recommend that you use this API documentation along with the conceptual guide at
`SageMaker's Distributed Model Parallel
<http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html>`_
in the *Amazon SageMaker developer guide*.
The conceptual guide includes the following topics:

- An overview of model parallelism, and the library's
`core features <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html>`_,
@@ -32,10 +33,11 @@


.. important::
The model parallel library only supports SageMaker training jobs using CUDA 11.
Make sure you use the pre-built Deep Learning Containers.
If you want to extend or customize your own training image,
you must use a CUDA 11 base image. For more information, see `Extend a Prebuilt Docker
Container that Contains SageMaker's Distributed Model Parallel Library
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-customize-container>`_
and `Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_.
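
For reference, a minimal sketch of launching a training job that uses one of the
prebuilt CUDA 11 containers is shown below. The entry point, IAM role, instance
settings, and library parameters are placeholders, not values prescribed by this
documentation.

.. code:: python

   from sagemaker.pytorch import PyTorch

   # Placeholder values; substitute your own script, role, and instance settings.
   # Enabling the "modelparallel" distribution selects a prebuilt CUDA 11
   # training container for the given framework and Python versions.
   estimator = PyTorch(
       entry_point="train.py",
       role="<your-iam-role>",
       instance_type="ml.p3.16xlarge",
       instance_count=1,
       framework_version="1.10.2",
       py_version="py38",
       distribution={
           "smdistributed": {
               "modelparallel": {
                   "enabled": True,
                   "parameters": {"partitions": 2, "ddp": True},
               }
           },
           "mpi": {"enabled": True, "processes_per_host": 8},
       },
   )
   estimator.fit("s3://<your-bucket>/<training-data-prefix>")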
@@ -5,9 +5,68 @@ Release Notes
New features, bug fixes, and improvements are regularly made to the SageMaker
distributed model parallel library.

SageMaker Distributed Model Parallel 1.8.1 Release Notes
========================================================

*Date: April 23, 2022*

**New Features**

* Added support for more configurations of the Hugging Face Transformers GPT-2 and GPT-J models
with tensor parallelism: ``scale_attn_weights``, ``scale_attn_by_inverse_layer_idx``,
``reorder_and_upcast_attn``. To learn more about these options, see
the following model configuration classes
in the *Hugging Face Transformers documentation* (an illustrative sketch follows this list):

* `transformers.GPT2Config <https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2Config>`_
* `transformers.GPTJConfig <https://huggingface.co/docs/transformers/model_doc/gptj#transformers.GPTJConfig>`_

* Added support for activation checkpointing of modules that pass keyword arguments
and arbitrary structures in their forward methods. This helps support
activation checkpointing with Hugging Face Transformers models even
when tensor parallelism is not enabled.
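
The following is an illustrative sketch, not an excerpt from the library's documentation:
the three flags are standard ``transformers.GPT2Config`` arguments, and the ``smp.init()``
call, the ``smp.tensor_parallelism`` context manager, and the ``smp.DistributedModel``
wrapper are assumed to behave as in earlier releases of the library.

.. code:: python

   import smdistributed.modelparallel.torch as smp
   from transformers import GPT2Config, GPT2LMHeadModel

   smp.init()  # assumed to be called once at the start of the training script

   # The three attention-related flags newly supported with tensor parallelism;
   # they are standard transformers.GPT2Config arguments.
   config = GPT2Config(
       scale_attn_weights=True,
       scale_attn_by_inverse_layer_idx=True,
       reorder_and_upcast_attn=True,
   )

   # Assumes the smp.tensor_parallelism context manager from earlier releases;
   # modules constructed inside it are registered for tensor parallelism.
   with smp.tensor_parallelism(enabled=True):
       model = GPT2LMHeadModel(config)

   model = smp.DistributedModel(model)  # assumed wrapper, as in prior versions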

**Bug Fixes**

* Fixed a correctness issue with tensor parallelism for the GPT-J model,
which was caused by improper scaling during gradient reduction
for some layer normalization modules.
* Fixed the creation of unnecessary additional processes, which took up
GPU memory on GPU 0, when the :class:`smp.allgather` collective was called.

**Improvements**

* Improved activation offloading so that activations are preloaded on a
per-layer basis, instead of preloading all activations for a micro-batch at once
as in previous releases. This not only improves memory efficiency and performance,
but also makes activation offloading useful when pipeline parallelism is not used.

**Migration to AWS Deep Learning Containers**

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:

* HuggingFace 4.17.0 DLC with PyTorch 1.10.2

.. code::

763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04


* The binary file of this version of the library for custom container users

.. code::

https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-04-14-03-58/smdistributed_modelparallel-1.8.1-cp38-cp38-linux_x86_64.whl
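
As a usage sketch, the container listed above could be referenced explicitly through
the ``image_uri`` argument of the SageMaker Python SDK estimator; the entry point,
IAM role, instance settings, and parallelism parameters below are placeholders.

.. code:: python

   from sagemaker.huggingface import HuggingFace

   # Placeholder role/instance/entry-point values; the image URI is the
   # Deep Learning Container listed above for this release.
   estimator = HuggingFace(
       entry_point="train_gpt.py",
       role="<your-iam-role>",
       instance_type="ml.p4d.24xlarge",
       instance_count=1,
       image_uri=(
           "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training"
           ":1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04"
       ),
       distribution={
           "smdistributed": {
               "modelparallel": {"enabled": True, "parameters": {"partitions": 2}}
           },
           "mpi": {"enabled": True, "processes_per_host": 8},
       },
   )
   estimator.fit("s3://<your-bucket>/<training-data-prefix>")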


----

Release History
===============

SageMaker Distributed Model Parallel 1.8.0 Release Notes
--------------------------------------------------------

*Date: March 23, 2022*

**New Features**
@@ -32,18 +91,13 @@
763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.10.2-transformers4.17.0-gpu-py38-cu113-ubuntu20.04


* The binary file of this version of the library for custom container users

.. code::

https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-03-12-00-33/smdistributed_modelparallel-1.8.0-cp38-cp38-linux_x86_64.whl



SageMaker Distributed Model Parallel 1.7.0 Release Notes
--------------------------------------------------------

4 changes: 2 additions & 2 deletions doc/api/training/smp_versions/latest.rst
@@ -10,8 +10,8 @@ depending on which version of the library you need to use.
To use the library, reference the
**Common API** documentation alongside the framework-specific API documentation.

Version 1.7.0, 1.8.0, 1.8.1 (Latest)
====================================

To use the library, reference the Common API documentation alongside the framework-specific API documentation.
