documentation: the SM model parallel library v1.15.0 release note #3806

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged 7 commits on May 1, 2023.

Release Notes
#############

New features, bug fixes, and improvements are regularly made to the SageMaker
model parallelism library.


SageMaker Distributed Model Parallel 1.15.0 Release Notes
=========================================================

*Date: Apr. 27. 2023*

**Currency Updates**

* Added support for PyTorch v2.0.0.
Note that the library does not support ``torch.compile`` in this release.

**New Features**

* Sharded data parallelism can now be used together with tensor parallelism
for PyTorch 1.13.1. This lets you train with smaller global batch
sizes while scaling up to large clusters (see the configuration sketch after this list).
For more information, see `Sharded
data parallelism with tensor parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism>`_
in the *Amazon SageMaker Developer Guide*.
* Added support for saving and loading full model checkpoints when using sharded
data parallelism. This is enabled through the standard checkpointing API,
``smp.save_checkpoint`` with ``partial=False`` (see the sketch after this list).
Previously, full checkpoints had to be created by merging partial checkpoint
files after training finished.
* `DistributedTransformer <https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.html#smdistributed.modelparallel.torch.nn.DistributedTransformerLayer>`_
now supports ALiBi (Attention with Linear Biases) position embeddings.
When using DistributedTransformer, set the ``use_alibi`` parameter
to ``True`` to use the Triton-based flash attention kernels. This helps
the model evaluate sequences longer than those used for training
(see the layer construction sketch after this list).
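
A minimal, non-authoritative sketch of enabling this combination through the
SageMaker Python SDK follows; the parallelism degrees, instance settings,
entry point, and role below are placeholder values, and the full set of
supported parameters is documented in the developer guide linked above.

.. code:: python

   from sagemaker.pytorch import PyTorch

   # Placeholder configuration: combine sharded data parallelism with
   # tensor parallelism. Tune the degrees to your model and cluster size.
   smp_options = {
       "enabled": True,
       "parameters": {
           "ddp": True,
           "sharded_data_parallel_degree": 8,  # shard model and optimizer states across 8 ranks
           "tensor_parallel_degree": 2,        # split individual layers across 2 ranks
       },
   }

   estimator = PyTorch(
       entry_point="train.py",                 # your training script
       role="<execution-role-arn>",
       instance_type="ml.p4d.24xlarge",
       instance_count=2,
       framework_version="1.13.1",
       py_version="py39",
       distribution={
           "smdistributed": {"modelparallel": smp_options},
           "mpi": {"enabled": True, "processes_per_host": 8},
       },
   )
   estimator.fit("s3://<bucket>/<training-data-prefix>")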
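
Below is a minimal sketch of writing and reloading a full checkpoint with
``smp.save_checkpoint``. The path, tag, and the toy model are illustrative
placeholders; see the checkpoint API documentation for the complete argument
list.

.. code:: python

   import torch
   import smdistributed.modelparallel.torch as smp

   smp.init()  # assumes the job was launched with a model parallel configuration

   # Placeholder model wrapped for the library; in practice this is your training model.
   model = smp.DistributedModel(torch.nn.Linear(1024, 1024))

   # Save one full (unsharded) checkpoint instead of per-rank partial files.
   smp.save_checkpoint(
       path="/opt/ml/checkpoints",
       tag="full_model",
       partial=False,
       model=model,
   )

   # Later, resume from the full checkpoint.
   smp.resume_from_checkpoint("/opt/ml/checkpoints", tag="full_model", partial=False)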
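
As a rough illustration, the snippet below constructs a transformer layer with
ALiBi enabled. It assumes the library has already been initialized with a
tensor parallel configuration, and the size-related arguments are placeholder
values rather than recommendations.

.. code:: python

   import smdistributed.modelparallel.torch as smp

   # Placeholder sizes; ``use_alibi=True`` switches on ALiBi position biases
   # backed by the Triton-based flash attention kernels.
   layer = smp.nn.DistributedTransformerLayer(
       hidden_size=1024,
       num_attention_heads=16,
       attention_head_size=64,
       use_alibi=True,
       alibi_bias_max=8,   # base value for ALiBi mask generation (library default)
   )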

**Bug Fixes**

* When using tensor parallelism, parameters were unnecessarily initialized
multiple times. This release fixes the issue so that each parameter is
initialized exactly once, which saves time and keeps the random generator
behavior consistent with the non-tensor-parallelism case.

**Known Issues**

* Model initialization might take longer with PyTorch 2.0 than with PyTorch 1.13.

**Migration to AWS Deep Learning Containers**

This version passed benchmark testing and has been migrated to the following AWS Deep Learning Containers (DLC):

- SageMaker training container for PyTorch v2.0.0

.. code::

763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker

- SageMaker training container for PyTorch v1.13.1

.. code::

763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker

Binary file of this version of the library for `custom container
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_ users:

- For PyTorch v2.0.0

.. code::

https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-2.0.0/build-artifacts/2023-04-14-20-14/smdistributed_modelparallel-1.15.0-cp310-cp310-linux_x86_64.whl

- For PyTorch v1.13.1

.. code::

https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-04-17-15-49/smdistributed_modelparallel-1.15.0-cp39-cp39-linux_x86_64.whl

----

Release History
===============

SageMaker Distributed Model Parallel 1.14.0 Release Notes
---------------------------------------------------------

*Date: Jan. 30. 2023*

**Currency Updates**
Binary file of this version of the library for `custom container
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_ users:

- For PyTorch v1.13.1

.. code::

https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-01-19-18-35/smdistributed_modelparallel-1.14.0-cp39-cp39-linux_x86_64.whl


SageMaker Distributed Model Parallel 1.13.0 Release Notes
---------------------------------------------------------
doc/api/training/smp_versions/latest.rst: 4 changes (2 additions & 2 deletions)
To use the library, reference the
**Common API** documentation alongside the framework specific API documentation.

Version 1.11.0, 1.13.0, 1.14.0, 1.15.0 (Latest)
===============================================

To use the library, reference the Common API documentation alongside the framework specific API documentation.

Tensor Parallelism Module APIs
- ``post_layernorm``: If ``True``, inserts layer normalization at
the output. At least one of ``pre_layernorm`` and
``post_layernorm`` must be ``True``.
- ``use_alibi`` (bool, default False): Activates Attention with
Linear Biases (ALiBi) for attention computation.
ALiBi facilitates efficient extrapolation on input sequences
and thus improves training efficiency.
The library enables ALiBi by using the `Triton
flash attention kernel
<https://github.com/HazyResearch/flash-attention>`_.
Refer to https://arxiv.org/abs/2108.12409 for more
details on the technique.
(Available from
the SageMaker model parallelism library v1.15.0.)
- ``alibi_bias_max`` (int, default 8): Defines the ALiBi base
value for mask generation. (Available from
the SageMaker model parallelism library v1.15.0.)

- **Methods:**
