
Commit 0eade55

documentation: SM model parallel library v1.15.0 release note (#3806)
1 parent 9dbead8 commit 0eade55

File tree: 3 files changed, +94 -8 lines changed


doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst (+78 -6)

@@ -3,12 +3,88 @@ Release Notes
 #############

 New features, bug fixes, and improvements are regularly made to the SageMaker
-distributed model parallel library.
+model parallelism library.


-SageMaker Distributed Model Parallel 1.14.0 Release Notes
+SageMaker Distributed Model Parallel 1.15.0 Release Notes
 =========================================================

+*Date: Apr. 27. 2023*
+
+**Currency Updates**
+
+* Added support for PyTorch v2.0.0.
+  Note that the library does not support ``torch.compile`` in this release.
+
+**New Features**
+
+* Sharded data parallelism can now be used together with tensor parallelism
+  for PyTorch 1.13.1. This allows you to train with smaller global batch
+  sizes while scaling up to large clusters. For more information, see `Sharded
+  data parallelism with tensor parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism>`_
+  in the *Amazon SageMaker Developer Guide*.
+* Added support for saving and loading full model checkpoints when using sharded
+  data parallelism. This is enabled by using the standard checkpointing API,
+  ``smp.save_checkpoint`` with ``partial=False``.
+  Previously, full checkpoints had to be created by merging partial checkpoint
+  files after training finished.
+* `DistributedTransformer <https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.html#smdistributed.modelparallel.torch.nn.DistributedTransformerLayer>`_
+  now supports ALiBi position embeddings.
+  When using DistributedTransformer, you can set the ``use_alibi`` parameter
+  to ``True`` to use the Triton-based flash attention kernels. This helps
+  evaluate sequences longer than those used for training.
+
+**Bug Fixes**
+
+* When using tensor parallelism, parameters were initialized multiple times
+  unnecessarily. This release fixes the multiple initialization of parameters
+  so that each parameter is initialized exactly once.
+  This not only saves time, but also ensures that the random generator behavior
+  is similar to the non-tensor-parallelism case.
+
+**Known Issues**
+
+* Model initialization might take longer with PyTorch 2.0 than with PyTorch 1.13.
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
+
+- SageMaker training container for PyTorch v2.0.0
+
+  .. code::
+
+     763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker
+
+- SageMaker training container for PyTorch v1.13.1
+
+  .. code::
+
+     763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
+
+Binary file of this version of the library for `custom container
+<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_ users:
+
+- For PyTorch v2.0.0
+
+  .. code::
+
+     https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-2.0.0/build-artifacts/2023-04-14-20-14/smdistributed_modelparallel-1.15.0-cp310-cp310-linux_x86_64.whl
+
+- For PyTorch v1.13.1
+
+  .. code::
+
+     https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-04-17-15-49/smdistributed_modelparallel-1.15.0-cp39-cp39-linux_x86_64.whl
+
+----
+
+Release History
+===============
+
+SageMaker Distributed Model Parallel 1.14.0 Release Notes
+---------------------------------------------------------
+
 *Date: Jan. 30. 2023*

 **Currency Updates**
@@ -39,10 +115,6 @@ Binary file of this version of the library for `custom container

    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-01-19-18-35/smdistributed_modelparallel-1.14.0-cp39-cp39-linux_x86_64.whl

-----
-
-Release History
-===============

 SageMaker Distributed Model Parallel 1.13.0 Release Notes
 ---------------------------------------------------------
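
The sharded data parallelism with tensor parallelism feature in the release note above is configured through the SageMaker Python SDK estimator's ``distribution`` argument. The following is a minimal sketch, assuming the parameter names ``sharded_data_parallel_degree`` and ``tensor_parallel_degree`` documented in the linked Developer Guide page; the script name, role, degrees, and instance settings are placeholders to adapt.

.. code:: python

   # Sketch only: launch an SMP v1.15.0 job that combines sharded data
   # parallelism with tensor parallelism (PyTorch 1.13.1). Verify parameter
   # names against the Developer Guide page linked in the release note above.
   from sagemaker.pytorch import PyTorch

   smp_options = {
       "enabled": True,
       "parameters": {
           "ddp": True,
           "sharded_data_parallel_degree": 8,  # shard model states across 8 ranks
           "tensor_parallel_degree": 2,        # usable together with sharding as of v1.15.0
       },
   }

   estimator = PyTorch(
       entry_point="train.py",                 # placeholder training script
       role="<your-iam-role-arn>",             # placeholder IAM role
       instance_type="ml.p4d.24xlarge",
       instance_count=2,                       # 16 GPUs total = 8 (sharding) x 2 (tensor parallel)
       framework_version="1.13.1",             # matches the PyTorch 1.13.1 DLC listed above
       py_version="py39",
       distribution={
           "smdistributed": {"modelparallel": smp_options},
           "mpi": {"enabled": True, "processes_per_host": 8},
       },
   )
   estimator.fit()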
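
The full-checkpoint support described above uses the library's standard checkpointing API. A minimal sketch, assuming the documented ``smp.save_checkpoint`` and ``smp.resume_from_checkpoint`` signatures and a model already wrapped by the library; the path and tag are placeholders.

.. code:: python

   # Sketch only: save and reload a full (non-partial) checkpoint under
   # sharded data parallelism, new in SMP v1.15.0.
   import torch
   import torch.nn as nn
   import smdistributed.modelparallel.torch as smp

   smp.init()  # configuration comes from the training job's distribution settings

   model = smp.DistributedModel(nn.Linear(1024, 1024))
   optimizer = smp.DistributedOptimizer(torch.optim.Adam(model.parameters(), lr=1e-4))

   # ... training loop elided ...

   # partial=False writes a single consolidated checkpoint instead of
   # per-rank partial files, so no post-training merge step is needed.
   smp.save_checkpoint(
       path="/opt/ml/checkpoints",
       tag="epoch_10",
       partial=False,
       model=model,
       optimizer=optimizer,
   )

   # The full checkpoint can later be loaded back through the matching API.
   smp.resume_from_checkpoint("/opt/ml/checkpoints", tag="epoch_10", partial=False)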

doc/api/training/smp_versions/latest.rst (+2 -2)

@@ -10,8 +10,8 @@ depending on which version of the library you need to use.
 To use the library, reference the
 **Common API** documentation alongside the framework specific API documentation.

-Version 1.11.0, 1.13.0, 1.14.0 (Latest)
-=======================================
+Version 1.11.0, 1.13.0, 1.14.0, 1.15.0 (Latest)
+===============================================

 To use the library, reference the Common API documentation alongside the framework specific API documentation.

doc/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.rst (+14 -0)

@@ -302,6 +302,20 @@ Tensor Parallelism Module APIs
 - ``post_layernorm``: If ``True``, inserts layer normalization at
   the output. At least one of ``pre_layernorm`` and
   ``post_layernorm`` must be ``True``.
+- ``use_alibi`` (bool, default False): Activates Attention with
+  Linear Biases (ALiBi) for attention computation.
+  ALiBi facilitates efficient extrapolation on input sequences
+  and thus improves training efficiency.
+  The library enables ALiBi by using the `Triton
+  flash attention kernel
+  <https://github.com/HazyResearch/flash-attention>`_.
+  Refer to https://arxiv.org/abs/2108.12409 for more
+  details on the technique.
+  (Available from
+  the SageMaker model parallelism library v1.15.0.)
+- ``alibi_bias_max`` (int, default 8): Defines the ALiBi base
+  value for mask generation. (Available from
+  the SageMaker model parallelism library v1.15.0.)

 - **Methods:**
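
To illustrate the two parameters added above, here is a minimal sketch of constructing ``DistributedTransformerLayer`` with ALiBi enabled; the sizes are illustrative, and the other arguments follow the parameter list documented in this file.

.. code:: python

   # Sketch only: enable ALiBi (v1.15.0+) on a tensor-parallel transformer layer.
   import smdistributed.modelparallel.torch as smp

   smp.init()  # tensor parallelism degree comes from the job's distribution settings

   layer = smp.nn.DistributedTransformerLayer(
       hidden_size=4096,
       num_attention_heads=32,
       intermediate_size=16384,
       pre_layernorm=True,      # at least one of pre/post layernorm must be True
       post_layernorm=False,
       use_alibi=True,          # new in v1.15.0: Triton flash-attention kernel with ALiBi biases
       alibi_bias_max=8,        # new in v1.15.0: base value for ALiBi mask generation
   )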
