documentation: smdistributed libraries release notes #3543

Merged
merged 4 commits into from
Dec 15, 2022
4 changes: 2 additions & 2 deletions doc/api/training/sdp_versions/latest.rst
@@ -26,8 +26,8 @@ depending on the version of the library you use.
<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`_
for more information.

Version 1.4.0, 1.4.1, 1.5.0 (Latest)
====================================
Version 1.4.0, 1.4.1, 1.5.0, 1.6.0 (Latest)
===========================================

.. toctree::
:maxdepth: 1
@@ -7,9 +7,47 @@ Release Notes
New features, bug fixes, and improvements are regularly made to the SageMaker
distributed data parallel library.

SageMaker Distributed Data Parallel 1.5.0 Release Notes
SageMaker Distributed Data Parallel 1.6.0 Release Notes
=======================================================

*Date: Dec. 15. 2022*

**New Features**

* Added SMDDP Collectives support for the AllGather operation used by the SageMaker model parallelism library’s sharded data parallelism. For more information, see `Sharded data parallelism with SMDDP Collectives <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html#model-parallel-extended-features-pytorch-sharded-data-parallelism-smddp-collectives>`_ in the Amazon SageMaker Developer Guide.
* Added support for Amazon EC2 ml.p4de.24xlarge instances.

**Improvements**

* Improved general performance of the SMDDP AllReduce collective communication operation.

wanna check the wording - "general performance" vs "general improvements" is different, @Arjunbala?

Arjun first suggested "General performance improvements to SMDDP All-Reduce collective". Let me know if we just want to keep it as he suggested.

I'm going to just directly use Arjun's suggestion right now.

It's both, so we're good :)


**Migration to AWS Deep Learning Containers**

This version passed benchmark testing and has been migrated to the following AWS Deep Learning Containers (DLC):

- SageMaker training container for PyTorch v1.12.1

.. code::

763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker
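
The ``<region>`` placeholder in the image URI above has to be replaced with the AWS Region the training job runs in. A minimal sketch of that substitution follows; the helper name ``dlc_image_uri`` is hypothetical and not part of any SageMaker API. It assumes the account ID shown above applies to the target Region, which may not hold for every Region, so check the DLC image list before relying on it.

```python
# Template copied from the release note, with <region> turned into a format field.
ECR_TEMPLATE = (
    "763104351884.dkr.ecr.{region}.amazonaws.com/"
    "pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker"
)


def dlc_image_uri(region: str) -> str:
    """Return the PyTorch 1.12.1 training image URI for a given Region.

    Hypothetical helper: it only substitutes the Region name and does not
    verify that the image exists in that Region.
    """
    region = region.strip()
    if not region or "<" in region or ">" in region:
        raise ValueError("pass a concrete Region name such as 'us-west-2'")
    return ECR_TEMPLATE.format(region=region)


if __name__ == "__main__":
    print(dlc_image_uri("us-west-2"))
```

The resulting string is what would typically be passed as a custom ``image_uri`` when the default image lookup is not used.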


Binary file of this version of the library for `custom container
<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-bring-your-own-container>`_ users:

.. code::

https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.1/cu113/2022-12-05/smdistributed_dataparallel-1.6.0-cp38-cp38-linux_x86_64.whl
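
Custom container users may want to confirm that the wheel matches their interpreter before installing it. The sketch below reads the compatibility tags out of the wheel file name, which follows the standard ``name-version-python-abi-platform.whl`` layout. The function name ``wheel_tags`` is hypothetical, and the check is only a file-name inspection, not a substitute for pip's own resolver.

```python
import posixpath
from urllib.parse import urlparse

# URL copied from the release note above.
WHEEL_URL = (
    "https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.1/cu113/"
    "2022-12-05/smdistributed_dataparallel-1.6.0-cp38-cp38-linux_x86_64.whl"
)


def wheel_tags(url: str) -> dict:
    """Split a wheel URL's file name into its naming-convention fields."""
    filename = posixpath.basename(urlparse(url).path)
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    # A wheel name has 5 fields, or 6 when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel file name: {filename!r}")
    return {
        "name": parts[0],
        "version": parts[1],
        "python": parts[-3],
        "abi": parts[-2],
        "platform": parts[-1],
    }


if __name__ == "__main__":
    tags = wheel_tags(WHEEL_URL)
    # cp38 means the wheel targets CPython 3.8 on Linux x86_64.
    print(tags)
```

For this release the tags indicate CPython 3.8 on ``linux_x86_64``, which lines up with the ``py38`` image tag listed above.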


----

Release History
===============

SageMaker Distributed Data Parallel 1.5.0 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*Date: Jul. 26. 2022*

**Currency Updates**
@@ -38,12 +76,6 @@ Binary file of this version of the library for `custom container

https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed_dataparallel-1.5.0-cp38-cp38-linux_x86_64.whl


----

Release History
===============

SageMaker Distributed Data Parallel 1.4.1 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -6,9 +6,61 @@ New features, bug fixes, and improvements are regularly made to the SageMaker
distributed model parallel library.


SageMaker Distributed Model Parallel 1.11.0 Release Notes
SageMaker Distributed Model Parallel 1.13.0 Release Notes
=========================================================

*Date: Dec. 15. 2022*

**New Features**

* Sharded data parallelism now supports a new backend for collectives, SMDDP. For supported scenarios
this is used by default for AllGather. For more information, see
`Sharded data parallelism with SMDDP Collectives
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html#model-parallel-extended-features-pytorch-sharded-data-parallelism-smddp-collectives>`_
in the Amazon SageMaker Developer Guide.
* Introduced FlashAttention for DistributedTransformer to improve memory usage and computational
performance of models such as GPT2, GPTNeo, GPTJ, GPTNeoX, BERT, and RoBERTa.
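
As an illustration of the first item, the sketch below builds the ``distribution`` dictionary that a SageMaker PyTorch estimator would typically receive when sharded data parallelism is turned on. The parameter names ``ddp``, ``sharded_data_parallel_degree``, and ``ddp_dist_backend`` are recalled from the linked Developer Guide page and should be verified there. The helper name, the degree of 8, and the ``processes_per_host`` value are illustrative assumptions.

```python
def smp_sharded_dp_options(sharded_degree: int) -> dict:
    """Build model parallel options with sharded data parallelism enabled.

    Hypothetical helper; the parameter names are assumptions taken from the
    Developer Guide and are not validated against the library here.
    """
    if sharded_degree < 1:
        raise ValueError("sharded_degree must be a positive integer")
    return {
        "enabled": True,
        "parameters": {
            "ddp": True,
            "sharded_data_parallel_degree": sharded_degree,
            # Assumed setting: "auto" lets the library pick SMDDP Collectives
            # in supported scenarios and fall back to NCCL otherwise.
            "ddp_dist_backend": "auto",
        },
    }


# Illustrative values: 8 GPUs per host, sharding across 8 ranks.
distribution = {
    "smdistributed": {"modelparallel": smp_sharded_dp_options(8)},
    "mpi": {"enabled": True, "processes_per_host": 8},
}

if __name__ == "__main__":
    print(distribution)
```

Because the release note says SMDDP is used by default for AllGather in supported scenarios, the backend entry is likely optional; it is spelled out here only to make the choice visible.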

**Bug Fixes**

* Fixed initialization of lm_head in DistributedTransformer to use a provided range
for initialization, when weights are not tied with the embeddings.

**Improvements**

* Introduced an optimization for pipeline parallelism that executes a module
with no parameters on the same rank as its parent.



**Migration to AWS Deep Learning Containers**

This version passed benchmark testing and has been migrated to the following AWS Deep Learning Containers (DLC):

- SageMaker training container for PyTorch v1.12.1

.. code::

763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker

Binary file of this version of the library for `custom container
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_ users:

- For PyTorch 1.12.1

.. code::

https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.1/build-artifacts/2022-12-08-21-34/smdistributed_modelparallel-1.13.0-cp38-cp38-linux_x86_64.whl

----

Release History
===============

SageMaker Distributed Model Parallel 1.11.0 Release Notes
---------------------------------------------------------

*Date: August. 17. 2022*

**New Features**
@@ -41,12 +93,7 @@ Binary file of this version of the library for `custom container

.. code::

https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-08-12-16-58/smdistributed_modelparallel-1.11.0-cp38-cp38-linux_x86_64.whl

----

Release History
===============
SageMaker Distributed Model Parallel 1.10.1 Release Notes
---------------------------------------------------------
4 changes: 2 additions & 2 deletions doc/api/training/smp_versions/latest.rst
@@ -10,8 +10,8 @@ depending on which version of the library you need to use.
To use the library, reference the
**Common API** documentation alongside the framework specific API documentation.

Version 1.11.0 (Latest)
===========================================
Version 1.11.0, 1.13.0 (Latest)
===============================

To use the library, reference the Common API documentation alongside the framework specific API documentation.
