From c9db4ccae359b293c4e2b30b5af64e905c29db25 Mon Sep 17 00:00:00 2001
From: Miyoung Choi
Date: Fri, 15 Jul 2022 16:11:46 -0700
Subject: [PATCH 1/5] add smdmp v1.10 release note

---
 .../smd_model_parallel_change_log.rst | 54 ++++++++++++++++---
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst b/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst
index d65efd5022..3a1a4df0c3 100644
--- a/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst
+++ b/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst
@@ -5,14 +5,29 @@ Release Notes
 New features, bug fixes, and improvements are regularly made to the SageMaker
 distributed model parallel library.

-SageMaker Distributed Model Parallel 1.9.0 Release Notes
-========================================================
+SageMaker Distributed Model Parallel 1.10.0 Release Notes
+=========================================================

-*Date: May. 3. 2022*
+*Date: July. 19. 2022*

-**Currency Updates**
+**New Features**

-* Added support for PyTorch 1.11.0
+The following new features are added for PyTorch.
+
+* Added support for FP16 training by implementing smdistributed.modelparallel
+  modification of Apex FP16_Module and FP16_Optimizer. To learn more, see
+  ` `_.
+* New checkpoint APIs for CPU memory usage optimization. To learn more, see
+  ` `_.
+
+**Improvements**
+
+* The SageMaker distributed model parallel library manages and optimizes CPU
+  memory by garbage-collecting non-local parameters in general and during checkpointing.
+* Changes in the `GPT-2 translate functions
+  `_
+  (``smdistributed.modelparallel.torch.nn.huggingface.gpt2``)
+  to save memory by not maintaining two copies of weights at the same time.

 **Migration to AWS Deep Learning Containers**

 This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):

 - PyTorch 1.11.0 DLC

   .. code::

     763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker

 Binary file of this version of the library for custom container users:

   .. code::

-    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-04-20-17-05/smdistributed_modelparallel-1.9.0-cp38-cp38-linux_x86_64.whl
+    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-07-11-19-23/smdistributed_modelparallel-1.10.0-cp38-cp38-linux_x86_64.whl

@@ -37,6 +52,33 @@
 Release History
 ===============

+SageMaker Distributed Model Parallel 1.9.0 Release Notes
+--------------------------------------------------------
+
+*Date: May. 3. 2022*
+
+**Currency Updates**
+
+* Added support for PyTorch 1.11.0
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
+
+- PyTorch 1.11.0 DLC
+
+  .. code::
+
+    763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker
+
+Binary file of this version of the library for custom container users:
+
+  .. code::
+
+    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-04-20-17-05/smdistributed_modelparallel-1.9.0-cp38-cp38-linux_x86_64.whl
+
+
+
 SageMaker Distributed Model Parallel 1.8.1 Release Notes
 --------------------------------------------------------
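A note on the FP16 feature added in PATCH 1: below is a minimal sketch of how a training script might use it, assuming ``"fp16": True`` is enabled in the job's model-parallel parameters. ``ToyModel`` is a hypothetical stand-in, and the ``dynamic_loss_scale`` keyword is an assumption patterned on the Apex ``FP16_Optimizer`` options this release wraps; see the linked FP16 documentation for the authoritative interface.

.. code:: python

    import torch
    import torch.nn as nn
    import smdistributed.modelparallel.torch as smp

    class ToyModel(nn.Module):
        """Hypothetical stand-in for a real partitioned model."""
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(32, 10)

        def forward(self, x):
            return torch.log_softmax(self.fc(x), dim=1)

    smp.init()  # reads the smp config (including fp16) from the training job

    model = smp.DistributedModel(ToyModel())
    optimizer = smp.DistributedOptimizer(
        torch.optim.Adam(model.parameters(), lr=1e-4),
        dynamic_loss_scale=True,  # assumed kwarg, mirroring Apex FP16_Optimizer loss scaling
    )

    @smp.step
    def train_step(model, data, target):
        loss = nn.functional.nll_loss(model(data), target)
        model.backward(loss)  # backward goes through the model so smp can apply loss scaling
        return loss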
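Likewise, a hedged sketch of the new checkpoint APIs, continuing from the snippet above (``model`` and ``optimizer`` come from there). The argument names are assumptions drawn from the checkpoint documentation the patch links to; the path and tag are placeholders.

.. code:: python

    import smdistributed.modelparallel.torch as smp

    # Save a partial (per-rank) checkpoint; avoiding a full gather of the
    # model on one process is what reduces CPU memory usage.
    smp.save_checkpoint(
        path="/opt/ml/checkpoints",   # placeholder location
        tag="ckpt-epoch-10",          # placeholder tag
        partial=True,
        model=model,
        optimizer=optimizer,
        user_content={"epoch": 10},   # optional extra state (assumed kwarg)
    )

    # Resume later from the same location and tag; assumed to hand back
    # the saved user content.
    user_content = smp.resume_from_checkpoint(
        path="/opt/ml/checkpoints",
        tag="ckpt-epoch-10",
        partial=True,
    )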
From 9633c88512bbc709235f5f2b2817449bb9e08a8b Mon Sep 17 00:00:00 2001
From: Miyoung Choi
Date: Fri, 15 Jul 2022 16:14:06 -0700
Subject: [PATCH 2/5] add link title

---
 .../smd_model_parallel_change_log.rst | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst b/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst
index 3a1a4df0c3..12ed10049a 100644
--- a/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst
+++ b/doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst
@@ -16,9 +16,11 @@ The following new features are added for PyTorch.

 * Added support for FP16 training by implementing smdistributed.modelparallel
   modification of Apex FP16_Module and FP16_Optimizer. To learn more, see
-  ` `_.
+  `FP16 Training with Model Parallelism
+  `_.
 * New checkpoint APIs for CPU memory usage optimization. To learn more, see
-  ` `_.
+  `Checkpointing Distributed Models and Optimizer States
+  `_.

From b1c4a106b9436ca0b8eb8341347120ddfffc7732 Mon Sep 17 00:00:00 2001
From: Miyoung Choi
Date: Mon, 18 Jul 2022 11:31:33 -0700
Subject: [PATCH 3/5] Trigger Build

From 2e5c19be0ab42a6f9358fcfbe38ad3d4b636f3d6 Mon Sep 17 00:00:00 2001
From: Miyoung Choi
Date: Thu, 21 Jul 2022 11:15:30 -0700
Subject: [PATCH 4/5] Trigger Build

From ba21266fcd0cf038f40156f2180fccbb9bd6b6ac Mon Sep 17 00:00:00 2001
From: Miyoung Choi
Date: Mon, 25 Jul 2022 10:21:21 -0700
Subject: [PATCH 5/5] Trigger Build
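For readers applying these release notes: both the DLC image and the model-parallel options above are consumed through the SageMaker Python SDK. Below is a minimal launch sketch, assuming a ``train.py`` entry point and an existing IAM role; the parallelism values are placeholders, and the ``"fp16"`` key reflects the v1.10 feature from PATCH 1 rather than a confirmed parameter name.

.. code:: python

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",  # placeholder training script
        role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder IAM role
        instance_count=1,
        instance_type="ml.p3.16xlarge",
        framework_version="1.11.0",  # matches the PyTorch 1.11.0 DLC listed above
        py_version="py38",
        distribution={
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {
                        "partitions": 2,    # placeholder pipeline-parallel degree
                        "microbatches": 4,  # placeholder
                        "ddp": True,
                        "fp16": True,       # assumed config key for the new FP16 feature
                    },
                }
            },
            "mpi": {"enabled": True, "processes_per_host": 8},
        },
    )
    estimator.fit("s3://amzn-s3-demo-bucket/train/")  # placeholder dataset location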