
Commit b3660b9

mchoi8739, navinsoni, jeniyat, qidewenwhen, and bencrabtree authored
documentation: sagemaker distributed model parallel 1.7.0 doc (#2992)
* change: update code to get commit_id in codepipeline (#2961)
* feature: Data Serializer (#2956)
* change: reorganize test files for workflow (#2960)
  Co-authored-by: Ben Crabtree <[email protected]>
  Co-authored-by: Navin Soni <[email protected]>
  Co-authored-by: Jeniya Tabassum <[email protected]>
  Co-authored-by: Dewen Qi <[email protected]>
* feature: TensorFlow 2.4 for Neo (#2861)
  Co-authored-by: Ben Crabtree <[email protected]>
  Co-authored-by: Navin Soni <[email protected]>
  Co-authored-by: Jeniya Tabassum <[email protected]>
* fix: Remove sagemaker_job_name from hyperparameters in TrainingStep (#2950)
  Co-authored-by: Payton Staub <[email protected]>
* fix: Style update in DataSerializer (#2962)
* documentation: smddp doc update (#2968)
* fix: container env generation for S3 URI and add test for the same (#2971)
* documentation: update sagemaker training compiler docstring (#2969)
* feat: Python 3.9 for readthedocs (#2973)
  Co-authored-by: Ben Crabtree <[email protected]>
  Co-authored-by: Navin Soni <[email protected]>
  Co-authored-by: Jeniya Tabassum <[email protected]>
  Co-authored-by: Dewen Qi <[email protected]>
  Co-authored-by: Payton Staub <[email protected]>
  Co-authored-by: qidewenwhen <[email protected]>
  Co-authored-by: Qingzi-Lan <[email protected]>
  Co-authored-by: Payton Staub <[email protected]>
* fix doc structure
* archive 1.6.0 doc
* add new args, refs, and links
* fix version number
* incorp eng feedback, update docstrings, improve xref
* Trigger Build
* minor fix, trigger build again
* fix typo

Co-authored-by: Navin Soni <[email protected]>
Co-authored-by: Jeniya Tabassum <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>
Co-authored-by: Ben Crabtree <[email protected]>
Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Qingzi-Lan <[email protected]>
Co-authored-by: Payton Staub <[email protected]>
Co-authored-by: Payton Staub <[email protected]>
Co-authored-by: Shreya Pandit <[email protected]>
Co-authored-by: Ahsan Khan <[email protected]>
Co-authored-by: Mufaddal Rohawala <[email protected]>
1 parent 2eace35 commit b3660b9

12 files changed (+2360, -47 lines)

doc/api/training/distributed.rst (+3)
@@ -26,3 +26,6 @@ The SageMaker Distributed Model Parallel Library
      :maxdepth: 3

      smd_model_parallel
+     smp_versions/latest
+     smd_model_parallel_general
+     smd_model_parallel_release_notes/smd_model_parallel_change_log

doc/api/training/smd_model_parallel.rst (-20)
@@ -9,15 +9,6 @@ allowing you to increase prediction accuracy by creating larger models with more
  You can use the library to automatically partition your existing TensorFlow and PyTorch workloads
  across multiple GPUs with minimal code changes. The library's API can be accessed through the Amazon SageMaker SDK.

- See the following sections to learn more about the SageMaker model parallel library APIs.
-
- .. toctree::
-    :maxdepth: 3
-
-    smp_versions/latest
-    smd_model_parallel_general
-
-
  .. tip::

     We recommended using this API documentation with the conceptual guide at
@@ -48,14 +39,3 @@ See the following sections to learn more about the SageMaker model parallel libr
  `Extend or Adapt A Docker Container that Contains the Model Parallel Library
  <https://integ-docs-aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html#model-parallel-customize-container>`__
  for more information.
-
- Release Notes
- =============
-
- New features, bug fixes, and improvements are regularly made to the SageMaker
- distributed model parallel library.
-
- .. toctree::
-    :maxdepth: 1
-
-    smd_model_parallel_release_notes/smd_model_parallel_change_log

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst (+67, -14)
@@ -1,6 +1,62 @@
- Sagemaker Distributed Model Parallel 1.6.0 Release Notes
+ #############
+ Release Notes
+ #############
+
+ New features, bug fixes, and improvements are regularly made to the SageMaker
+ distributed model parallel library.
+
+ SageMaker Distributed Model Parallel 1.7.0 Release Notes
  ========================================================

+ *Date: March. 07. 2022*
+
+ **Currency Updates**
+
+ * Support for PyTorch 1.10.2
+ * Support for Hugging Face Transformers 4.16.2
+
+ **Improvements**
+
+ * Additional support for the :ref:`smdmp-pytorch-tensor-parallel`.
+
+ * Added support for FP32 residual addition to avoid overflow (NaN loss values)
+   for large models with more than 100 billion parameters when using FP16.
+   This is integrated to the following module:
+
+   * :class:`smp.nn.DistributedTransformerOutputLayer`
+
+ * Added support for the following two `NVIDIA Megatron fused kernels
+   <https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/fused_kernels>`_:
+
+   * Fusion of attention masking and softmax (``fused_softmax``)
+   * Fusion of bias addition and Gelu activation (``fused_bias_gelu``)
+
+   To learn more about these options and how to use them,
+   see the :class:`smp.tensor_parallelism` context manager.
+
+ **Migration to AWS Deep Learning Containers**
+
+ This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:
+
+ * PyTorch 1.10.2
+
+   .. code::
+
+      763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
+
+ ----
+
+ Release History
+ ===============
+
+ SageMaker Distributed Model Parallel 1.6.0 Release Notes
+ --------------------------------------------------------
+
  *Date: December. 20. 2021*

  **New Features**
@@ -9,10 +65,10 @@ Sagemaker Distributed Model Parallel 1.6.0 Release Notes
  - Added extended memory-saving features for PyTorch 1.8.1:

-   - Tensor parallelism
-   - Optimizer state sharding
-   - Activation checkpointing
-   - Activation offloading
+   - `Tensor parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html>`_
+   - `Optimizer state sharding <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-optimizer-state-sharding.html>`_
+   - `Activation checkpointing <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html>`_
+   - `Activation offloading <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html>`_

  For more information, see the following documentation:

@@ -30,12 +86,9 @@ AWS Deep Learning Container(s):
     763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04

- ----

- Release History
- ===============

- Sagemaker Distributed Model Parallel 1.5.0 Release Notes
+ SageMaker Distributed Model Parallel 1.5.0 Release Notes
  --------------------------------------------------------

  *Date: November. 03. 2021*
@@ -59,7 +112,7 @@ AWS Deep Learning Containers:
  ----

- Sagemaker Distributed Model Parallel 1.4.0 Release Notes
+ SageMaker Distributed Model Parallel 1.4.0 Release Notes
  --------------------------------------------------------

  *Date: June. 29. 2021*
@@ -90,7 +143,7 @@ AWS Deep Learning Containers:
  ----

- Sagemaker Distributed Model Parallel 1.3.1 Release Notes
+ SageMaker Distributed Model Parallel 1.3.1 Release Notes
  --------------------------------------------------------

  - New Features
@@ -143,7 +196,7 @@ Sagemaker Distributed Model Parallel 1.3.1 Release Notes
  ----

- Sagemaker Distributed Model Parallel 1.3.0 Release Notes
+ SageMaker Distributed Model Parallel 1.3.0 Release Notes
  --------------------------------------------------------

  - New Features
@@ -235,7 +288,7 @@ Sagemaker Distributed Model Parallel 1.3.0 Release Notes
  ----

- Sagemaker Distributed Model Parallel 1.2.0 Release Notes
+ SageMaker Distributed Model Parallel 1.2.0 Release Notes
  --------------------------------------------------------

  - New Features
@@ -312,7 +365,7 @@ Sagemaker Distributed Model Parallel 1.2.0 Release Notes
  ----

- Sagemaker Distributed Model Parallel 1.1.0 Release Notes
+ SageMaker Distributed Model Parallel 1.1.0 Release Notes
  --------------------------------------------------------

  - New Features
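The 1.7.0 notes above tie the library to the PyTorch 1.10.2 Deep Learning Container. As a rough, hypothetical sketch (not part of this commit), a SageMaker PyTorch estimator that targets that container and turns the model parallel library on could look like the following; the entry point, IAM role, instance settings, and ``parameters`` values are illustrative placeholders.

.. code:: python

   # Hypothetical launcher sketch, not part of this commit: enable the SageMaker
   # model parallel library on the PyTorch 1.10.2 container named in the release notes.
   from sagemaker.pytorch import PyTorch

   smp_options = {
       "enabled": True,
       # Example values only; tune partitions/tensor parallel degree for your model.
       "parameters": {"partitions": 2, "ddp": True, "tensor_parallel_degree": 2},
   }

   estimator = PyTorch(
       entry_point="train.py",                      # placeholder training script
       role="<your-sagemaker-execution-role-arn>",  # placeholder IAM role
       instance_count=1,
       instance_type="ml.p4d.24xlarge",
       framework_version="1.10",                    # resolves to the PyTorch 1.10 DLC
       py_version="py38",
       distribution={
           "smdistributed": {"modelparallel": smp_options},
           "mpi": {"enabled": True, "processes_per_host": 8},
       },
   )
   # estimator.fit()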

doc/api/training/smp_versions/archives.rst (+1)
@@ -3,6 +3,7 @@
  .. toctree::
     :maxdepth: 1

+    v1_6_0.rst
     v1_5_0.rst
     v1_4_0.rst
     v1_3_0.rst

doc/api/training/smp_versions/latest.rst (+1, -1)
@@ -10,7 +10,7 @@ depending on which version of the library you need to use.
  To use the library, reference the
  **Common API** documentation alongside the framework specific API documentation.

- Version 1.6.0 (Latest)
+ Version 1.7.0 (Latest)
  ======================

  To use the library, reference the Common API documentation alongside the framework specific API documentation.

doc/api/training/smp_versions/latest/smd_model_parallel_pytorch.rst (+1, -1)
@@ -16,7 +16,7 @@ you need to add the following import statement at the top of your training scrip
  <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script-pt.html>`_
  to learn how to use the following API in your PyTorch training script.

- .. py:class:: smp.DistributedModel()
+ .. class:: smp.DistributedModel

     A sub-class of ``torch.nn.Module`` which specifies the model to be
     partitioned. Accepts a ``torch.nn.Module`` object ``module`` which is
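For orientation only (not part of this diff), a minimal training-script sketch of the wrapper documented above might look like the following; the network, optimizer, and hyperparameters are placeholders, and ``smp.init()`` plus ``smp.DistributedOptimizer`` follow the library's usual PyTorch workflow.

.. code:: python

   # Minimal sketch of wrapping a model with smp.DistributedModel; the network,
   # optimizer, and settings are placeholders, not recommendations.
   import torch
   import torch.nn as nn
   import smdistributed.modelparallel.torch as smp

   smp.init()                                 # initialize the model parallel library

   net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
   model = smp.DistributedModel(net)          # the torch.nn.Module to be partitioned
   optimizer = smp.DistributedOptimizer(torch.optim.Adam(model.parameters(), lr=1e-3))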

doc/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.rst (+32, -11)
@@ -205,7 +205,7 @@ Tensor Parallelism Module APIs
       if ``add_lm_head`` is ``True``, the output passes through a single
       LM head, which is a linear module without bias whose weight is
       tied to the word embeddings.
-     - See ``DistributedTransformerLayer`` for a description of the rest
+     - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the rest
       of the arguments.
  - **Methods:**

@@ -344,11 +344,11 @@ Tensor Parallelism Module APIs
    followed by the residual-connection and layer normalization.
  - **Arguments:**

-    - See ``DistributedTransformerLayer`` for a description of the
+    - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the
       arguments.
-    - If ``cross_attention`` is ``True``, computes the attentions
+    - ``cross_attention``: If ``True``, it computes the attentions
       with respect to the ``cross_states`` tensor of the ``forward``
-      method input tuple.
+      method input tuple. (Default: ``False``)

  - **Methods:**

@@ -363,7 +363,7 @@ Tensor Parallelism Module APIs
       ``[N, S, H]``, where ``N`` is batch size, ``S`` is
       sequence length, and ``H`` is ``hidden_size``.
       ``attention_mask`` is assumed to be a tensor of
-      dimensions ``[N, 1, 1, S]``, \***\* where ``N`` is the
+      dimensions ``[N, 1, 1, S]``, where ``N`` is the
       batch size, and ``S`` is the sequence length.
     - If ``cross_attention=True``, ``inputs`` must be a tuple
       ``(hidden_states, cross_states, attention_mask)``, where
@@ -383,27 +383,30 @@ Tensor Parallelism Module APIs
     - A single tensor that is the output of the attention
       layer.

- .. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True)
+ .. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, fp32_residual_addition=False)

  - Distributed implementation of a single transformer output layer. A
-   single ``DistributedTransformerLayer`` with
+   single :class:`smp.nn.DistributedTransformerLayer` with
    ``add_cross_attention=False`` consists of a single
    ``DistributedAttentionLayer`` immediately followed by a single
    ``DistributedTransformerOutputLayer``. The latter linearly maps
    the last channel of the input tensor from ``hidden_size`` to
    ``intermediate_size``, and then maps it back to ``hidden_size``.
  - **Arguments:**

-    - See ``DistributedTransformerLayer`` for a description of the
+    - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the
       arguments.
+    - ``fp32_residual_addition``: Set to ``True`` if you want to avoid overflow
+      (NaN loss values) for large models with more than 100 billion parameters
+      when using FP16. (Default: False)

  .. class:: smp.nn.DistributedEmbedding(num_embeddings,embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, initializer_range=0.02, _skip_allgather=False,_skip_scatter_and_merge=False,)

  - Distributed implementation of a single Embedding Layer. Currently
    only supports splitting across the embedding_dim.
  - **Arguments:**

-    - See ``DistributedEmbedding`` for a description of the
+    - See :class:`smp.nn.DistributedEmbedding` for descriptions of the
       arguments.

  .. _enabling-tp:
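As an illustrative sketch of assumed usage (not taken from this commit), the new ``fp32_residual_addition`` switch could be passed when the output layer is constructed; the sizes below are placeholders and ``smp.init()`` is assumed to run inside a SageMaker model parallel job as usual.

.. code:: python

   # Assumed-usage sketch: construct the distributed output layer with the new
   # FP32 residual addition switch; sizes are placeholders.
   import smdistributed.modelparallel.torch as smp

   smp.init()  # assumed to run inside a SageMaker model parallel training job

   output_layer = smp.nn.DistributedTransformerOutputLayer(
       hidden_size=1024,
       intermediate_size=4096,
       activation="gelu",
       fp32_residual_addition=True,  # keep the residual addition in FP32 to avoid NaN loss with FP16
   )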
@@ -447,7 +450,7 @@ following API:
  - A context manager that enables or disables tensor parallelism for
    any supported module that is created inside. If there are nested
-   contexts, the innermost will override the rest. If there are
+   contexts, the innermost overrides the rest. If there are
    multiple supported modules created within the context, where one
    is the submodule of the other, only the outermost module will be
    distributed. If a supported module shares weights with another
@@ -465,7 +468,25 @@ following API:
        with smp.tensor_parallelism(enabled=False):
            self.m1 = nn.Linear(20, 20) # will not be distributed

- - Keyword arguments `kwargs` can be used to modify the configurations of the distributed modules created inside the context. If a keyword argument provided here matches any `__init__` method arguments of a `DistributedModule` that substitutes a module created inside the `smp.tensor_parallelism` context, this keyword will override the value defined in the `init_hook`.
+ - ``kwargs`` - Keyword arguments that can be used to modify the configurations of
+   the distributed modules created inside the context.
+   If a keyword argument provided through it matches any ``__init__`` method arguments
+   of a ``DistributedModule`` that substitutes a module created inside
+   the ``smp.tensor_parallelism`` context, this keyword will override
+   the value defined in the ``init_hook``.
+
+   - (*For v1.7.0 and later*) Through the following additional keyword arguments,
+     the library supports `NVIDIA Megatron’s fused kernels
+     <https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/fused_kernels>`_
+
+     - ``fused_softmax`` (bool) - Fusion of attention masking and softmax.
+       By default, it is set to ``True``. You can deactivate it by setting
+       ``fused_softmax=False`` in the ``smp.tensor_parallelism`` context manager.
+     - ``fused_bias_gelu`` (bool) - Fusion of bias addition and Gelu activation.
+       By default, it is set to ``False``. You can activate it by setting
+       ``fused_bias_gelu=True`` in the ``smp.tensor_parallelism`` context manager.

  .. function:: smp.set_tensor_parallelism(module, enabled=True, **kwargs)
