Commit e06dc26

mchoi8739, staubhp (Payton Staub), ahsan-z-khan, and mufaddal-rohawala authored and committed
documentation: SageMaker model parallel library 1.6.0 API doc (aws#2814)
* update smdmp change log, archive api doc for 1.4.0 and 1.5.0
* add no-index flags
* finish api doc archive
* fix: Set ProcessingStep upload locations deterministically to avoid c… (aws#2790)
* fix: Prevent repack_model script from referencing nonexistent directories (aws#2755)

  Co-authored-by: Payton Staub <[email protected]>
  Co-authored-by: Ahsan Khan <[email protected]>

* fix: S3Input - add support for instance attributes (aws#2754)
* fix: typos and broken link (aws#2765)

  Co-authored-by: Shreya Pandit <[email protected]>

* add all api docs
* add appendix, fix links
* structural changes, fix links
* incorporate feedback
* prepare release v2.72.1
* update development version to v2.72.2.dev0

Co-authored-by: Payton Staub <[email protected]>
Co-authored-by: Payton Staub <[email protected]>
Co-authored-by: Ahsan Khan <[email protected]>
Co-authored-by: Mufaddal Rohawala <[email protected]>
Co-authored-by: Mohamed Ali Jamaoui <[email protected]>
Co-authored-by: Shreya Pandit <[email protected]>
Co-authored-by: ci <ci>
Co-authored-by: Jeniya Tabassum <[email protected]>
1 parent 501c975 commit e06dc26

18 files changed: +3979 −442 lines

doc/api/training/smd_data_parallel.rst (+3 −3)

@@ -1,6 +1,6 @@
-##########################
-Distributed data parallel
-##########################
+###############################################
+The SageMaker Distributed Data Parallel Library
+###############################################
 
 SageMaker's distributed data parallel library extends SageMaker’s training
 capabilities on deep learning models with near-linear scaling efficiency,

doc/api/training/smd_model_parallel.rst (+25 −39)

@@ -1,5 +1,5 @@
-Distributed model parallel
---------------------------
+The SageMaker Distributed Model Parallel Library
+------------------------------------------------
 
 The Amazon SageMaker distributed model parallel library is a model parallelism library for training
 large deep learning models that were previously difficult to train due to GPU memory limitations.
@@ -9,49 +9,35 @@ allowing you to increase prediction accuracy by creating larger models with more
 You can use the library to automatically partition your existing TensorFlow and PyTorch workloads
 across multiple GPUs with minimal code changes. The library's API can be accessed through the Amazon SageMaker SDK.
 
-Use the following sections to learn more about the model parallelism and the library.
-
-Use with the SageMaker Python SDK
-=================================
-
-Use the following page to learn how to configure and enable distributed model parallel
-when you configure an Amazon SageMaker Python SDK `Estimator`.
+See the following sections to learn more about the SageMaker model parallel library APIs.
 
 .. toctree::
-   :maxdepth: 1
+   :maxdepth: 3
 
+   smp_versions/latest
    smd_model_parallel_general
 
-API Documentation
-=================
-
-The library contains a Common API that is shared across frameworks, as well as APIs
-that are specific to supported frameworks, TensorFlow and PyTorch.
-
-Select a version to see the API documentation for version. To use the library, reference the
-**Common API** documentation alongside the framework specific API documentation.
-
-.. toctree::
-   :maxdepth: 1
-
-   smp_versions/latest.rst
-   smp_versions/v1_3_0.rst
-   smp_versions/v1_2_0.rst
-   smp_versions/v1_1_0.rst
-
-It is recommended to use this documentation alongside `SageMaker Distributed Model Parallel
-<http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html>`__ in the Amazon SageMaker
-developer guide. This developer guide documentation includes:
 
-- An overview of model parallelism and the library
-  `core features <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html>`__
-- Instructions on how to modify `TensorFlow
-  <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script.html#model-parallel-customize-training-script-tf>`__
-  and `PyTorch
-  <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script.html#model-parallel-customize-training-script-pt>`__
-  training scripts
-- `Configuration tips and pitfalls
-  <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-tips-pitfalls.html>`__
+.. tip::
+
+   We recommended using this API documentation with the conceptual guide at
+   `SageMaker's Distributed Model Parallel
+   <http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html>`_
+   in the *Amazon SageMaker developer guide*. This developer guide documentation includes:
+
+   - An overview of model parallelism, and the library's
+     `core features <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html>`_,
+     and `extended features for PyTorch <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch.html>`_.
+   - Instructions on how to modify `TensorFlow
+     <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script-tf.html>`_
+     and `PyTorch
+     <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script-pt.html>`_
+     training scripts.
+   - Instructions on how to `run a distributed training job using the SageMaker Python SDK
+     and the SageMaker model parallel library
+     <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html>`_.
+   - `Configuration tips and pitfalls
+     <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-tips-pitfalls.html>`_.
 
 
 .. important::
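As a rough, hypothetical sketch of the SDK-side setup that the revised intro and the new
"run a distributed training job using the SageMaker Python SDK" link describe (not part of this
commit; the role ARN, instance settings, and the smp parameter values below are placeholders), a
PyTorch Estimator enables the library through its ``distribution`` argument:

.. code:: python

   # Hypothetical sketch, not code from this commit: enabling the SageMaker
   # model parallel library from the SageMaker Python SDK.
   from sagemaker.pytorch import PyTorch

   smp_options = {
       "enabled": True,
       "parameters": {        # forwarded to smp.init() in the training script
           "partitions": 2,   # pipeline-parallel degree (placeholder value)
           "microbatches": 4,
           "ddp": True,       # combine model parallelism with data parallelism
       },
   }

   estimator = PyTorch(
       entry_point="train.py",  # your smp-adapted training script
       role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role
       instance_type="ml.p3.16xlarge",
       instance_count=1,
       framework_version="1.8.1",
       py_version="py36",
       distribution={
           "smdistributed": {"modelparallel": smp_options},
           "mpi": {"enabled": True, "processes_per_host": 8},
       },
   )

   estimator.fit("s3://my-bucket/my-training-data")  # placeholder S3 path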

doc/api/training/smd_model_parallel_general.rst (+333 −350)

Large diffs are not rendered by default.

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst (+69 −6)

@@ -1,6 +1,67 @@
-Sagemaker Distributed Model Parallel 1.4.0 Release Notes
+Sagemaker Distributed Model Parallel 1.6.0 Release Notes
 ========================================================
 
+*Date: December. 20. 2021*
+
+**New Features**
+
+- **PyTorch**
+
+  - Added extended memory-saving features for PyTorch 1.8.1:
+
+    - Tensor parallelism
+    - Optimizer state sharding
+    - Activation checkpointing
+    - Activation offloading
+
+  For more information, see the following documentation:
+
+  - `SageMaker distributed model parallel developer guide <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch.html>`_
+  - `SageMaker distributed model parallel API documentation for v1.6.0 <https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest.html>`_
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following
+AWS Deep Learning Container(s):
+
+- Deep Learning Container for PyTorch 1.8.1:
+
+  .. code::
+
+    763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04
+
+----
+
+Release History
+===============
+
+Sagemaker Distributed Model Parallel 1.5.0 Release Notes
+--------------------------------------------------------
+
+*Date: November. 03. 2021*
+
+**New Features**
+
+- **PyTorch**
+
+  - Currency update for PyTorch 1.10.0
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following
+AWS Deep Learning Containers:
+
+- Deep Learning Container for PyTorch 1.10.0:
+
+  .. code::
+
+    763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker
+
+----
+
+Sagemaker Distributed Model Parallel 1.4.0 Release Notes
+--------------------------------------------------------
+
 *Date: June. 29. 2021*
 
 **New Features**
@@ -15,17 +76,19 @@ Sagemaker Distributed Model Parallel 1.4.0 Release Notes
 This version passed benchmark testing and is migrated to the following
 AWS Deep Learning Containers:
 
-- TensorFlow 2.5.0 DLC release: `v1.0-tf-2.5.0-tr-py37
-  <https://github.com/aws/deep-learning-containers/releases/tag/v1.0-tf-2.5.0-tr-py37>`__
+- Deep Learning Container for TensorFlow 2.5.0:
 
   .. code::
 
     763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.5.0-gpu-py37-cu112-ubuntu18.04-v1.0
 
-----
+- Deep Learning Container for PyTorch 1.9.1:
 
-Release History
-===============
+  .. code::
+
+    763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04
+
+----
 
 Sagemaker Distributed Model Parallel 1.3.1 Release Notes
 --------------------------------------------------------
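If you want to pin one of the containers listed in these release notes instead of relying on the
SDK's default image lookup, the only moving part in the URI is the Region. Here is a minimal,
hypothetical helper (not part of this commit; the Region values are placeholders, while the
account and image tag come from the 1.6.0 notes above):

.. code:: python

   # Fill in the <region> placeholder of the PyTorch 1.8.1 DLC URI shown in the
   # 1.6.0 release notes. Hypothetical helper, not part of this commit.
   DLC_TEMPLATE = (
       "763104351884.dkr.ecr.{region}.amazonaws.com/"
       "pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04"
   )

   def smp_dlc_image_uri(region: str = "us-west-2") -> str:
       """Return the PyTorch 1.8.1 training DLC URI for the given AWS Region."""
       return DLC_TEMPLATE.format(region=region)

   print(smp_dlc_image_uri("us-east-1"))
   # 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04

The resulting string can be passed as an estimator's ``image_uri`` argument when you need to train
against exactly this container.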
doc/api/training/smp_versions/archives.rst (+10, new file)

@@ -0,0 +1,10 @@
+.. _smdmp-pt-version-archive:
+
+.. toctree::
+   :maxdepth: 1
+
+   v1_5_0.rst
+   v1_4_0.rst
+   v1_3_0.rst
+   v1_2_0.rst
+   v1_1_0.rst

doc/api/training/smp_versions/latest.rst (+25 −1)

@@ -1,5 +1,16 @@
+###############################################
+Use the Library's API to Adapt Training Scripts
+###############################################
 
-Version 1.4.0 (Latest)
+The library provides Common APIs that you can use across frameworks,
+as well as framework-specific APIs for TensorFlow and PyTorch.
+
+Select the latest or one of the previous versions of the API documentation
+depending on which version of the library you need to use.
+To use the library, reference the
+**Common API** documentation alongside the framework specific API documentation.
+
+Version 1.6.0 (Latest)
 ======================
 
 To use the library, reference the Common API documentation alongside the framework specific API documentation.
@@ -9,4 +20,17 @@ To use the library, reference the Common API documentation alongside the framewo
 
    latest/smd_model_parallel_common_api
    latest/smd_model_parallel_pytorch
+   latest/smd_model_parallel_pytorch_tensor_parallel
    latest/smd_model_parallel_tensorflow
+
+To find archived API documentation for the previous versions of the library,
+see the following link:
+
+
+Documentation Archive
+=====================
+
+.. toctree::
+   :maxdepth: 1
+
+   archives

doc/api/training/smp_versions/latest/smd_model_parallel_common_api.rst (+75 −25)

@@ -1,14 +1,16 @@
-.. admonition:: Contents
-
-   - :ref:`communication_api`
-   - :ref:`mpi_basics`
-
 Common API
 ==========
 
 The following SageMaker distribute model parallel APIs are common across all frameworks.
 
-**Important**: This API document assumes you use the following import statement in your training scripts.
+.. contents:: Table of Contents
+   :depth: 3
+   :local:
+
+The Library's Core APIs
+-----------------------
+
+This API document assumes you use the following import statement in your training scripts.
 
 **TensorFlow**
 
@@ -254,30 +256,78 @@ The following SageMaker distribute model parallel APIs are common across all fra
 .. _mpi_basics:
 
 MPI Basics
-^^^^^^^^^^
+----------
 
 The library exposes the following basic MPI primitives to its Python API:
 
-- ``smp.rank()``: The rank of the current process.
-- ``smp.size()``: The total number of processes.
-- ``smp.mp_rank()``: The rank of the process among the processes that
-  hold the current model replica.
-- ``smp.dp_rank()``: The rank of the process among the processes that
-  hold different replicas of the same model partition.
-- ``smp.dp_size()``: The total number of model replicas.
-- ``smp.local_rank()``: The rank among the processes on the current
-  instance.
-- ``smp.local_size()``: The total number of processes on the current
-  instance.
-- ``smp.get_mp_group()``: The list of ranks over which the current
-  model replica is partitioned.
-- ``smp.get_dp_group()``: The list of ranks that hold different
-  replicas of the same model partition.
-
-.. _communication_api:
+**Global**
+
+- ``smp.rank()`` : The global rank of the current process.
+- ``smp.size()`` : The total number of processes.
+- ``smp.get_world_process_group()`` :
+  ``torch.distributed.ProcessGroup`` that contains all processes.
+- ``smp.CommGroup.WORLD``: The communication group corresponding to all processes.
+- ``smp.local_rank()``: The rank among the processes on the current instance.
+- ``smp.local_size()``: The total number of processes on the current instance.
+- ``smp.get_mp_group()``: The list of ranks over which the current model replica is partitioned.
+- ``smp.get_dp_group()``: The list of ranks that hold different replicas of the same model partition.
+
+**Tensor Parallelism**
+
+- ``smp.tp_rank()`` : The rank of the process within its
+  tensor-parallelism group.
+- ``smp.tp_size()`` : The size of the tensor-parallelism group.
+- ``smp.get_tp_process_group()`` : Equivalent to
+  ``torch.distributed.ProcessGroup`` that contains the processes in the
+  current tensor-parallelism group.
+- ``smp.CommGroup.TP_GROUP`` : The communication group corresponding to
+  the current tensor parallelism group.
+
+**Pipeline Parallelism**
+
+- ``smp.pp_rank()`` : The rank of the process within its
+  pipeline-parallelism group.
+- ``smp.pp_size()`` : The size of the pipeline-parallelism group.
+- ``smp.get_pp_process_group()`` : ``torch.distributed.ProcessGroup``
+  that contains the processes in the current pipeline-parallelism group.
+- ``smp.CommGroup.PP_GROUP`` : The communication group corresponding to
+  the current pipeline parallelism group.
+
+**Reduced-Data Parallelism**
+
+- ``smp.rdp_rank()`` : The rank of the process within its
+  reduced-data-parallelism group.
+- ``smp.rdp_size()`` : The size of the reduced-data-parallelism group.
+- ``smp.get_rdp_process_group()`` : ``torch.distributed.ProcessGroup``
+  that contains the processes in the current reduced data parallelism
+  group.
+- ``smp.CommGroup.RDP_GROUP`` : The communication group corresponding
+  to the current reduced data parallelism group.
+
+**Model Parallelism**
+
+- ``smp.mp_rank()`` : The rank of the process within its model-parallelism
+  group.
+- ``smp.mp_size()`` : The size of the model-parallelism group.
+- ``smp.get_mp_process_group()`` : ``torch.distributed.ProcessGroup``
+  that contains the processes in the current model-parallelism group.
+- ``smp.CommGroup.MP_GROUP`` : The communication group corresponding to
+  the current model parallelism group.
+
+**Data Parallelism**
+
+- ``smp.dp_rank()`` : The rank of the process within its data-parallelism
+  group.
+- ``smp.dp_size()`` : The size of the data-parallelism group.
+- ``smp.get_dp_process_group()`` : ``torch.distributed.ProcessGroup``
+  that contains the processes in the current data-parallelism group.
+- ``smp.CommGroup.DP_GROUP`` : The communication group corresponding to
+  the current data-parallelism group.
+
+.. _communication_api:
 
 Communication API
-^^^^^^^^^^^^^^^^^
+-----------------
 
 The library provides a few communication primitives which can be helpful while
 developing the training script. These primitives use the following
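To see how the primitives in the lists above fit together at runtime, here is a short,
hypothetical training-script snippet (not part of this commit). It assumes the standard
``smdistributed.modelparallel.torch`` import and the ``smp.init()`` call described in the
library's PyTorch documentation:

.. code:: python

   # Hypothetical sketch: log where each process sits in the groups listed above.
   import smdistributed.modelparallel.torch as smp

   smp.init()  # initialize the library before querying ranks or groups

   # Print from one process per instance to keep the training log readable.
   if smp.local_rank() == 0:
       print(
           f"global rank {smp.rank()} of {smp.size()}, "
           f"pp_rank={smp.pp_rank()}, tp_rank={smp.tp_rank()}, "
           f"dp_rank={smp.dp_rank()}, rdp_rank={smp.rdp_rank()}"
       )
       # Ranks holding different replicas of this model partition:
       print("dp group:", smp.get_dp_group())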
