Commit ed8308d (1 parent: 1896135)

add appendix, fix links

5 files changed: +257 -37 lines

doc/api/training/smd_data_parallel.rst

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
-##########################
-Distributed data parallel
-##########################
+###############################################
+The SageMaker Distributed Data Parallel Library
+###############################################
 
 SageMaker's distributed data parallel library extends SageMaker’s training
 capabilities on deep learning models with near-linear scaling efficiency,
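
The renamed data parallel library is enabled through a SageMaker estimator's ``distribution`` argument. The following is a minimal sketch of that configuration, not part of this commit; the script name, IAM role, instance settings, and version strings are placeholders.

    # Minimal sketch (not from this commit): enabling the SageMaker distributed
    # data parallel library on a PyTorch estimator. Script name, role, and
    # version strings below are placeholders.
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",          # placeholder training script
        role="MySageMakerRole",          # placeholder IAM role
        instance_type="ml.p3.16xlarge",
        instance_count=2,
        framework_version="1.8.1",       # illustrative version only
        py_version="py36",
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )

    estimator.fit("s3://my_bucket/my_training_data/")  # placeholder S3 path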

doc/api/training/smd_model_parallel.rst

Lines changed: 8 additions & 7 deletions
@@ -1,5 +1,5 @@
-Distributed model parallel
---------------------------
+The SageMaker Distributed Model Parallel Library
+------------------------------------------------
 
 The Amazon SageMaker distributed model parallel library is a model parallelism library for training
 large deep learning models that were previously difficult to train due to GPU memory limitations.

@@ -14,18 +14,19 @@ See the following sections to learn more about the SageMaker model parallel libr
 Use with the SageMaker Python SDK
 =================================
 
-Use the following page to learn how to configure and enable distributed model parallel
-when you construct an Amazon SageMaker Python SDK `Estimator`.
+Walk through the following pages to learn about the library's APIs
+to configure and enable distributed model parallelism
+through an Amazon SageMaker estimator.
 
 .. toctree::
    :maxdepth: 1
 
   smd_model_parallel_general
 
-The library's API to Adapt Training Scripts
-===========================================
+Use the Library's API to Adapt Training Scripts
+===============================================
 
-The library contains a Common API that is shared across frameworks,
+The library provides Common APIs that you can use across frameworks,
 as well as framework-specific APIs for TensorFlow and PyTorch.
 
 Select the latest or one of the previous versions of the API documentation
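
As a rough sketch of the framework-specific entry points mentioned above (assumed from the library's packaging, not shown in this diff), a training script imports the library for its framework and initializes it before using any other API:

    # Assumed import conventions for the model parallel library; pick the one
    # matching the training framework, then initialize the library.
    import smdistributed.modelparallel.torch as smp          # PyTorch variant
    # import smdistributed.modelparallel.tensorflow as smp   # TensorFlow variant

    smp.init()  # initialize the library early in the training script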

doc/api/training/smd_model_parallel_general.rst

Lines changed: 19 additions & 10 deletions
@@ -1,12 +1,13 @@
 .. _sm-sdk-modelparallel-params:
 
-Required SageMaker Python SDK parameters
-========================================
+Configuration Parameters for Model Parallelism
+==============================================
 
-The TensorFlow and PyTorch ``Estimator`` objects contains a ``distribution`` parameter,
+Amazon SageMaker's TensorFlow and PyTorch estimator objects contain a ``distribution`` parameter,
 which is used to enable and specify parameters for the
-initialization of the SageMaker distributed model parallel library. The library internally uses MPI,
-so in order to use model parallelism, MPI must also be enabled using the ``distribution`` parameter.
+initialization of SageMaker distributed training. The library internally uses MPI.
+To use model parallelism, both ``smdistributed`` and MPI must be enabled
+through the ``distribution`` parameter.
 
 The following is an example of how you can launch a new PyTorch training job with the library.
 

@@ -50,12 +51,20 @@ The following is an example of how you can launch a new PyTorch training job wit
 
    smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
 
-Parameters for ``modelparallel``
-================================
+For more conceptual guidance, see
+`Run a SageMaker Distributed Model Parallel Training Job <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html>`_
+in the `SageMaker's Distributed Model Parallel developer guide <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html>`_.
+
+.. contents:: Table of Contents
+   :depth: 3
+   :local:
+
+Parameters for ``"smdistributed"``
+----------------------------------
 
 You can use the following parameters to initialize the library
 configuring a dictionary for ``modelparallel``, which goes
-into the ``smdistributed`` of ``distribution``.
+into the ``smdistributed`` option for the ``distribution`` parameter.
 
 .. note::
 

@@ -224,8 +233,8 @@ PyTorch-specific Parameters
 loss and GPU bottlenecks during the CPU-to-GPU data transfer.
 
 
-``mpi`` Parameters
-==================
+Parameters for ``mpi``
+----------------------
 
 For the ``"mpi"`` key, a dict must be passed which contains:

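To make the two parameter groups above concrete, the following sketch shows one plausible shape of the ``distribution`` argument with both the ``modelparallel`` dictionary and the ``mpi`` dictionary enabled. The parameter names (``partitions``, ``microbatches``, ``processes_per_host``) come from the library's documented options elsewhere; the values and estimator settings are illustrative only, not part of this commit.

    # Illustrative sketch of the ``distribution`` argument described above.
    # All values below are placeholders, not recommendations.
    from sagemaker.pytorch import PyTorch

    smd_mp_estimator = PyTorch(
        entry_point="train.py",        # placeholder training script
        role="MySageMakerRole",        # placeholder IAM role
        instance_type="ml.p3.16xlarge",
        instance_count=1,
        framework_version="1.8.1",     # illustrative version only
        py_version="py36",
        distribution={
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {    # the ``modelparallel`` dictionary
                        "partitions": 2,
                        "microbatches": 4,
                    },
                }
            },
            "mpi": {
                "enabled": True,       # MPI must also be enabled
                "processes_per_host": 8,
            },
        },
    )

    smd_mp_estimator.fit("s3://my_bucket/my_training_data/")  # placeholder S3 path
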
doc/api/training/smp_versions/latest/smd_model_parallel_common_api.rst

Lines changed: 10 additions & 8 deletions
@@ -1,14 +1,16 @@
-.. admonition:: Contents
-
-   - :ref:`communication_api`
-   - :ref:`mpi_basics`
-
 Common API
 ==========
 
 The following SageMaker distribute model parallel APIs are common across all frameworks.
 
-**Important**: This API document assumes you use the following import statement in your training scripts.
+.. contents:: Table of Contents
+   :depth: 3
+   :local:
+
+The Library's Core APIs
+-----------------------
+
+This API document assumes you use the following import statement in your training scripts.
 
 **TensorFlow**
 

@@ -254,7 +256,7 @@ The following SageMaker distribute model parallel APIs are common across all fra
 .. _mpi_basics:
 
 MPI Basics
-^^^^^^^^^^
+----------
 
 The library exposes the following basic MPI primitives to its Python API:
 

@@ -326,7 +328,7 @@ The library exposes the following basic MPI primitives to its Python API:
 .. _communication_api:
 
 Communication API
-^^^^^^^^^^^^^^^^^
+-----------------
 
 The library provides a few communication primitives which can be helpful while
 developing the training script. These primitives use the following

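As an illustration of the MPI Basics and Communication API sections retitled above, the sketch below exercises the kind of rank and synchronization primitives the library exposes after initialization. Treat the exact call names as assumptions confirmed only by the full API reference, not by this diff.

    # Sketch of the library's MPI-style primitives (names assumed for illustration).
    import smdistributed.modelparallel.torch as smp

    smp.init()  # initialize the library before any other call

    world_rank = smp.rank()        # global rank of this process
    world_size = smp.size()        # total number of processes
    local_rank = smp.local_rank()  # rank of this process on its own instance
    mp_rank = smp.mp_rank()        # rank within the model-parallel group
    dp_rank = smp.dp_rank()        # rank within the data-parallel group

    if world_rank == 0:
        print(f"{world_size} processes: local={local_rank}, mp={mp_rank}, dp={dp_rank}")

    smp.barrier()  # synchronize all processes before continuing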