Commit 477fdac

incorporate feedback
1 parent 4c1a1e1 commit 477fdac

File tree

4 files changed: +91 -127 lines changed

4 files changed

+91
-127
lines changed

doc/api/training/smd_model_parallel_general.rst

+86 -122
@@ -2,7 +2,7 @@
 Use with the SageMaker Python SDK
 #################################

-Walk through the following pages to learn about the library's APIs
+Walk through the following pages to learn about the SageMaker model parallel library's APIs
 to configure and enable distributed model parallelism
 through an Amazon SageMaker estimator.

@@ -12,56 +12,19 @@ Configuration Parameters for ``distribution``
 =============================================

 Amazon SageMaker's TensorFlow and PyTorch estimator objects contain a ``distribution`` parameter,
-which is used to enable and specify parameters for the
-initialization of SageMaker distributed training. The library internally uses MPI.
+which you can use to enable and specify parameters for SageMaker distributed training.
+The SageMaker model parallel library internally uses MPI.
 To use model parallelism, both ``smdistributed`` and MPI must be enabled
 through the ``distribution`` parameter.

-The following is an example of how you can launch a new PyTorch training job with the library.
-
-.. code-block:: python
-
-   sagemaker_session = sagemaker.session.Session(boto_session=session)
-
-   mpi_options = {
-      "enabled" : True,
-      "processes_per_host" : 8,
-      "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
-   }
-
-   smp_options = {
-      "enabled":True,
-      "parameters": {
-         "microbatches": 4,
-         "placement_strategy": "spread",
-         "pipeline": "interleaved",
-         "optimize": "speed",
-         "partitions": 2,
-         "ddp": True,
-      }
-   }
-
-   smd_mp_estimator = PyTorch(
-      entry_point="training-script.py", # Pick your train script
-      source_dir='utils',
-      role=role,
-      instance_type='ml.p3.16xlarge',
-      sagemaker_session=sagemaker_session,
-      framework_version='1.6.0',
-      py_version='py3',
-      instance_count=1,
-      distribution={
-         "smdistributed": {"modelparallel": smp_options},
-         "mpi": mpi_options
-      },
-      base_job_name="SMD-MP-demo",
-   )
-
-   smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
-
-For more conceptual guidance, see
-`Run a SageMaker Distributed Model Parallel Training Job <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html>`_
-in the `SageMaker's Distributed Model Parallel developer guide <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html>`_.
+.. tip::
+
+   This page provides you a complete list of parameters you can use
+   when you construct a SageMaker estimator and configure for distributed training.
+
+   To find examples of how to construct a SageMaker estimator with the distributed training parameters, see
+   `Launch a SageMaker Distributed Model Parallel Training Job <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html>`_
+   in the `SageMaker's Distributed Model Parallel developer guide <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html>`_.

 .. contents:: Table of Contents
    :depth: 3
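
For quick reference, a minimal sketch along the lines of the example removed above; the entry point, IAM role, and S3 paths are placeholders, and the instance type, framework version, and parameter values follow the deleted snippet rather than being recommendations:

    import sagemaker
    from sagemaker.pytorch import PyTorch

    sagemaker_session = sagemaker.session.Session()

    mpi_options = {
        "enabled": True,
        "processes_per_host": 8,
    }

    smp_options = {
        "enabled": True,
        "parameters": {
            "microbatches": 4,
            "placement_strategy": "spread",
            "pipeline": "interleaved",
            "optimize": "speed",
            "partitions": 2,
            "ddp": True,
        },
    }

    smd_mp_estimator = PyTorch(
        entry_point="training-script.py",  # placeholder training script
        role="MySageMakerExecutionRole",   # placeholder IAM role
        instance_type="ml.p3.16xlarge",
        instance_count=1,
        framework_version="1.6.0",
        py_version="py3",
        sagemaker_session=sagemaker_session,
        distribution={
            "smdistributed": {"modelparallel": smp_options},
            "mpi": mpi_options,
        },
        base_job_name="SMD-MP-demo",
    )

    smd_mp_estimator.fit("s3://my_bucket/my_training_data/")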
@@ -91,8 +54,8 @@ Common Parameters
     - Type / Valid values
     - Default
     - Description
-   * - ``partitions`` for TensorFlow,
-     ``pipeline_parallel_degree`` for PyTorch v1.8.1 (**smdistributed-modelparallel**>=v1.6)
+   * - ``partitions`` for TensorFlow and PyTorch with smdistributed-modelparallel<v1.6,
+     ``pipeline_parallel_degree`` for PyTorch v1.8.1 with smdistributed-modelparallel>=v1.6
     - int
     -
     - **Required.** The number of partitions to split the model into.
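
To make the version split in this row concrete, a small sketch of how the degree of pipeline parallelism is named in the ``parameters`` dictionary under each library version; the dictionary names and the value 2 are illustrative only:

    # smdistributed-modelparallel < v1.6 (TensorFlow, or earlier PyTorch releases):
    smp_parameters_legacy = {"partitions": 2}

    # smdistributed-modelparallel >= v1.6 with PyTorch v1.8.1:
    smp_parameters_current = {"pipeline_parallel_degree": 2}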
@@ -141,37 +104,6 @@ Common Parameters
     - ``0``
     - **Required** if ``auto_partition`` is false. The partition ID to place operations/modules
       that are not placed in any ``smp.partition`` contexts.
-   * - ``tensor_parallel_degree``
-     - int
-     - 1
-     - The number of devices over which the tensor parallel modules will be distributed.
-       If ``tensor_parallel_degree`` is greater than 1, then ``ddp`` must be set to ``True``.
-   * - ``fp16_params`` (**smdistributed-modelparallel**>=v1.6)
-     - bool
-     - ``False``
-     - If ``True``, the parameters of the distributed modules will be initialized in FP16.
-   * - ``shard_optimizer_state`` (**smdistributed-modelparallel**>=v1.6)
-     - bool
-     - ``False``
-     - If ``True``, the library shards the optimizer state of all parameters across
-       the data parallel processes which hold the same parameter.
-       This optimizer state sharding happens in a balanced manner.
-       Note that when sharding optimizer state, full optimizer saving is not currently supported.
-       Please save partial optimizer state. For more information about saving and loading checkpoints with
-       optimizer state sharding, see `Instructions for Checkpointing with Tensor Parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-saving-loading-checkpoints.html>`_.
-   * - ``prescaled_batch`` (**smdistributed-modelparallel**>=v1.6)
-     - bool
-     - ``False``
-     - If ``True`` and when ``smp.nn.DistributedTransformerLMHead`` is used
-       (this is typically used for GPT-2 or GPT-3 models),
-       the library assumes that the devices in the same tensor parallelism group
-       receive the same input data. Otherwise, it is assumed that they receive
-       different examples. To learn more, see :ref:`prescaled-batch`.
-   * - ``skip_tracing`` (**smdistributed-modelparallel**>=v1.6)
-     - bool
-     - False
-     - Skips the initial tracing step. This can be useful in very large models
-       where even model tracing at the CPU is not possible due to memory constraints.

 TensorFlow-specific Parameters
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -198,47 +130,78 @@ PyTorch-specific Parameters
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. list-table::
-   :widths: 10 20 10 60
-   :header-rows: 1
-
-   * - Parameter
-     - Type / Valid values
-     - Default
-     - Description
-   * - ``memory_weight``
-     - float [0.0, 1.0]
-     - ``0.2`` if ``optimize`` is ``"speed"``, else ``0.8``
-     - The weight of memory balancing in the auto-partitioning objective, as opposed to balancing computational load. If 0.0, the library only tries to balance computation; if 1.0 the library only tries to balance the memory use. Any value in between interpolates between these extremes.
-   * - ``ddp``
-     - bool
-     - ``False``
-     - Must be set to True if hybrid model/data parallelism is used with DistributedDataParallel. DistributedDataParallel is used with NCCL backend, and uses the MASTER_PORT provided by SageMaker.
-   * - ``active_microbatches`` (**smdistributed-modelparallel**>=v1.3)
-     - int
-     - ``partitions`` + 2
-     - This is the maximum number of microbatches that are simultaneously in execution during pipelining. Jointly scaling batch size and number of microbatches can often mitigate the pipeline bubble overhead, but that can lead to increased memory usage if too many microbatches are simultaneously in execution. In such cases setting the number of active microbatches to a lower number can help control memory usage. By default this is set to two plus the number of partitions of the model.
-   * - ``deterministic_server`` (**smdistributed-modelparallel**>=v1.3)
-     - bool
-     - ``False``
-     - Setting this to true ensures that the execution server for pipelining executes requests in the same order across all data parallel ranks.
-   * - ``offload_activations`` (**smdistributed-modelparallel**>=v1.6)
-     - bool
-     - False
-     - Enables activation
-       offloading. To improve GPU memory usage, use activation offloading
-       only when (1) the ``microbatches`` and ``active_microbatches`` are
-       greater than 1, and (2) activation checkpointing is enabled for at
-       least one module in the model.
-   * - ``activation_loading_horizon`` (**smdistributed-modelparallel**>=v1.6)
-     - int
-     - 4
-     - Specify the number
-       of pipeline tasks. This determines how early the activations should
-       be loaded back to the GPU, expressed in number of pipeline tasks.
-       Smaller value indicates that activations are loaded closer in time to
-       when they are needed for backward pass. Setting this value too small
-       might improve memory usage, but might potentially cause throughput
-       loss and GPU bottlenecks during the CPU-to-GPU data transfer.
+   :widths: 10 20 10 60
+   :header-rows: 1
+
+   * - Parameter
+     - Type / Valid values
+     - Default
+     - Description
+   * - ``memory_weight``
+     - float [0.0, 1.0]
+     - ``0.2`` if ``optimize`` is ``"speed"``, else ``0.8``
+     - The weight of memory balancing in the auto-partitioning objective, as opposed to balancing computational load. If 0.0, the library only tries to balance computation; if 1.0 the library only tries to balance the memory use. Any value in between interpolates between these extremes.
+   * - ``ddp``
+     - bool
+     - ``False``
+     - Must be set to True if hybrid model/data parallelism is used with DistributedDataParallel. DistributedDataParallel is used with NCCL backend, and uses the MASTER_PORT provided by SageMaker.
+   * - ``active_microbatches`` (**smdistributed-modelparallel**>=v1.3)
+     - int
+     - ``partitions`` + 2
+     - This is the maximum number of microbatches that are simultaneously in execution during pipelining. Jointly scaling batch size and number of microbatches can often mitigate the pipeline bubble overhead, but that can lead to increased memory usage if too many microbatches are simultaneously in execution. In such cases setting the number of active microbatches to a lower number can help control memory usage. By default this is set to two plus the number of partitions of the model.
+   * - ``deterministic_server`` (**smdistributed-modelparallel**>=v1.3)
+     - bool
+     - ``False``
+     - Setting this to true ensures that the execution server for pipelining executes requests in the same order across all data parallel ranks.
+   * - ``offload_activations`` (**smdistributed-modelparallel**>=v1.6)
+     - bool
+     - False
+     - Enables activation
+       offloading. To improve GPU memory usage, use activation offloading
+       only when (1) the ``microbatches`` and ``active_microbatches`` are
+       greater than 1, and (2) activation checkpointing is enabled for at
+       least one module in the model.
+   * - ``activation_loading_horizon`` (**smdistributed-modelparallel**>=v1.6)
+     - int
+     - 4
+     - Specify the number
+       of pipeline tasks. This determines how early the activations should
+       be loaded back to the GPU, expressed in number of pipeline tasks.
+       Smaller value indicates that activations are loaded closer in time to
+       when they are needed for backward pass. Setting this value too small
+       might improve memory usage, but might potentially cause throughput
+       loss and GPU bottlenecks during the CPU-to-GPU data transfer.
+   * - ``tensor_parallel_degree`` (**smdistributed-modelparallel**>=v1.6)
+     - int
+     - 1
+     - The number of devices over which the tensor parallel modules will be distributed.
+       If ``tensor_parallel_degree`` is greater than 1, then ``ddp`` must be set to ``True``.
+   * - ``fp16_params`` (**smdistributed-modelparallel**>=v1.6)
+     - bool
+     - ``False``
+     - If ``True``, the parameters of the distributed modules will be initialized in FP16.
+   * - ``shard_optimizer_state`` (**smdistributed-modelparallel**>=v1.6)
+     - bool
+     - ``False``
+     - If ``True``, the library shards the optimizer state of all parameters across
+       the data parallel processes which hold the same parameter.
+       This optimizer state sharding happens in a balanced manner.
+       Note that when sharding optimizer state, full optimizer saving is not currently supported.
+       Please save partial optimizer state. For more information about saving and loading checkpoints with
+       optimizer state sharding, see `Instructions for Checkpointing with Tensor Parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-saving-loading-checkpoints.html>`_.
+   * - ``prescaled_batch`` (**smdistributed-modelparallel**>=v1.6)
+     - bool
+     - ``False``
+     - If ``True`` and when ``smp.nn.DistributedTransformerLMHead`` is used
+       (this is typically used for GPT-2 or GPT-3 models),
+       the library assumes that the devices in the same tensor parallelism group
+       receive the same input data. Otherwise, it is assumed that they receive
+       different examples. To learn more, see :ref:`prescaled-batch`.
+   * - ``skip_tracing`` (**smdistributed-modelparallel**>=v1.6)
+     - bool
+     - False
+     - Skips the initial tracing step. This can be useful in very large models
+       where even model tracing at the CPU is not possible due to memory constraints.


 Parameters for ``mpi``
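
Several of the PyTorch-specific rows above interact (for example, ``tensor_parallel_degree`` greater than 1 requires ``ddp``). A sketch of an ``smp_options`` block that combines them, with illustrative values rather than recommendations:

    smp_options = {
        "enabled": True,
        "parameters": {
            "pipeline_parallel_degree": 2,
            "microbatches": 4,
            "ddp": True,                    # required because tensor_parallel_degree > 1
            "tensor_parallel_degree": 4,    # distribute tensor parallel modules over 4 devices
            "fp16_params": True,            # initialize distributed module parameters in FP16
            "shard_optimizer_state": True,  # save partial, not full, optimizer state when checkpointing
            "offload_activations": True,    # needs microbatches/active_microbatches > 1 and
                                            # activation checkpointing on at least one module
        },
    }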
@@ -338,6 +301,7 @@ Placement Strategy with Tensor Parallelism
 In addition to the two placement strategies introduced in the previous section,
 the library provides additional placement strategies for extended tensor parallelism features
 for PyTorch. The additional placement strategies (parallelism types) are denoted as follows:
+
 - ``D`` stands for (reduced) data parallelism.
 - ``P`` stands for pipeline parallelism.
 - ``T`` stands for tensor parallelism.
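
Assuming, as the placement strategy section suggests, that these letters can be combined into a ``placement_strategy`` string when tensor parallelism is enabled, a minimal sketch (the values and the particular ordering shown are illustrative):

    smp_options = {
        "enabled": True,
        "parameters": {
            "pipeline_parallel_degree": 2,
            "tensor_parallel_degree": 2,
            "ddp": True,
            "placement_strategy": "cluster",  # or a permutation of the letters, such as "DPT"
        },
    }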

doc/api/training/smp_versions/archives.rst

-3
@@ -1,8 +1,5 @@
 .. _smdmp-pt-version-archive:

-Documentation Archive
-=====================
-
 .. toctree::
    :maxdepth: 1

doc/api/training/smp_versions/latest.rst

+4
@@ -26,6 +26,10 @@ To use the library, reference the Common API documentation alongside the framewo
 To find archived API documentation for the previous versions of the library,
 see the following link:

+
+Documentation Archive
+=====================
+
 .. toctree::
    :maxdepth: 1


doc/api/training/smp_versions/latest/smd_model_parallel_common_api.rst

+1 -2
@@ -266,8 +266,7 @@ The library exposes the following basic MPI primitives to its Python API:
 - ``smp.size()`` : The total number of processes.
 - ``smp.get_world_process_group()`` :
   ``torch.distributed.ProcessGroup`` that contains all processes.
-- ``[smp.CommGroup.WORLD](https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_common_api.html#smp.CommGroup)``
-  : The communication group corresponding to all processes.
+- ``smp.CommGroup.WORLD``: The communication group corresponding to all processes.
 - ``smp.local_rank()``: The rank among the processes on the current instance.
 - ``smp.local_size()``: The total number of processes on the current instance.
 - ``smp.get_mp_group()``: The list of ranks over which the current model replica is partitioned.
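
For orientation, a minimal sketch of querying these primitives from inside a training script; the import path and the ``smp.init()`` call follow the PyTorch variant of the library, and the print format is illustrative:

    import smdistributed.modelparallel.torch as smp

    smp.init()  # initialize the library before querying rank and group information

    print(
        f"rank {smp.rank()}/{smp.size()} "
        f"(local {smp.local_rank()}/{smp.local_size()}), "
        f"model-parallel group ranks: {smp.get_mp_group()}"
    )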
