
Commit 71c5617

doc: Enhance smddp 1.2.2 doc (#2852)
1 parent ccfcbe7 commit 71c5617

File tree

4 files changed: +178 −117 lines


doc/Makefile

+1 −1
@@ -3,7 +3,7 @@
 
 # You can set these variables from the command line.
 SPHINXOPTS = -W
-SPHINXBUILD = python -msphinx
+SPHINXBUILD = python -msphinx
 SPHINXPROJ = sagemaker
 SOURCEDIR = .
 BUILDDIR = _build

doc/api/training/sdp_versions/latest/smd_data_parallel_pytorch.rst

+132 −100
@@ -2,10 +2,12 @@
 PyTorch Guide to SageMaker's distributed data parallel library
 ##############################################################
 
-.. admonition:: Contents
+Use this guide to learn about the SageMaker distributed
+data parallel library API for PyTorch.
 
-- :ref:`pytorch-sdp-modify`
-- :ref:`pytorch-sdp-api`
+.. contents:: Topics
+   :depth: 3
+   :local:
 
 .. _pytorch-sdp-modify:
 
@@ -55,7 +57,7 @@ API offered for PyTorch.
 
 
 - Modify the ``torch.utils.data.distributed.DistributedSampler`` to
-  include the cluster’s information. Set``num_replicas`` to the
+  include the cluster’s information. Set ``num_replicas`` to the
   total number of GPUs participating in training across all the nodes
   in the cluster. This is called ``world_size``. You can get
   ``world_size`` with
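A minimal sketch of the sampler setup described in the hunk above may help; it is not part of this commit. It assumes ``smdistributed.dataparallel.torch.distributed`` also exposes ``get_rank()`` (only ``get_world_size()`` and ``get_local_rank()`` appear elsewhere in this diff) and uses a placeholder dataset:

    # Hypothetical illustration (not part of this commit): configure the sampler
    # with the cluster's world size and this process's global rank.
    import torch
    import smdistributed.dataparallel.torch.distributed as dist
    from torch.utils.data.distributed import DistributedSampler

    dist.init_process_group()

    dataset = ...  # any torch.utils.data.Dataset

    sampler = DistributedSampler(
        dataset,
        num_replicas=dist.get_world_size(),  # total GPUs across all nodes
        rank=dist.get_rank(),                # assumed global-rank accessor
    )
    loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)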
@@ -110,7 +112,7 @@ you will have for distributed training with the distributed data parallel librar
 def main():
 
     # Scale batch size by world size
-    batch_size //= dist.get_world_size() // 8
+    batch_size //= dist.get_world_size()
     batch_size = max(batch_size, 1)
 
     # Prepare dataset
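For context on the change above: with 8 GPUs per node, ``dist.get_world_size() // 8`` is the number of nodes, so the old line divided the configured batch size by the node count rather than the total GPU count and inflated the effective global batch. A quick arithmetic illustration (not part of the commit), assuming a configured batch size of 256 on 2 nodes with 8 GPUs each (world_size = 16):

    # Illustration only: effect of the corrected per-GPU batch-size scaling.
    batch_size = 256
    world_size = 16  # 2 nodes x 8 GPUs

    old_per_gpu = batch_size // (world_size // 8)   # 256 // 2  = 128 per GPU (too large)
    new_per_gpu = max(batch_size // world_size, 1)  # 256 // 16 = 16 per GPU, global batch stays 256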
@@ -153,9 +155,132 @@ you will have for distributed training with the distributed data parallel librar
 PyTorch API
 ===========
 
-.. rubric:: Supported versions
+.. class:: smdistributed.dataparallel.torch.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, broadcast_buffers=True, process_group=None, bucket_cap_mb=None)
+
+   ``smdistributed.dataparallel``'s implementation of distributed data
+   parallelism for PyTorch. In most cases, wrapping your PyTorch Module
+   with ``smdistributed.dataparallel``'s ``DistributedDataParallel`` (DDP) is
+   all you need to do to use ``smdistributed.dataparallel``.
+
+   Creation of this DDP class requires ``smdistributed.dataparallel``
+   to be already initialized
+   with ``smdistributed.dataparallel.torch.distributed.init_process_group()``.
+
+   This container parallelizes the application of the given module by
+   splitting the input across the specified devices, chunking along the
+   batch dimension. The module is replicated on each machine and each
+   device, and each such replica handles a portion of the input. During the
+   backward pass, gradients from each node are averaged.
+
+   The batch size should be larger than the number of GPUs used locally.
+
+   Example usage
+   of ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``:
+
+   .. code:: python
+
+      import torch
+      import smdistributed.dataparallel.torch.distributed as dist
+      from smdistributed.dataparallel.torch.parallel import DistributedDataParallel as DDP
+
+      dist.init_process_group()
+
+      # Pin GPU to be used to process local rank (one GPU per process)
+      torch.cuda.set_device(dist.get_local_rank())
+
+      # Build model and optimizer
+      model = ...
+      optimizer = torch.optim.SGD(model.parameters(),
+                                  lr=1e-3 * dist.get_world_size())
+      # Wrap model with smdistributed.dataparallel's DistributedDataParallel
+      model = DDP(model)
+
+   **Parameters:**
+
+   - ``module (torch.nn.Module)(required):`` PyTorch NN Module to be
+     parallelized.
+   - ``device_ids (list[int])(optional):`` CUDA devices. This should only
+     be provided when the input module resides on a single CUDA device.
+     For single-device modules,
+     the i-th module replica is placed on ``device_ids[i]``. For
+     multi-device modules and CPU modules, ``device_ids`` must be ``None`` or an
+     empty list, and input data for the forward pass must be placed on the
+     correct device. Defaults to ``None``.
+   - ``output_device (int)(optional):`` Device location of output for
+     single-device CUDA modules. For multi-device modules and CPU modules,
+     it must be ``None``, and the module itself dictates the output location
+     (default: ``device_ids[0]`` for single-device modules). Defaults
+     to ``None``.
+   - ``broadcast_buffers (bool)(optional):`` Flag that enables syncing
+     (broadcasting) buffers of the module at the beginning of the forward
+     function. ``smdistributed.dataparallel`` does not support broadcast
+     buffers yet, so set this to ``False``.
+   - ``process_group (smdistributed.dataparallel.torch.distributed.group)(optional):`` Process
+     groups are not supported in ``smdistributed.dataparallel``. This
+     parameter exists for API parity with ``torch.distributed`` only. The only
+     supported value is
+     ``smdistributed.dataparallel.torch.distributed.group.WORLD``. Defaults
+     to ``None``.
+   - ``bucket_cap_mb (int)(optional):`` ``DistributedDataParallel`` buckets
+     parameters into multiple buckets so that the gradient reduction of
+     each bucket can potentially overlap with the backward
+     computation. ``bucket_cap_mb`` controls the bucket size in
+     megabytes (MB) (default: 25).
+
+   .. note::
+
+      This module assumes all parameters are registered in the model by the
+      time it is created. No parameters should be added or removed later.
+
+   .. note::
+
+      This module assumes all parameters are registered in the model of
+      each distributed process in the same order. The module itself
+      conducts gradient AllReduce following the reverse order of
+      the registered parameters of the model. In other words, it is the user's
+      responsibility to ensure that each distributed process has the exact
+      same model and thus the exact same parameter registration order.
+
+   .. note::
+
+      You should never change the set of your model's parameters after
+      wrapping your model with DistributedDataParallel. In other words,
+      when you wrap your model with DistributedDataParallel, the
+      constructor of DistributedDataParallel registers additional
+      gradient reduction functions on all the parameters of the model
+      itself at the time of construction. If you change the model's
+      parameters after the DistributedDataParallel construction, this is
+      not supported and unexpected behavior can happen, since some
+      parameters' gradient reduction functions might not get called.
+
+   .. method:: no_sync()
+
+      ``smdistributed.dataparallel`` supports the `PyTorch DDP no_sync() <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.no_sync>`_
+      context manager. It enables gradient accumulation by skipping AllReduce
+      during training iterations inside the context.
+
+      .. note::
+
+         The ``no_sync()`` context manager is available from smdistributed-dataparallel v1.2.2.
+         To find the release note, see :ref:`sdp_1.2.2_release_note`.
 
-**PyTorch 1.7.1, 1.8.1**
+      **Example:**
+
+      .. code:: python
+
+         # Gradients are accumulated while inside the no_sync context
+         with model.no_sync():
+             ...
+             loss.backward()
+
+         # First iteration upon exiting the context: incoming gradients are
+         # added to the accumulated gradients and then synchronized via AllReduce
+         ...
+         loss.backward()
+
+         # Update weights and reset gradients to zero after accumulation is finished
+         optimizer.step()
+         optimizer.zero_grad()
 
 
 .. function:: smdistributed.dataparallel.torch.distributed.is_available()
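Because the added ``no_sync()`` example uses ``...`` placeholders, a fuller, hedged sketch of a gradient-accumulation loop may be useful. It is not part of this commit; the model, loss, data loader, and accumulation factor are illustrative only:

    # Illustrative gradient-accumulation loop with no_sync() -- not part of this commit.
    import torch
    import smdistributed.dataparallel.torch.distributed as dist
    from smdistributed.dataparallel.torch.parallel import DistributedDataParallel as DDP

    dist.init_process_group()
    torch.cuda.set_device(dist.get_local_rank())

    model = DDP(torch.nn.Linear(10, 1).cuda())
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3 * dist.get_world_size())
    loss_fn = torch.nn.MSELoss()
    accumulation_steps = 4  # illustrative value

    for step, (x, y) in enumerate(train_loader):  # train_loader is assumed to exist
        x, y = x.cuda(), y.cuda()
        if (step + 1) % accumulation_steps != 0:
            # Inside no_sync(): AllReduce is skipped, gradients accumulate locally
            with model.no_sync():
                loss_fn(model(x), y).backward()
        else:
            # On this backward pass AllReduce runs, synchronizing the
            # accumulated gradients across all GPUs
            loss_fn(model(x), y).backward()
            optimizer.step()
            optimizer.zero_grad()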
@@ -409,99 +534,6 @@ PyTorch API
    otherwise.
 
 
-.. class:: smdistributed.dataparallel.torch.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, broadcast_buffers=True, process_group=None, bucket_cap_mb=None)
-
-   ``smdistributed.dataparallel's`` implementation of distributed data
-   parallelism for PyTorch. In most cases, wrapping your PyTorch Module
-   with ``smdistributed.dataparallel's`` ``DistributedDataParallel (DDP)`` is
-   all you need to do to use ``smdistributed.dataparallel``.
-
-   Creation of this DDP class requires ``smdistributed.dataparallel``
-   already initialized
-   with ``smdistributed.dataparallel.torch.distributed.init_process_group()``.
-
-   This container parallelizes the application of the given module by
-   splitting the input across the specified devices by chunking in the
-   batch dimension. The module is replicated on each machine and each
-   device, and each such replica handles a portion of the input. During the
-   backwards pass, gradients from each node are averaged.
-
-   The batch size should be larger than the number of GPUs used locally.
-
-   Example usage
-   of ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``:
-
-   .. code:: python
-
-      import torch
-      import smdistributed.dataparallel.torch.distributed as dist
-      from smdistributed.dataparallel.torch.parallel import DistributedDataParallel as DDP
-
-      dist.init_process_group()
-
-      # Pin GPU to be used to process local rank (one GPU per process)
-      torch.cuda.set_device(dist.get_local_rank())
-
-      # Build model and optimizer
-      model = ...
-      optimizer = torch.optim.SGD(model.parameters(),
-                                  lr=1e-3 * dist.get_world_size())
-      # Wrap model with smdistributed.dataparallel's DistributedDataParallel
-      model = DDP(model)
-
-   **Parameters:**
-
-   - ``module (torch.nn.Module)(required):`` PyTorch NN Module to be
-     parallelized
-   - ``device_ids (list[int])(optional):`` CUDA devices. This should only
-     be provided when the input module resides on a single CUDA device.
-     For single-device modules,
-     the ``ith module replica is placed on device_ids[i]``. For
-     multi-device modules and CPU modules, device_ids must be None or an
-     empty list, and input data for the forward pass must be placed on the
-     correct device. Defaults to ``None``.
-   - ``output_device (int)(optional):`` Device location of output for
-     single-device CUDA modules. For multi-device modules and CPU modules,
-     it must be None, and the module itself dictates the output location.
-     (default: device_ids[0] for single-device modules).  Defaults
-     to ``None``.
-   - ``broadcast_buffers (bool)(optional):`` Flag that enables syncing
-     (broadcasting) buffers of the module at beginning of the forward
-     function. ``smdistributed.dataparallel`` does not support broadcast
-     buffer yet. Please set this to ``False``.
-   - ``process_group(smdistributed.dataparallel.torch.distributed.group)(optional):`` Process
-     group is not supported in ``smdistributed.dataparallel``. This
-     parameter exists for API parity with torch.distributed only. Only
-     supported value is
-     ``smdistributed.dataparallel.torch.distributed.group.WORLD.`` Defaults
-     to ``None.``
-   - ``bucket_cap_mb (int)(optional):`` DistributedDataParallel will
-     bucket parameters into multiple buckets so that gradient reduction of
-     each bucket can potentially overlap with backward
-     computation. ``bucket_cap_mb`` controls the bucket size in
-     MegaBytes (MB) (default: 25).
-
-   .. rubric:: Notes
-
-   - This module assumes all parameters are registered in the model by the
-     time it is created. No parameters should be added nor removed later.
-   - This module assumes all parameters are registered in the model of
-     each distributed processes are in the same order. The module itself
-     will conduct gradient all-reduction following the reverse order of
-     the registered parameters of the model. In other words, it is users’
-     responsibility to ensure that each distributed process has the exact
-     same model and thus the exact same parameter registration order.
-   - You should never change the set of your model’s parameters after
-     wrapping up your model with DistributedDataParallel. In other words,
-     when wrapping up your model with DistributedDataParallel, the
-     constructor of DistributedDataParallel will register the additional
-     gradient reduction functions on all the parameters of the model
-     itself at the time of construction. If you change the model’s
-     parameters after the DistributedDataParallel construction, this is
-     not supported and unexpected behaviors can happen, since some
-     parameters’ gradient reduction functions might not get called.
-
-
 .. class:: smdistributed.dataparallel.torch.distributed.ReduceOp
 
    An enum-like class for supported reduction operations

doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst

-4
@@ -155,10 +155,6 @@ script you will have for distributed training with the library.
 TensorFlow API
 ==============
 
-.. rubric:: Supported versions
-
-**TensorFlow 2.3.1, 2.4.1, 2.5.0**
-
 .. function:: smdistributed.dataparallel.tensorflow.init()
 
    Initialize ``smdistributed.dataparallel``. Must be called at the
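The truncated context above belongs to the ``smdistributed.dataparallel.tensorflow.init()`` description. A minimal usage sketch, not part of this commit and assuming the library's Horovod-style ``local_rank()`` accessor, could look like this:

    # Minimal sketch (assumption-based): initialize the library and pin one GPU per process.
    import tensorflow as tf
    import smdistributed.dataparallel.tensorflow as sdp

    sdp.init()

    # sdp.local_rank() is assumed here; it indexes the GPU for this process.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], "GPU")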
