 PyTorch Guide to SageMaker's distributed data parallel library
 ##############################################################

-.. admonition:: Contents
+Use this guide to learn about the SageMaker distributed
+data parallel library API for PyTorch.

-   - :ref:`pytorch-sdp-modify`
-   - :ref:`pytorch-sdp-api`
+.. contents:: Topics
+   :depth: 3
+   :local:

 .. _pytorch-sdp-modify:

@@ -55,7 +57,7 @@ API offered for PyTorch.


 -  Modify the ``torch.utils.data.distributed.DistributedSampler`` to
-   include the cluster’s information. Set``num_replicas`` to the
+   include the cluster’s information. Set ``num_replicas`` to the
    total number of GPUs participating in training across all the nodes
    in the cluster. This is called ``world_size``. You can get
    ``world_size`` with
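
For orientation, the sampler setup this hunk describes could look like the following minimal sketch. It uses the library's ``get_world_size()`` helper mentioned in the hunk; the dataset is a placeholder, and the ``get_rank()`` call is an assumption (taken to mirror ``torch.distributed``) rather than something shown in the excerpt above.

.. code:: python

   import torch
   import smdistributed.dataparallel.torch.distributed as dist

   dist.init_process_group()

   # Placeholder dataset, for illustration only
   train_dataset = torch.utils.data.TensorDataset(torch.randn(64, 10))

   # num_replicas: total number of GPUs across all nodes (the world size)
   # rank: this process's global rank (get_rank() is assumed to mirror torch.distributed)
   train_sampler = torch.utils.data.distributed.DistributedSampler(
       train_dataset,
       num_replicas=dist.get_world_size(),
       rank=dist.get_rank(),
   )

   train_loader = torch.utils.data.DataLoader(
       train_dataset, batch_size=16, sampler=train_sampler
   )
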
@@ -110,7 +112,7 @@ you will have for distributed training with the distributed data parallel librar
 def main():

     # Scale batch size by world size
-    batch_size //= dist.get_world_size() // 8
+    batch_size //= dist.get_world_size()
     batch_size = max(batch_size, 1)

     # Prepare dataset
@@ -153,9 +155,132 @@ you will have for distributed training with the distributed data parallel librar
 PyTorch API
 ===========

-.. rubric:: Supported versions
+.. class:: smdistributed.dataparallel.torch.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, broadcast_buffers=True, process_group=None, bucket_cap_mb=None)
+
+   ``smdistributed.dataparallel``'s implementation of distributed data
+   parallelism for PyTorch. In most cases, wrapping your PyTorch Module
+   with ``smdistributed.dataparallel``'s ``DistributedDataParallel`` (DDP) is
+   all you need to do to use ``smdistributed.dataparallel``.
+
+   Creating this DDP class requires that ``smdistributed.dataparallel``
+   already be initialized
+   with ``smdistributed.dataparallel.torch.distributed.init_process_group()``.
+
+   This container parallelizes the application of the given module by
+   splitting the input across the specified devices, chunking it along the
+   batch dimension. The module is replicated on each machine and each
+   device, and each such replica handles a portion of the input. During the
+   backwards pass, gradients from each node are averaged.
+
+   The batch size should be larger than the number of GPUs used locally.
+
+   Example usage
+   of ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``:
+
+   .. code:: python
+
+      import torch
+      import smdistributed.dataparallel.torch.distributed as dist
+      from smdistributed.dataparallel.torch.parallel import DistributedDataParallel as DDP
+
+      dist.init_process_group()
+
+      # Pin this process to a single GPU using its local rank (one GPU per process)
+      torch.cuda.set_device(dist.get_local_rank())
+
+      # Build model and optimizer
+      model = ...
+      optimizer = torch.optim.SGD(model.parameters(),
+                                  lr=1e-3 * dist.get_world_size())
+      # Wrap model with smdistributed.dataparallel's DistributedDataParallel
+      model = DDP(model)
+
+   **Parameters:**
+
+   - ``module (torch.nn.Module)(required):`` PyTorch NN Module to be
+     parallelized.
+   - ``device_ids (list[int])(optional):`` CUDA devices. This should only
+     be provided when the input module resides on a single CUDA device.
+     For single-device modules, the i-th module replica is placed on
+     ``device_ids[i]``. For multi-device modules and CPU modules,
+     ``device_ids`` must be ``None`` or an empty list, and input data for
+     the forward pass must be placed on the correct device. Defaults
+     to ``None``.
+   - ``output_device (int)(optional):`` Device location of output for
+     single-device CUDA modules. For multi-device modules and CPU modules,
+     it must be ``None``, and the module itself dictates the output location
+     (default: ``device_ids[0]`` for single-device modules). Defaults
+     to ``None``.
+   - ``broadcast_buffers (bool)(optional):`` Flag that enables syncing
+     (broadcasting) buffers of the module at the beginning of the forward
+     function. ``smdistributed.dataparallel`` does not support broadcast
+     buffers yet, so set this to ``False``.
+   - ``process_group (smdistributed.dataparallel.torch.distributed.group)(optional):`` Process
+     groups are not supported in ``smdistributed.dataparallel``. This
+     parameter exists for API parity with ``torch.distributed`` only. The only
+     supported value is
+     ``smdistributed.dataparallel.torch.distributed.group.WORLD``. Defaults
+     to ``None``.
+   - ``bucket_cap_mb (int)(optional):`` DistributedDataParallel buckets
+     parameters into multiple buckets so that the gradient reduction of
+     each bucket can potentially overlap with backward
+     computation. ``bucket_cap_mb`` controls the bucket size in
+     megabytes (MB) (default: 25).
+
+   .. note::
+
+      This module assumes all parameters are registered in the model by the
+      time it is created. No parameters should be added or removed later.
+
+   .. note::
+
+      This module assumes that all parameters are registered in the model of
+      each distributed process in the same order. The module itself
+      will conduct gradient all-reduction following the reverse order of
+      the registered parameters of the model. In other words, it is the user’s
+      responsibility to ensure that each distributed process has the exact
+      same model and thus the exact same parameter registration order.
+
+   .. note::
+
+      You should never change the set of your model’s parameters after
+      wrapping your model with DistributedDataParallel. In other words,
+      when you wrap your model with DistributedDataParallel, the
+      constructor of DistributedDataParallel registers additional
+      gradient reduction functions on all the parameters of the model
+      itself at the time of construction. If you change the model’s
+      parameters after the DistributedDataParallel construction, this is
+      not supported and unexpected behaviors can happen, since some
+      parameters’ gradient reduction functions might not get called.
+
+   .. method:: no_sync()
+
+      ``smdistributed.dataparallel`` supports the `PyTorch DDP no_sync() <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.no_sync>`_
+      context manager. It enables gradient accumulation by skipping AllReduce
+      during training iterations inside the context.
+
+      .. note::
+
+         The ``no_sync()`` context manager is available from smdistributed-dataparallel v1.2.2.
+         To find the release note, see :ref:`sdp_1.2.2_release_note`.

-**PyTorch 1.7.1, 1.8.1**
+      **Example:**
+
+      .. code:: python
+
+         # Gradients are accumulated while inside no_sync context
+         with model.no_sync():
+             ...
+             loss.backward()
+
+         # First iteration upon exiting context
+         # Incoming gradients are added to the accumulated gradients and then synchronized via AllReduce
+         ...
+         loss.backward()
+
+         # Update weights and reset gradients to zero after accumulation is finished
+         optimizer.step()
+         optimizer.zero_grad()


 .. function:: smdistributed.dataparallel.torch.distributed.is_available()
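
To complement the hunk above, here is a minimal sketch that passes the documented optional arguments explicitly. The model is a placeholder; ``broadcast_buffers=False`` follows the guidance in the parameter list, and ``bucket_cap_mb=25`` simply restates the documented default.

.. code:: python

   import torch
   import smdistributed.dataparallel.torch.distributed as dist
   from smdistributed.dataparallel.torch.parallel import DistributedDataParallel as DDP

   dist.init_process_group()
   torch.cuda.set_device(dist.get_local_rank())

   model = torch.nn.Linear(10, 1).cuda()  # placeholder model

   # broadcast_buffers=False because buffer broadcast is not yet supported;
   # bucket_cap_mb=25 keeps the documented default gradient-bucket size.
   model = DDP(model, broadcast_buffers=False, bucket_cap_mb=25)

   optimizer = torch.optim.SGD(model.parameters(), lr=1e-3 * dist.get_world_size())
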
@@ -409,99 +534,6 @@ PyTorch API
    otherwise.


-.. class:: smdistributed.dataparallel.torch.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, broadcast_buffers=True, process_group=None, bucket_cap_mb=None)
-
-   ``smdistributed.dataparallel's`` implementation of distributed data
-   parallelism for PyTorch. In most cases, wrapping your PyTorch Module
-   with ``smdistributed.dataparallel's`` ``DistributedDataParallel (DDP)`` is
-   all you need to do to use ``smdistributed.dataparallel``.
-
-   Creation of this DDP class requires ``smdistributed.dataparallel``
-   already initialized
-   with ``smdistributed.dataparallel.torch.distributed.init_process_group()``.
-
-   This container parallelizes the application of the given module by
-   splitting the input across the specified devices by chunking in the
-   batch dimension. The module is replicated on each machine and each
-   device, and each such replica handles a portion of the input. During the
-   backwards pass, gradients from each node are averaged.
-
-   The batch size should be larger than the number of GPUs used locally.
-
-   Example usage
-   of ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``:
-
-   .. code:: python
-
-      import torch
-      import smdistributed.dataparallel.torch.distributed as dist
-      from smdistributed.dataparallel.torch.parallel import DistributedDataParallel as DDP
-
-      dist.init_process_group()
-
-      # Pin GPU to be used to process local rank (one GPU per process)
-      torch.cuda.set_device(dist.get_local_rank())
-
-      # Build model and optimizer
-      model = ...
-      optimizer = torch.optim.SGD(model.parameters(),
-                                  lr=1e-3 * dist.get_world_size())
-      # Wrap model with smdistributed.dataparallel's DistributedDataParallel
-      model = DDP(model)
-
-   **Parameters:**
-
-   - ``module (torch.nn.Module)(required):`` PyTorch NN Module to be
-     parallelized
-   - ``device_ids (list[int])(optional):`` CUDA devices. This should only
-     be provided when the input module resides on a single CUDA device.
-     For single-device modules,
-     the ``ith module replica is placed on device_ids[i]``. For
-     multi-device modules and CPU modules, device_ids must be None or an
-     empty list, and input data for the forward pass must be placed on the
-     correct device. Defaults to ``None``.
-   - ``output_device (int)(optional):`` Device location of output for
-     single-device CUDA modules. For multi-device modules and CPU modules,
-     it must be None, and the module itself dictates the output location.
-     (default: device_ids[0] for single-device modules). Defaults
-     to ``None``.
-   - ``broadcast_buffers (bool)(optional):`` Flag that enables syncing
-     (broadcasting) buffers of the module at beginning of the forward
-     function. ``smdistributed.dataparallel`` does not support broadcast
-     buffer yet. Please set this to ``False``.
-   - ``process_group(smdistributed.dataparallel.torch.distributed.group)(optional):`` Process
-     group is not supported in ``smdistributed.dataparallel``. This
-     parameter exists for API parity with torch.distributed only. Only
-     supported value is
-     ``smdistributed.dataparallel.torch.distributed.group.WORLD.`` Defaults
-     to ``None.``
-   - ``bucket_cap_mb (int)(optional):`` DistributedDataParallel will
-     bucket parameters into multiple buckets so that gradient reduction of
-     each bucket can potentially overlap with backward
-     computation. ``bucket_cap_mb`` controls the bucket size in
-     MegaBytes (MB) (default: 25).
-
-   .. rubric:: Notes
-
-   - This module assumes all parameters are registered in the model by the
-     time it is created. No parameters should be added nor removed later.
-   - This module assumes all parameters are registered in the model of
-     each distributed processes are in the same order. The module itself
-     will conduct gradient all-reduction following the reverse order of
-     the registered parameters of the model. In other words, it is users’
-     responsibility to ensure that each distributed process has the exact
-     same model and thus the exact same parameter registration order.
-   - You should never change the set of your model’s parameters after
-     wrapping up your model with DistributedDataParallel. In other words,
-     when wrapping up your model with DistributedDataParallel, the
-     constructor of DistributedDataParallel will register the additional
-     gradient reduction functions on all the parameters of the model
-     itself at the time of construction. If you change the model’s
-     parameters after the DistributedDataParallel construction, this is
-     not supported and unexpected behaviors can happen, since some
-     parameters’ gradient reduction functions might not get called.
-
-
 .. class:: smdistributed.dataparallel.torch.distributed.ReduceOp

    An enum-like class for supported reduction operations
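
The excerpt ends as it reaches ``ReduceOp``. For illustration only, here is a sketch of how a reduction op is typically passed to a collective call; it assumes ``all_reduce`` and ``get_rank()`` are exposed by ``smdistributed.dataparallel.torch.distributed`` with the same signatures as their ``torch.distributed`` counterparts, which is not shown in the excerpt above.

.. code:: python

   import torch
   import smdistributed.dataparallel.torch.distributed as dist

   dist.init_process_group()
   torch.cuda.set_device(dist.get_local_rank())

   # Sum a one-element tensor across all workers.
   # all_reduce and get_rank are assumed to mirror torch.distributed.
   t = torch.ones(1).cuda()
   dist.all_reduce(t, op=dist.ReduceOp.SUM)
   print(f"rank {dist.get_rank()}: {t.item()}")  # equals world_size on every rank
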