Commit 83d754e

documentation: add doc for smddp collectives
1 parent 558e3f8 commit 83d754e

2 files changed (+111 -5)

doc/frameworks/pytorch/using_pytorch.rst (+65 -2)
@@ -140,6 +140,8 @@ The function of installing packages using ``requirements.txt`` is supported for
 
 A ``requirements.txt`` file is a text file that contains a list of items that are installed by using ``pip install``. You can also specify the version of an item to install. For information about the format of a ``requirements.txt`` file, see `Requirements Files <https://pip.pypa.io/en/stable/user_guide/#requirements-files>`__ in the pip documentation.
 
+-----
+
 Create an Estimator
 ===================
 
@@ -161,7 +163,7 @@ directories ('train' and 'test').
                       'test': 's3://my-data-bucket/path/to/my/test/data'})
 
 
-
+-----
 
 Call the fit Method
 ===================
@@ -196,6 +198,7 @@ fit Optional Arguments
 - ``logs``: Defaults to True, whether to show logs produced by training
   job in the Python session. Only meaningful when wait is True.
 
+-----
 
 Distributed PyTorch Training
 ============================
@@ -267,7 +270,7 @@ during the PyTorch DDP initialization.
 For more information about setting up PyTorch DDP in your training script,
 see `Getting Started with Distributed Data Parallel
 <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_ in the
-PyTorch documentation.
+*PyTorch documentation*.
 
 The following example shows how to run a PyTorch DDP training in SageMaker
 using two ``ml.p4d.24xlarge`` instances:
@@ -292,6 +295,64 @@ using two ``ml.p4d.24xlarge`` instances:
 
     pt_estimator.fit("s3://bucket/path/to/training/data")
 
+SMDDP collectives for PyTorch DDP
+---------------------------------
+
+Starting with PyTorch 1.12.1, when you run PyTorch DDP training jobs on SageMaker,
+you can benefit from a performance gain by using *SMDDP collectives*
+with *no code changes* in your PyTorch training script.
+*SMDDP collectives* are the
+collective communication primitives offered by the `SageMaker data parallelism library
+<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html>`_.
+The SMDDP collectives provide a GPU-based
+AllReduce operation optimized for the AWS network infrastructure and the Amazon EC2
+instance topology. These collectives are designed to address suboptimal performance issues
+in PyTorch DDP distributed training when using NCCL, the general-purpose collective communications library
+from NVIDIA, on the AWS infrastructure.
+
+The SMDDP collectives are on by default for the ``pytorchddp`` distribution option
+if the job configuration is supported. To achieve optimal training with
+faster communication, SageMaker compares the performance of the SMDDP and NCCL
+collectives at runtime and automatically selects the backend that performs better.
+
+This means that, by default, the backend parameter of the ``communication_options`` for
+the ``pytorchddp`` distribution is set to ``auto``. The goal of the automated backend
+selection is to abstract away the overhead of deciding the right communication backend
+for your training job, while ensuring that you get the best possible training performance
+at a lower cost.
+
+The following code shows an example of setting up a PyTorch estimator
+to run a PyTorch DDP training job on two ``ml.p4d.24xlarge`` instances
+with the communication backend set to ``auto``.
+
+.. code:: python
+
+    from sagemaker.pytorch import PyTorch
+
+    pt_estimator = PyTorch(
+        entry_point="train_ptddp.py",
+        role="SageMakerRole",
+        framework_version="1.12.1",
+        py_version="py38",
+        instance_count=2,
+        instance_type="ml.p4d.24xlarge",
+        distribution={
+            "pytorchddp": {
+                "enabled": True,
+                "communication_options": {  # Optional. This is set to auto by default.
+                    "backend": "auto"
+                }
+            }
+        }
+    )
+
+    pt_estimator.fit("s3://bucket/path/to/training/data")
+
+If you want to turn this automatic backend selection off and force the use of NCCL,
+you can explicitly set ``"backend": "nccl"`` as the communication backend.
+
+-----
+
 .. _distributed-pytorch-training-on-trainium:
 
 Distributed Training with PyTorch Neuron on Trn1 instances
@@ -407,6 +468,8 @@ on one ``ml.trn1.2xlarge`` instance and two ``ml.trn1.32xlarge`` instances:
 
     pt_estimator.fit("s3://bucket/path/to/training/data")
 
+-----
+
 *********************
 Deploy PyTorch Models
 *********************

src/sagemaker/pytorch/estimator.py (+46 -3)
@@ -175,6 +175,50 @@ def __init__(
 To learn more, see `Distributed PyTorch Training
 <https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training>`_.
 
+.. note::
+
+  Starting with PyTorch 1.12.1, for PyTorch DDP distributed training jobs,
+  SageMaker implements *SMDDP collectives*, which are
+  communication collective primitives offered by the `SageMaker data parallelism library
+  <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html>`_.
+  For all PyTorch DDP jobs, SageMaker
+  automatically selects the optimal communication backend between the
+  SMDDP collectives and NCCL. This automated selection of
+  the collectives ensures that you get the best possible
+  training performance from faster communication at a lower cost.
+
+  This means that, by default, the backend parameter of the
+  ``communication_options`` for the ``pytorchddp`` distribution is
+  set to ``auto`` as follows:
+
+  .. code:: python
+
+      {
+          "pytorchddp": {
+              "enabled": True,
+              "communication_options": {
+                  "backend": "auto",  # backend is set to auto by default
+              }
+          }
+      }
+
+  In cases where you want to force the use of NCCL, you can explicitly set the communication
+  backend to ``nccl`` as follows:
+
+  .. code:: python
+
+      {
+          "pytorchddp": {
+              "enabled": True,
+              "communication_options": {
+                  "backend": "nccl",  # override the default (auto) to use NCCL
+              }
+          }
+      }
+
+  To learn more about the SMDDP collectives, see also `Distributed PyTorch Training
+  <https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training>`_.
+
 **To enable Torch Distributed (for Trainium instances only):**
 
 .. code:: python
@@ -215,8 +259,7 @@ def __init__(
 <https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#training-with-parameter-servers>`_.
 
 **To enable distributed training with
-`SageMaker Training Compiler <https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html>`_
-for PyTorch:**
+SageMaker Training Compiler for PyTorch:**
 
 .. code:: python
 
@@ -236,7 +279,7 @@ def __init__(
 you must add the ``compiler_config`` parameter and activate SageMaker
 Training Compiler.
 
-compiler_config (:class:`~sagemaker.pytorch.TrainingCompilerConfig`):
+compiler_config (:class:`~sagemaker.pytorch.TrainingCompilerConfig`):
     Configures SageMaker Training Compiler to accelerate training.
 
 **kwargs: Additional kwargs passed to the :class:`~sagemaker.estimator.Framework`
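
For illustration only, and not part of the diff above: a minimal sketch of a full estimator that turns off the automatic backend selection and forces NCCL, combining the estimator setup from the documentation change with the ``"backend": "nccl"`` option described in the docstring note. The entry point, role, and S3 paths are placeholder values taken from the examples above.

.. code:: python

    from sagemaker.pytorch import PyTorch

    # Placeholder entry point, role, and data paths; the framework version,
    # Python version, and instance settings mirror the documented example.
    pt_estimator = PyTorch(
        entry_point="train_ptddp.py",
        role="SageMakerRole",
        framework_version="1.12.1",
        py_version="py38",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        distribution={
            "pytorchddp": {
                "enabled": True,
                "communication_options": {
                    "backend": "nccl"  # turn off automatic selection and force NCCL
                }
            }
        }
    )

    pt_estimator.fit("s3://bucket/path/to/training/data")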
