@@ -140,6 +140,8 @@ The function of installing packages using ``requirements.txt`` is supported for

A ``requirements.txt`` file is a text file that contains a list of items that are installed by using ``pip install``. You can also specify the version of an item to install. For information about the format of a ``requirements.txt`` file, see `Requirements Files <https://pip.pypa.io/en/stable/user_guide/#requirements-files>`__ in the pip documentation.
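For reference, a minimal ``requirements.txt`` lists one item per line, optionally pinned to a version or version range (the package names and versions below are illustrative, not requirements of SageMaker):

.. code:: text

    numpy==1.23.5
    scipy>=1.9,<2.0
    requests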
+ -----
+

Create an Estimator
===================
@@ -161,7 +163,7 @@ directories ('train' and 'test').

                      'test': 's3://my-data-bucket/path/to/my/test/data'})

-
+ -----

Call the fit Method
===================
@@ -196,6 +198,7 @@ fit Optional Arguments

- ``logs``: Defaults to True, whether to show logs produced by training
  job in the Python session. Only meaningful when wait is True.

+ -----

Distributed PyTorch Training
============================
@@ -267,7 +270,7 @@ during the PyTorch DDP initialization.

For more information about setting up PyTorch DDP in your training script,
see `Getting Started with Distributed Data Parallel
<https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_ in the
- PyTorch documentation.
+ *PyTorch documentation*.

The following example shows how to run a PyTorch DDP training in SageMaker
using two ``ml.p4d.24xlarge`` instances:
@@ -292,6 +295,64 @@ using two ``ml.p4d.24xlarge`` instances:

    pt_estimator.fit("s3://bucket/path/to/training/data")

+ SMDDP collectives for PyTorch DDP
+ ---------------------------------
+
+ Starting with PyTorch 1.12.1, when you run PyTorch DDP training jobs on SageMaker,
+ you can benefit from a performance gain by using *SMDDP collectives*
+ with *no code changes* in your PyTorch training script.
+ *SMDDP collectives* are the
+ collective communication primitives offered by the `SageMaker data parallelism library
+ <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html>`_.
+ The SMDDP collectives provide a GPU-based
+ AllReduce operation optimized for the AWS network infrastructure and the Amazon EC2
+ instance topology. These collectives are designed to address suboptimal performance
+ in PyTorch DDP distributed training when using NCCL, the general-purpose collective
+ communications library from NVIDIA, on the AWS infrastructure.
+
+ The SMDDP collectives are on by default for the ``pytorchddp`` distribution option
+ if the job configuration is supported. To achieve optimal training with
+ faster communication, SageMaker compares the performance of the SMDDP and NCCL
+ collectives at runtime and automatically selects the backend that performs better.
+
+ This means that, by default, the ``backend`` parameter of the ``communication_options`` for
+ the ``pytorchddp`` distribution is set to ``auto``. The goal of the automated backend
+ selection is to abstract away the overhead of deciding the right communication backend
+ for your training job, while ensuring that you get the best possible training performance
+ at a lower cost.
+
+ The following code shows an example of setting up a PyTorch estimator
+ to run a PyTorch DDP training job on two ``ml.p4d.24xlarge`` instances
+ with the communication backend set to ``auto``.
+
+ .. code:: python
+
+     from sagemaker.pytorch import PyTorch
+
+     pt_estimator = PyTorch(
+         entry_point="train_ptddp.py",
+         role="SageMakerRole",
+         framework_version="1.12.1",
+         py_version="py38",
+         instance_count=2,
+         instance_type="ml.p4d.24xlarge",
+         distribution={
+             "pytorchddp": {
+                 "enabled": True,
+                 "communication_options": {  # Optional. This is set to auto by default.
+                     "backend": "auto"
+                 }
+             }
+         }
+     )
+
+     pt_estimator.fit("s3://bucket/path/to/training/data")
+
+ If you want to turn automatic backend selection off and force the use of NCCL,
+ you can explicitly set ``"backend": "nccl"`` as the communication backend.
+
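Because the ``distribution`` argument is a plain Python dictionary, switching between the two settings is a one-value change. The following minimal sketch, independent of SageMaker itself, shows the two configurations side by side:

.. code:: python

    # Distribution configuration with automatic backend selection (the default).
    distribution_auto = {
        "pytorchddp": {
            "enabled": True,
            "communication_options": {"backend": "auto"},
        }
    }

    # Identical configuration, but forcing the NCCL collectives.
    distribution_nccl = {
        "pytorchddp": {
            "enabled": True,
            "communication_options": {"backend": "nccl"},
        }
    }

    # Only the backend value differs between the two configurations.
    assert distribution_auto["pytorchddp"]["communication_options"]["backend"] == "auto"
    assert distribution_nccl["pytorchddp"]["communication_options"]["backend"] == "nccl"

Either dictionary can be passed as the ``distribution`` argument of the estimator shown above.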
+ -----
+

.. _distributed-pytorch-training-on-trainium:

Distributed Training with PyTorch Neuron on Trn1 instances
==========================================================
@@ -407,6 +468,8 @@ on one ``ml.trn1.2xlarge`` instance and two ``ml.trn1.32xlarge`` instances:

    pt_estimator.fit("s3://bucket/path/to/training/data")

+ -----
+

*********************
Deploy PyTorch Models
*********************