
Commit c23353a

documentation: make changes in the pytorch trn1 distributed training doc

1 parent 1a39422
1 file changed: +27 -13 lines

doc/frameworks/pytorch/using_pytorch.rst

@@ -295,22 +295,36 @@ using two ``ml.p4d.24xlarge`` instances:
 
 .. _distributed-pytorch-training-on-trainium:
 
-Distributed PyTorch Training on Trainium
-========================================
-
-SageMaker Training on Trainium instances now supports the ``xla``
-package through ``torchrun``. With this, you do not need to manually pass RANK,
-WORLD_SIZE, MASTER_ADDR, and MASTER_PORT. You can launch the training job using the
+Distributed Training with PyTorch Neuron on Trn1 instances
+==========================================================
+
+SageMaker Training supports Amazon EC2 Trn1 instances powered by
+`AWS Trainium <https://aws.amazon.com/machine-learning/trainium/>`_ devices,
+the second-generation purpose-built machine learning accelerator from AWS.
+Each Trn1 instance consists of up to 16 Trainium devices, and each
+Trainium device consists of two `NeuronCores
+<https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trn1-arch.html#trainium-architecture>`_,
+described in the *AWS Neuron Documentation*.
+
+You can run distributed training jobs on Trn1 instances.
+SageMaker supports the ``xla`` package through ``torchrun``.
+With this, you do not need to manually pass ``RANK``,
+``WORLD_SIZE``, ``MASTER_ADDR``, and ``MASTER_PORT``.
+You can launch the training job using the
 :class:`sagemaker.pytorch.estimator.PyTorch` estimator class
 with the ``torch_distributed`` option as the distribution strategy.
 
 .. note::
 
     This ``torch_distributed`` support is available
-    in the SageMaker Trainium (trn1) PyTorch Deep Learning Containers starting v1.11.0.
-    To find a complete list of supported versions of PyTorch Neuron, see `Neuron Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers>`_ in the *AWS Deep Learning Containers GitHub repository*.
+    in the AWS Deep Learning Containers for PyTorch Neuron starting v1.11.0.
+    To find a complete list of supported versions of PyTorch Neuron, see
+    `Neuron Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers>`_
+    in the *AWS Deep Learning Containers GitHub repository*.
+
+.. note::
 
-    SageMaker Debugger and Profiler are currently not supported with Trainium instances.
+    SageMaker Debugger is currently not supported with Trn1 instances.
 
 Adapt Your Training Script to Initialize with the XLA backend
 -------------------------------------------------------------
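
This hunk ends at the heading for adapting the training script. For orientation, the adaptation that heading refers to amounts to initializing the process group with the ``xla`` backend; below is a minimal sketch of such a script, assuming the ``torch_xla`` package that ships with the PyTorch Neuron containers (model and training-loop code omitted):

    # Sketch: initialize torch.distributed with the XLA backend.
    import torch.distributed as dist
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_backend  # registers the "xla" backend

    def main():
        # Under torchrun, RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT
        # are set automatically, so no manual wiring is needed here.
        dist.init_process_group("xla")
        device = xm.xla_device()  # the NeuronCore assigned to this worker
        # ... build the model, move it to `device`, and train ...

    if __name__ == "__main__":
        main()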
@@ -334,12 +348,12 @@ For detailed documentation about modifying your training script for Trainium, see
 
 - ``xla`` for Trainium (Trn1) instances
 
-For up-to-date information on supported backends for Trainium instances, see `AWS Neuron Documentation <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html>`_.
+For up-to-date information on supported backends for Trn1 instances, see `AWS Neuron Documentation <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html>`_.
 
 Launching a Distributed Training Job on Trainium
 ------------------------------------------------
 
-You can run multi-node distributed PyTorch training jobs on Trainium instances using the
+You can run multi-node distributed PyTorch training jobs on Trn1 instances using the
 :class:`sagemaker.pytorch.estimator.PyTorch` estimator class.
 With ``instance_count=1``, the estimator submits a
 single-node training job to SageMaker; with ``instance_count`` greater
@@ -359,7 +373,7 @@ on one ``ml.trn1.2xlarge`` instance and two ``ml.trn1.32xlarge`` instances:
     from sagemaker.pytorch import PyTorch
 
     pt_estimator = PyTorch(
-        entry_point="train_ptddp.py",
+        entry_point="train_torch_distributed.py",
         role="SageMakerRole",
         framework_version="1.11.0",
         py_version="py38",
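
The diff view truncates this snippet after ``py_version``. For reference, a sketch of the complete single-node call the section describes, assuming the ``torch_distributed`` distribution option named above (the S3 input path is a placeholder):

    from sagemaker.pytorch import PyTorch

    # Single-node job on one ml.trn1.2xlarge instance.
    pt_estimator = PyTorch(
        entry_point="train_torch_distributed.py",
        role="SageMakerRole",
        framework_version="1.11.0",
        py_version="py38",
        instance_count=1,
        instance_type="ml.trn1.2xlarge",
        distribution={"torch_distributed": {"enabled": True}},
    )

    pt_estimator.fit("s3://your-bucket/path/to/training-data")  # placeholder URI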
@@ -379,7 +393,7 @@ on one ``ml.trn1.2xlarge`` instance and two ``ml.trn1.32xlarge`` instances:
     from sagemaker.pytorch import PyTorch
 
     pt_estimator = PyTorch(
-        entry_point="train_ptddp.py",
+        entry_point="train_torch_distributed.py",
         role="SageMakerRole",
         framework_version="1.11.0",
         py_version="py38",
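
The second snippet is truncated the same way; per the section text, the multi-node variant differs only in the instance count and type:

    from sagemaker.pytorch import PyTorch

    # Multi-node job on two ml.trn1.32xlarge instances; everything else
    # matches the single-node sketch above.
    pt_estimator = PyTorch(
        entry_point="train_torch_distributed.py",
        role="SageMakerRole",
        framework_version="1.11.0",
        py_version="py38",
        instance_count=2,
        instance_type="ml.trn1.32xlarge",
        distribution={"torch_distributed": {"enabled": True}},
    )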
