diff --git a/doc/frameworks/pytorch/using_pytorch.rst b/doc/frameworks/pytorch/using_pytorch.rst
index 6281de2635..725f34aa5a 100644
--- a/doc/frameworks/pytorch/using_pytorch.rst
+++ b/doc/frameworks/pytorch/using_pytorch.rst
@@ -293,107 +293,6 @@ using two ``ml.p4d.24xlarge`` instances:
 
     pt_estimator.fit("s3://bucket/path/to/training/data")
 
-.. _distributed-pytorch-training-on-trainium:
-
-Distributed PyTorch Training on Trainium
-========================================
-
-SageMaker Training on Trainium instances now supports the ``xla``
-backend through ``torchrun``. With this, you do not need to manually pass
-``RANK``, ``WORLD_SIZE``, ``MASTER_ADDR``, and ``MASTER_PORT``. You can launch the
-training job using the :class:`sagemaker.pytorch.estimator.PyTorch` estimator class
-with the ``torch_distributed`` option as the distribution strategy.
-
-.. note::
-
-    This ``torch_distributed`` support is available
-    in the SageMaker Trainium (trn1) PyTorch Deep Learning Containers starting v1.11.0.
-    To find a complete list of supported versions of PyTorch Neuron, see `Neuron Containers `_ in the *AWS Deep Learning Containers GitHub repository*.
-
-    SageMaker Debugger and Profiler are currently not supported with Trainium instances.
-
-Adapt Your Training Script to Initialize with the XLA Backend
--------------------------------------------------------------
-
-To initialize distributed training in your script, call
-`torch.distributed.init_process_group `_
-with the ``xla`` backend as shown below.
-
-.. code:: python
-
-    import torch.distributed as dist
-
-    dist.init_process_group('xla')
-
-SageMaker sets ``MASTER_ADDR`` and ``MASTER_PORT`` for you via ``torchrun``.
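-
-A complete script also moves the model and data to the XLA device and steps the
-optimizer through ``torch_xla``. The following is a minimal sketch, not a
-definitive implementation: it assumes the ``torch_xla`` package shipped with the
-PyTorch Neuron container, and the model, data, and hyperparameters are
-placeholders.
-
-.. code:: python
-
-    import torch
-    import torch.distributed as dist
-    import torch_xla.core.xla_model as xm
-    import torch_xla.distributed.xla_backend  # registers the 'xla' backend
-
-    # torchrun supplies the rank, world size, and rendezvous information.
-    dist.init_process_group('xla')
-
-    # The XLA device assigned to this worker (a NeuronCore on Trainium).
-    device = xm.xla_device()
-
-    model = torch.nn.Linear(10, 2).to(device)  # placeholder model
-    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
-
-    for step in range(10):
-        optimizer.zero_grad()
-        inputs = torch.randn(8, 10).to(device)        # placeholder batch
-        labels = torch.randint(0, 2, (8,)).to(device)
-        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
-        loss.backward()
-        xm.optimizer_step(optimizer)  # all-reduce gradients, then optimizer.step()
-        xm.mark_step()                # execute the queued XLA graph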
-
-For detailed documentation about modifying your training script for Trainium, see `Multi-worker data-parallel MLP training using torchrun `_ in the *AWS Neuron Documentation*.
-
-**Currently supported backends:**
-
-- ``xla`` for Trainium (Trn1) instances
-
-For up-to-date information on supported backends for Trainium instances, see the `AWS Neuron Documentation `_.
-
-Launching a Distributed Training Job on Trainium
-------------------------------------------------
-
-You can run multi-node distributed PyTorch training jobs on Trainium instances using the
-:class:`sagemaker.pytorch.estimator.PyTorch` estimator class.
-With ``instance_count=1``, the estimator submits a
-single-node training job to SageMaker; with ``instance_count`` greater
-than one, it launches a multi-node training job.
-
-With the ``torch_distributed`` option, the SageMaker PyTorch estimator runs a SageMaker
-training container for PyTorch Neuron, sets up the environment, and launches
-the training job via the ``torchrun`` command on each worker.
-
-**Examples**
-
-The following examples show how to run PyTorch training with ``torch_distributed`` in SageMaker
-on one ``ml.trn1.2xlarge`` instance and on two ``ml.trn1.32xlarge`` instances:
-
-.. code:: python
-
-    from sagemaker.pytorch import PyTorch
-
-    pt_estimator = PyTorch(
-        entry_point="train_ptddp.py",
-        role="SageMakerRole",
-        framework_version="1.11.0",
-        py_version="py38",
-        instance_count=1,
-        instance_type="ml.trn1.2xlarge",
-        distribution={
-            "torch_distributed": {
-                "enabled": True
-            }
-        }
-    )
-
-    pt_estimator.fit("s3://bucket/path/to/training/data")
-
-.. code:: python
-
-    from sagemaker.pytorch import PyTorch
-
-    pt_estimator = PyTorch(
-        entry_point="train_ptddp.py",
-        role="SageMakerRole",
-        framework_version="1.11.0",
-        py_version="py38",
-        instance_count=2,
-        instance_type="ml.trn1.32xlarge",
-        distribution={
-            "torch_distributed": {
-                "enabled": True
-            }
-        }
-    )
-
-    pt_estimator.fit("s3://bucket/path/to/training/data")
-
 *********************
 Deploy PyTorch Models
 *********************