
Commit c23353a

documentation: make changes in the pytorch trn1 distributed training doc

1 parent 1a39422
1 file changed: +27 -13 lines

doc/frameworks/pytorch/using_pytorch.rst

@@ -295,22 +295,36 @@ using two ``ml.p4d.24xlarge`` instances:
 
 .. _distributed-pytorch-training-on-trainium:
 
-Distributed PyTorch Training on Trainium
-========================================
-
-SageMaker Training on Trainium instances now supports the ``xla``
-package through ``torchrun``. With this, you do not need to manually pass RANK,
-WORLD_SIZE, MASTER_ADDR, and MASTER_PORT. You can launch the training job using the
+Distributed Training with PyTorch Neuron on Trn1 instances
+==========================================================
+
+SageMaker Training supports Amazon EC2 Trn1 instances powered by
+`AWS Trainium <https://aws.amazon.com/machine-learning/trainium/>`_ devices,
+the second-generation purpose-built machine learning accelerator from AWS.
+Each Trn1 instance consists of up to 16 Trainium devices, and each
+Trainium device consists of two `NeuronCores
+<https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trn1-arch.html#trainium-architecture>`_,
+described in the *AWS Neuron Documentation*.
+
+You can run distributed training jobs on Trn1 instances.
+SageMaker supports the ``xla`` package through ``torchrun``.
+With this, you do not need to manually pass ``RANK``,
+``WORLD_SIZE``, ``MASTER_ADDR``, and ``MASTER_PORT``.
+You can launch the training job using the
 :class:`sagemaker.pytorch.estimator.PyTorch` estimator class
 with the ``torch_distributed`` option as the distribution strategy.
 
 .. note::
 
     This ``torch_distributed`` support is available
-    in the SageMaker Trainium (trn1) PyTorch Deep Learning Containers starting v1.11.0.
-    To find a complete list of supported versions of PyTorch Neuron, see `Neuron Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers>`_ in the *AWS Deep Learning Containers GitHub repository*.
+    in the AWS Deep Learning Containers for PyTorch Neuron starting v1.11.0.
+    To find a complete list of supported versions of PyTorch Neuron, see
+    `Neuron Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers>`_
+    in the *AWS Deep Learning Containers GitHub repository*.
+
+.. note::
 
-    SageMaker Debugger and Profiler are currently not supported with Trainium instances.
+    SageMaker Debugger is currently not supported with Trn1 instances.
 
 Adapt Your Training Script to Initialize with the XLA backend
 -------------------------------------------------------------
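
This hunk ends at the heading for adapting the training script. For orientation, the adaptation that heading refers to amounts to initializing the process group with the ``xla`` backend; below is a minimal sketch of such a script, assuming the ``torch_xla`` package that ships with the PyTorch Neuron containers (model and training-loop code omitted):

    # Sketch: initialize torch.distributed with the XLA backend.
    import torch.distributed as dist
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_backend  # registers the "xla" backend

    def main():
        # Under torchrun, RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT
        # are set automatically, so no manual wiring is needed here.
        dist.init_process_group("xla")
        device = xm.xla_device()  # the NeuronCore assigned to this worker
        # ... build the model, move it to `device`, and train ...

    if __name__ == "__main__":
        main()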
@@ -334,12 +348,12 @@ For detailed documentation about modifying your training script for Trainium, see
 
 - ``xla`` for Trainium (Trn1) instances
 
-For up-to-date information on supported backends for Trainium instances, see `AWS Neuron Documentation <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html>`_.
+For up-to-date information on supported backends for Trn1 instances, see `AWS Neuron Documentation <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html>`_.
 
 Launching a Distributed Training Job on Trainium
 ------------------------------------------------
 
-You can run multi-node distributed PyTorch training jobs on Trainium instances using the
+You can run multi-node distributed PyTorch training jobs on Trn1 instances using the
 :class:`sagemaker.pytorch.estimator.PyTorch` estimator class.
 With ``instance_count=1``, the estimator submits a
 single-node training job to SageMaker; with ``instance_count`` greater
@@ -359,7 +373,7 @@ on one ``ml.trn1.2xlarge`` instance and two ``ml.trn1.32xlarge`` instances:
     from sagemaker.pytorch import PyTorch
 
     pt_estimator = PyTorch(
-        entry_point="train_ptddp.py",
+        entry_point="train_torch_distributed.py",
         role="SageMakerRole",
         framework_version="1.11.0",
         py_version="py38",
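
The diff view truncates this snippet after ``py_version``. For reference, a sketch of the complete single-node call the section describes, assuming the ``torch_distributed`` distribution option named above (the S3 input path is a placeholder):

    from sagemaker.pytorch import PyTorch

    # Single-node job on one ml.trn1.2xlarge instance.
    pt_estimator = PyTorch(
        entry_point="train_torch_distributed.py",
        role="SageMakerRole",
        framework_version="1.11.0",
        py_version="py38",
        instance_count=1,
        instance_type="ml.trn1.2xlarge",
        distribution={"torch_distributed": {"enabled": True}},
    )

    pt_estimator.fit("s3://your-bucket/path/to/training-data")  # placeholder URI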
@@ -379,7 +393,7 @@ on one ``ml.trn1.2xlarge`` instance and two ``ml.trn1.32xlarge`` instances:
     from sagemaker.pytorch import PyTorch
 
     pt_estimator = PyTorch(
-        entry_point="train_ptddp.py",
+        entry_point="train_torch_distributed.py",
         role="SageMakerRole",
         framework_version="1.11.0",
         py_version="py38",
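
The second snippet is truncated the same way; per the section text, the multi-node variant differs only in the instance count and type:

    from sagemaker.pytorch import PyTorch

    # Multi-node job on two ml.trn1.32xlarge instances; everything else
    # matches the single-node sketch above.
    pt_estimator = PyTorch(
        entry_point="train_torch_distributed.py",
        role="SageMakerRole",
        framework_version="1.11.0",
        py_version="py38",
        instance_count=2,
        instance_type="ml.trn1.32xlarge",
        distribution={"torch_distributed": {"enabled": True}},
    )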
