SageMaker Training on Trainium instances now supports the ``xla``
package through ``torchrun``. With this, you do not need to manually pass ``RANK``,
``WORLD_SIZE``, ``MASTER_ADDR``, and ``MASTER_PORT``. You can launch the training job using the
:class:`sagemaker.pytorch.estimator.PyTorch` estimator class
with the ``torch_distributed`` option as the distribution strategy.

.. note::

   This ``torch_distributed`` support is available
   in the SageMaker Trainium (trn1) PyTorch Deep Learning Containers starting v1.11.0.
   To find a complete list of supported versions of PyTorch Neuron, see
   `Neuron Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers>`_
   in the *AWS Deep Learning Containers GitHub repository*.

   SageMaker Debugger and Profiler are currently not supported with Trainium instances.

Adapt Your Training Script to Initialize with the XLA backend
-------------------------------------------------------------

SageMaker takes care of ``'MASTER_ADDR'`` and ``'MASTER_PORT'`` for you via ``torchrun``,
so you do not need to set them in your training script.
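
As a minimal sketch of what this means for your script: because ``torchrun`` exports the
rendezvous environment variables, initializing the process group needs only the backend name.
The ``init_distributed`` helper below is a hypothetical name for illustration, and it assumes
the ``torch`` and ``torch-xla`` packages that ship in the Trainium Deep Learning Container.

.. code:: python

   import os

   def init_distributed():
       # torchrun (launched by SageMaker) exports RANK, WORLD_SIZE,
       # MASTER_ADDR, and MASTER_PORT for every worker, so the
       # training script never sets them itself.
       # torch and torch-xla are assumed to come from the Trainium DLC.
       import torch.distributed as dist

       dist.init_process_group("xla")  # only the backend name is needed

   if __name__ == "__main__":
       # Show the rendezvous variables torchrun is expected to provide.
       for name in ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
           print(name, os.environ.get(name, "<set by torchrun>"))
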
For detailed documentation about modifying your training script for Trainium, see `Multi-worker data-parallel MLP training using torchrun <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html?highlight=torchrun#multi-worker-data-parallel-mlp-training-using-torchrun>`_ in the *AWS Neuron Documentation*.
**Currently supported backends:**
- ``xla`` for Trainium (Trn1) instances
For up-to-date information on supported backends for Trainium instances, see `AWS Neuron Documentation <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html>`_.
Launching a Distributed Training Job on Trainium
------------------------------------------------
You can run multi-node distributed PyTorch training jobs on Trainium instances using the
:class:`sagemaker.pytorch.estimator.PyTorch` estimator class
with the ``torch_distributed`` option as the distribution strategy.
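
For example, a two-node job might be launched as sketched below. The entry point name,
IAM role ARN, and instance settings are placeholders to adapt to your account; the
``framework_version`` follows the v1.11.0 container noted above.

.. code:: python

   from sagemaker.pytorch import PyTorch

   estimator = PyTorch(
       entry_point="train.py",  # placeholder: your adapted training script
       role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role ARN
       framework_version="1.11.0",
       py_version="py38",
       instance_count=2,                 # multi-node
       instance_type="ml.trn1.32xlarge",
       distribution={"torch_distributed": {"enabled": True}},
   )

   estimator.fit()

Setting ``distribution={"torch_distributed": {"enabled": True}}`` is what makes SageMaker
launch your script through ``torchrun`` on every node.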