You can run distributed training jobs on Trn1 instances.
SageMaker supports the ``xla`` package through ``torchrun``.
With this, you do not need to manually pass ``RANK``,
``WORLD_SIZE``, ``MASTER_ADDR``, and ``MASTER_PORT``.
You can launch the training job using the
:class:`sagemaker.pytorch.estimator.PyTorch` estimator class
with the ``torch_distributed`` option as the distribution strategy.
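
For illustration, the following is a minimal sketch of what the training-script
side of such a job might look like. The script body below is not taken from the
SageMaker documentation; it assumes the ``torch-xla`` package that ships in the
PyTorch Neuron containers and a hypothetical entry point file.

.. code-block:: python

    # Hypothetical training-script entry point for a Trn1 job launched through
    # ``torchrun``. RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are already
    # set in the environment, so the process group can be initialized directly.
    import torch.distributed as dist
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_backend  # registers the ``xla`` backend

    dist.init_process_group("xla")  # reads the torchrun-provided env vars

    device = xm.xla_device()        # XLA device backed by a NeuronCore
    rank, world_size = dist.get_rank(), dist.get_world_size()
    print(f"worker {rank} of {world_size} running on {device}")
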
.. note::
  This ``torch_distributed`` support is available
  in the AWS Deep Learning Containers for PyTorch Neuron starting v1.11.0.
  To find a complete list of supported versions of PyTorch Neuron, see
  `Neuron Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers>`_
  in the *AWS Deep Learning Containers GitHub repository*.

Supported backends:

- ``xla`` for Trainium (Trn1) instances

For up-to-date information on supported backends for Trn1 instances, see `AWS Neuron Documentation <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html>`_.

Launching a Distributed Training Job on Trainium
------------------------------------------------

You can run multi-node distributed PyTorch training jobs on Trn1 instances using the
:class:`sagemaker.pytorch.estimator.PyTorch` estimator class.
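
The following is a hedged sketch of such a launch. The entry point name, IAM
role, and S3 input path are placeholders, and the framework and Python versions
are assumptions based on the v1.11.0 PyTorch Neuron containers noted above.

.. code-block:: python

    from sagemaker.pytorch import PyTorch

    # Placeholder entry point, role, and data path -- substitute your own values.
    pt_estimator = PyTorch(
        entry_point="train.py",
        role="SageMakerRole",
        framework_version="1.11.0",        # assumed PyTorch Neuron container version
        py_version="py38",
        instance_count=2,                  # a value greater than 1 makes this multi-node
        instance_type="ml.trn1.32xlarge",
        distribution={
            "torch_distributed": {
                "enabled": True
            }
        },
    )

    pt_estimator.fit("s3://your-bucket/path/to/training/data")

Because ``torch_distributed`` is enabled, SageMaker starts the entry point through
``torchrun`` on each node, which is why the environment variables listed earlier in
this section do not need to be passed manually.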