Commit bc5f8e1

Merge branch 'master' into trainium-new
2 parents: 5e6372c + 5adc2d3

45 files changed, +2462 −114 lines

.gitignore  (+3 −1)

@@ -29,4 +29,6 @@ venv/
 env/
 .vscode/
 **/tmp
-.python-version
+.python-version
+**/_repack_model.py
+**/_repack_script_launcher.sh

CHANGELOG.md  (+15 −0)

@@ -1,5 +1,20 @@
 # Changelog

+## v2.113.0 (2022-10-21)
+
+### Features
+
+* support torch_distributed distribution for Trainium instances
+
+### Bug Fixes and Other Changes
+
+* bump apache-airflow from 2.4.0 to 2.4.1 in /requirements/extras
+
+### Documentation Changes
+
+* fix kwargs and descriptions of the smdmp checkpoint function
+* add the doc for the MonitorBatchTransformStep
+
 ## v2.112.2 (2022-10-11)

 ### Bug Fixes and Other Changes

VERSION  (+1 −1)

@@ -1 +1 @@
-2.112.3.dev0
+2.113.1.dev0

doc/frameworks/pytorch/using_pytorch.rst  (+101 −1)

@@ -262,7 +262,8 @@ during the PyTorch DDP initialization.

 .. note::

-  The SageMaker PyTorch estimator doesn’t use ``torchrun`` for distributed training.
+  The SageMaker PyTorch estimator operates ``mpirun`` in the backend.
+  It doesn’t use ``torchrun`` for distributed training.

 For more information about setting up PyTorch DDP in your training script,
 see `Getting Started with Distributed Data Parallel
@@ -292,7 +293,106 @@ using two ``ml.p4d.24xlarge`` instances:

     pt_estimator.fit("s3://bucket/path/to/training/data")

+.. _distributed-pytorch-training-on-trainium:

+Distributed PyTorch Training on Trainium
+========================================
+
+SageMaker Training on Trainium instances now supports the ``xla``
+package through ``torchrun``. With this, you do not need to manually pass RANK,
+WORLD_SIZE, MASTER_ADDR, and MASTER_PORT. You can launch the training job using the
+:class:`sagemaker.pytorch.estimator.PyTorch` estimator class
+with the ``torch_distributed`` option as the distribution strategy.
+
+.. note::
+
+  This ``torch_distributed`` support is available
+  in the SageMaker Trainium (trn1) PyTorch Deep Learning Containers starting v1.11.0.
+  To find a complete list of supported versions of PyTorch Neuron, see `Neuron Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers>`_ in the *AWS Deep Learning Containers GitHub repository*.
+
+  SageMaker Debugger and Profiler are currently not supported with Trainium instances.
+
+Adapt Your Training Script to Initialize with the XLA backend
+--------------------------------------------------------------
+
+To initialize distributed training in your script, call
+`torch.distributed.init_process_group
+<https://pytorch.org/docs/master/distributed.html#torch.distributed.init_process_group>`_
+with the ``xla`` backend as shown below.
+
+.. code:: python
+
+    import torch.distributed as dist
+
+    dist.init_process_group('xla')
+
+SageMaker takes care of ``'MASTER_ADDR'`` and ``'MASTER_PORT'`` for you via ``torchrun``
+
+For detailed documentation about modifying your training script for Trainium, see `Multi-worker data-parallel MLP training using torchrun <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html?highlight=torchrun#multi-worker-data-parallel-mlp-training-using-torchrun>`_ in the *AWS Neuron Documentation*.
+
+**Currently Supported backends:**
+
+- ``xla`` for Trainium (Trn1) instances
+
+For up-to-date information on supported backends for Trainium instances, see `AWS Neuron Documentation <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html>`_.
+
+Launching a Distributed Training Job on Trainium
+------------------------------------------------
+
+You can run multi-node distributed PyTorch training jobs on Trainium instances using the
+:class:`sagemaker.pytorch.estimator.PyTorch` estimator class.
+With ``instance_count=1``, the estimator submits a
+single-node training job to SageMaker; with ``instance_count`` greater
+than one, a multi-node training job is launched.
+
+With the ``torch_distributed`` option, the SageMaker PyTorch estimator runs a SageMaker
+training container for PyTorch Neuron, sets up the environment, and launches
+the training job using the ``torchrun`` command on each worker with the given information.
+
+**Examples**
+
+The following examples show how to run a PyTorch training using ``torch_distributed`` in SageMaker
+on one ``ml.trn1.2xlarge`` instance and two ``ml.trn1.32xlarge`` instances:
+
+.. code:: python
+
+    from sagemaker.pytorch import PyTorch
+
+    pt_estimator = PyTorch(
+        entry_point="train_ptddp.py",
+        role="SageMakerRole",
+        framework_version="1.11.0",
+        py_version="py38",
+        instance_count=1,
+        instance_type="ml.trn1.2xlarge",
+        distribution={
+            "torch_distributed": {
+                "enabled": True
+            }
+        }
+    )
+
+    pt_estimator.fit("s3://bucket/path/to/training/data")
+
+.. code:: python
+
+    from sagemaker.pytorch import PyTorch
+
+    pt_estimator = PyTorch(
+        entry_point="train_ptddp.py",
+        role="SageMakerRole",
+        framework_version="1.11.0",
+        py_version="py38",
+        instance_count=2,
+        instance_type="ml.trn1.32xlarge",
+        distribution={
+            "torch_distributed": {
+                "enabled": True
+            }
+        }
+    )
+
+    pt_estimator.fit("s3://bucket/path/to/training/data")

 *********************
 Deploy PyTorch Models
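
To complement the "Adapt Your Training Script to Initialize with the XLA backend" section added above, here is a minimal end-to-end sketch of what such an entry point could look like. This sketch is illustrative and not part of the commit: it assumes the ``torch_xla`` package shipped in the PyTorch Neuron containers, the file name ``train_ptddp.py`` is borrowed from the estimator examples above, and the model, data, and training loop are placeholders.

.. code:: python

    # Illustrative contents of an entry point such as train_ptddp.py.
    # SageMaker launches the script via torchrun, which supplies RANK,
    # WORLD_SIZE, MASTER_ADDR, and MASTER_PORT, so none are wired manually.
    import torch
    import torch.distributed as dist
    import torch_xla.core.xla_model as xm  # assumes the PyTorch Neuron container

    def main():
        dist.init_process_group("xla")        # XLA backend, as documented above
        rank = dist.get_rank()
        world_size = dist.get_world_size()

        device = xm.xla_device()              # one NeuronCore per process
        model = torch.nn.Linear(32, 2).to(device)        # placeholder model
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        for step in range(10):                # placeholder data and loop
            inputs = torch.randn(8, 32).to(device)
            targets = torch.randn(8, 2).to(device)
            loss = torch.nn.functional.mse_loss(model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()
            xm.optimizer_step(optimizer)      # all-reduce gradients, then step
            if rank == 0:
                print(f"step={step} world_size={world_size} loss={loss.item():.4f}")

    if __name__ == "__main__":
        main()

A script along these lines would simply be passed as the ``entry_point`` in the estimator examples above, with no further changes to the ``torch_distributed`` distribution setting.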

requirements/extras/test_requirements.txt  (+1 −1)

@@ -11,7 +11,7 @@ contextlib2==21.6.0
 awslogs==0.14.0
 black==22.3.0
 stopit==1.1.2
-apache-airflow==2.4.0
+apache-airflow==2.4.1
 apache-airflow-providers-amazon==4.0.0
 attrs==22.1.0
 fabric==2.6.0
