@@ -6,39 +6,36 @@ SageMaker's distributed data parallel library extends SageMaker’s training
capabilities on deep learning models with near-linear scaling efficiency,
achieving fast time-to-train with minimal code changes.

- - optimizes your training job for AWS network infrastructure and EC2 instance topology.
- - takes advantage of gradient update to communicate between nodes with a custom AllReduce algorithm.
-
When training a model on a large amount of data, machine learning practitioners
will often turn to distributed training to reduce the time to train.
In some cases, where time is of the essence,
the business requirement is to finish training as quickly as possible or at
least within a constrained time period.
Then, distributed training is scaled to use a cluster of multiple nodes,
meaning not just multiple GPUs in a computing instance, but multiple instances
- with multiple GPUs. As the cluster size increases, so does the significant drop
- in performance. This drop in performance is primarily caused the communications
- overhead between nodes in a cluster.
+ with multiple GPUs. However, as the cluster size increases, you may see a significant drop
+ in performance due to communications overhead between nodes in a cluster.

- .. important::
-    The distributed data parallel library only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
-    ``Estimator`` with ``dataparallel`` parameter ``enabled`` set to ``True``,
-    it uses CUDA 11. When you extend or customize your own training image
-    you must use a CUDA 11 base image. See
-    `SageMaker Python SDK's distributed data parallel library APIs
-    <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__
-    for more information.
+ SageMaker's distributed data parallel library addresses communications overhead in two ways:

- .. rubric:: Customize your training script
+ 1. The library performs AllReduce, a key operation during distributed training that is responsible for a
+    large portion of communication overhead.
+ 2. The library performs optimized node-to-node communication by fully utilizing AWS’s network
+    infrastructure and Amazon EC2 instance topology.

- To customize your own training script, you will need the following:
+ To learn more about the core features of this library, see
+ `Introduction to SageMaker's Distributed Data Parallel Library
+ <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html>`_
+ in the SageMaker Developer Guide.

- .. raw:: html
+ Use with the SageMaker Python SDK
+ =================================

- <div data-section-style="5" style="">
+ To use the SageMaker distributed data parallel library with the SageMaker Python SDK, you will need the following:

- - You must provide TensorFlow / PyTorch training scripts that are
-   adapted to use the distributed data parallel library.
+ - A TensorFlow or PyTorch training script that is
+   adapted to use the distributed data parallel library. The :ref:`sdp_api_docs` includes
+   framework-specific examples of training scripts that are adapted to use this library;
+   a brief sketch also follows this list.
- Your input data must be in an S3 bucket or in FSx in the AWS region
  that you will use to launch your training job. If you use the Jupyter
  notebooks provided, create a SageMaker notebook instance in the same
@@ -47,26 +44,63 @@ To customize your own training script, you will need the following:
  the `SageMaker Python SDK data
  inputs <https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-inputs>`__ documentation.
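+
+ The following is a minimal sketch of what an adapted PyTorch training script might look like, assuming
+ the ``smdistributed.dataparallel`` PyTorch modules described in the API documentation; ``Net`` and the
+ training loop are placeholders for your own code:
+
+ .. code:: python
+
+    # Minimal sketch: initialize the library, pin a GPU, and wrap the model.
+    import torch
+    import smdistributed.dataparallel.torch.distributed as dist
+    from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP
+
+    dist.init_process_group()            # start the SageMaker data parallel backend
+    local_rank = dist.get_local_rank()   # GPU assigned to this process on the instance
+    torch.cuda.set_device(local_rank)
+
+    model = DDP(Net().to(local_rank))    # ``Net`` is a placeholder for your model class
+    # Shard your dataset using dist.get_rank() and dist.get_world_size(), then run your training loop.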

- .. raw:: html
+ When you define
+ a PyTorch or TensorFlow ``Estimator`` using the SageMaker Python SDK,
+ you must select ``dataparallel`` as your ``distribution`` strategy:
+
+ .. code::
+
+    distribution = { "smdistributed": { "dataparallel": { "enabled": True } } }
+
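+ For example, a PyTorch ``Estimator`` configured this way might look like the following sketch; the
+ entry point, IAM role, instance settings, framework versions, and S3 path are placeholder values to
+ replace with your own:
+
+ .. code:: python
+
+    from sagemaker.pytorch import PyTorch
+
+    # Sketch of a training job with the library enabled; all values below are placeholders.
+    estimator = PyTorch(
+        entry_point="train.py",          # your adapted training script
+        role="arn:aws:iam::123456789012:role/SageMakerRole",
+        framework_version="1.6.0",
+        py_version="py36",
+        instance_count=2,                # multiple instances, each with multiple GPUs
+        instance_type="ml.p3.16xlarge",
+        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
+    )
+
+    estimator.fit("s3://your-bucket/training-data")  # input data in S3 (or FSx) in the same region
+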
+ We recommend you use one of the example notebooks as your template to launch a training job. When
+ you use an example notebook, you’ll need to swap the training script that came with the notebook
+ for your own and modify any input functions as necessary. For instructions on how to get started using a
+ Jupyter Notebook example, see `Distributed Training Jupyter Notebook Examples
+ <https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training-notebook-examples.html>`_.
+
+ Once you have launched a training job, you can monitor it using CloudWatch. To learn more, see
+ `Monitor and Analyze Training Jobs Using Metrics
+ <https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html>`_.
+
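+ In addition to the CloudWatch console, a minimal sketch for pulling these metrics into a notebook with
+ the SageMaker Python SDK is shown below; the job name and metric name are placeholders:
+
+ .. code:: python
+
+    from sagemaker.analytics import TrainingJobAnalytics
+
+    # Placeholder job and metric names; use the names your training job actually emits.
+    analytics = TrainingJobAnalytics(
+        training_job_name="your-training-job-name",
+        metric_names=["train:loss"],
+    )
+    print(analytics.dataframe())  # metrics as a pandas DataFrame
+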

- </div>
+ After you train a model, you can deploy it to an endpoint for inference by
+ following one of the `example notebooks for deploying a model
+ <https://sagemaker-examples.readthedocs.io/en/latest/inference/index.html>`_.
+ For more information, see `Deploy Models for Inference
+ <https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html>`_.
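+
+ As a minimal sketch, deploying the trained estimator with the SageMaker Python SDK might look like the
+ following; the instance type and payload are placeholders, and PyTorch deployments may also require an
+ inference script:
+
+ .. code:: python
+
+    # Sketch: deploy the trained model to a real-time endpoint, run a prediction, then clean up.
+    predictor = estimator.deploy(
+        initial_instance_count=1,
+        instance_type="ml.g4dn.xlarge",       # placeholder inference instance type
+    )
+
+    result = predictor.predict(sample_input)  # ``sample_input`` is a placeholder payload
+    predictor.delete_endpoint()               # delete the endpoint when you are done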

- Use the API guides for each framework to see
- examples of training scripts that can be used to convert your training scripts.
- Then use one of the example notebooks as your template to launch a training job.
- You’ll need to swap your training script with the one that came with the
- notebook and modify any input functions as necessary.
- Once you have launched a training job, you can monitor it using CloudWatch.
+ .. _sdp_api_docs:

- Then you can see how to deploy your trained model to an endpoint by
- following one of the example notebooks for deploying a model. Finally,
- you can follow an example notebook to test inference on your deployed
- model.
+ API Documentation
+ =================

+ This section contains the SageMaker distributed data parallel API documentation. If you are a
+ new user of this library, we recommend that you use this guide alongside
+ `SageMaker's Distributed Data Parallel Library
+ <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html>`_.
+
+ Select a version to see the API documentation for that version.

.. toctree::
-    :maxdepth: 2
+    :maxdepth: 1
+
+    sdp_versions/v1_0_0.rst
+
+ .. important::
+    The distributed data parallel library only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
+    ``Estimator`` with ``dataparallel`` parameter ``enabled`` set to ``True``,
+    it uses CUDA 11. When you extend or customize your own training image,
+    you must use a CUDA 11 base image. See
+    `SageMaker Python SDK's distributed data parallel library APIs
+    <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`_
+    for more information.
+
+
+ Release Notes
+ =============
+
+ New features, bug fixes, and improvements are regularly made to the SageMaker distributed data parallel library.
+
-    sdp_versions/smd_data_parallel_pytorch
-    sdp_versions/smd_data_parallel_tensorflow
+ To see the latest changes made to the library, refer to the library
+ `Release Notes
+ <https://github.com/aws/sagemaker-python-sdk/blob/master/doc/api/training/smd_data_parallel_release_notes/>`_.