Commit 7a4d2f5

Merge branch 'master' into pr-framework-processor
2 parents 3a1907f + a058347 commit 7a4d2f5

65 files changed: +1268 -699 lines changed


CHANGELOG.md

+85
@@ -1,5 +1,90 @@
 # Changelog
 
+## v2.38.0 (2021-04-21)
+
+### Features
+
+* support multiprocess feature group ingest (#2111)
+
+## v2.37.0 (2021-04-20)
+
+### Features
+
+* add experiment_config for clarify processing job
+
+### Documentation Changes
+
+* release notes for smdistributed.dataparallel v1.1.2
+
+## v2.36.0 (2021-04-19)
+
+### Features
+
+* enable smdataparallel custom mpi options support
+
+## v2.35.0 (2021-04-14)
+
+### Features
+
+* add support for PyTorch 1.8.1
+
+### Bug Fixes and Other Changes
+
+* boto3 client param updated for feature store
+* Updated release notes and API doc for smd model parallel 1.3.1
+
+## v2.34.0 (2021-04-12)
+
+### Features
+
+* Add support for accelerator in Clarify
+
+### Bug Fixes and Other Changes
+
+* add Documentation for how to use
+* enable local mode tests that were skipped
+* add integ test for HuggingFace with TensorFlow
+
+### Documentation Changes
+
+* release notes for smdistributed.dataparallel v1.1.1
+* fixing the SageMaker distributed version references
+
+### Testing and Release Infrastructure
+
+* pin version for docutils
+
+## v2.33.0 (2021-04-05)
+
+### Features
+
+* Add environment variable support for SageMaker training job
+
+### Bug Fixes and Other Changes
+
+* add version length mismatch validation for HuggingFace
+* Disable debugger when checkpointing is enabled with distributed training
+* map user context in list associations response
+
+### Testing and Release Infrastructure
+
+* disable_profiler on mx-horovod test
+
+## v2.32.1 (2021-04-01)
+
+### Bug Fixes and Other Changes
+
+* disable profiler in some release tests
+* remove outdated notebook from test
+* add compilation option for ml_eia2
+* add short version to smdataparallel supported list
+
+### Documentation Changes
+
+* creating a "latest" version sm distributed docs
+* add docs for Sagemaker Model Parallel 1.3, released with PT 1.8
+* update PyTorch version in doc
+
 ## v2.32.0 (2021-03-26)
 
 ### Features

VERSION

+1-1
@@ -1 +1 @@
-2.32.1.dev0
+2.38.1.dev0

doc/amazon_sagemaker_featurestore.rst

+1-1
@@ -67,7 +67,7 @@ use the SageMaker default bucket and add a custom prefix to it.
 offline_feature_store_bucket = 's3://*{}*/*{}*'.format(default_bucket, prefix)
 
 sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
-featurestore_runtime = boto_session.client(service_name='featurestore-runtime', region_name=region)
+featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)
 
 feature_store_session = Session(
     boto_session=boto_session,
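
For orientation, here is a hedged sketch of the session setup this snippet belongs to, using the corrected client name. The region value is a placeholder, and the ``sagemaker_featurestore_runtime_client`` keyword follows the Feature Store documentation pattern rather than anything introduced by this commit.

```python
import boto3
from sagemaker.session import Session

region = "us-west-2"  # placeholder region, not part of this change
boto_session = boto3.Session(region_name=region)

sagemaker_client = boto_session.client(service_name="sagemaker", region_name=region)

# The corrected client name from this diff:
featurestore_runtime = boto_session.client(
    service_name="sagemaker-featurestore-runtime", region_name=region
)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime,
)
```
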
doc/api/training/sdp_versions/latest.rst

+9

@@ -0,0 +1,9 @@
+
+Version 1.1.2 (Latest)
+======================
+
+.. toctree::
+   :maxdepth: 1
+
+   latest/smd_data_parallel_pytorch.rst
+   latest/smd_data_parallel_tensorflow.rst

doc/api/training/sdp_versions/v1.1.0/smd_data_parallel_pytorch.rst renamed to doc/api/training/sdp_versions/latest/smd_data_parallel_pytorch.rst

+2-2
@@ -153,9 +153,9 @@ you will have for distributed training with the distributed data parallel library.
 PyTorch API
 ===========
 
-**Supported versions:**
+.. rubric:: Supported versions
 
-- PyTorch 1.6.0, 1.8.0
+**PyTorch 1.7.1, 1.8.0**
 
 
 .. function:: smdistributed.dataparallel.torch.distributed.is_available()

doc/api/training/sdp_versions/v1.1.0/smd_data_parallel_tensorflow.rst renamed to doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst

+7-4
@@ -16,8 +16,9 @@ The following steps show you how to convert a TensorFlow 2.x training
 script to utilize the distributed data parallel library.
 
 The distributed data parallel library APIs are designed to be close to Horovod APIs.
-See `SageMaker distributed data parallel TensorFlow examples <https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#tensorflow-distributed>`__ for additional details on how to implement the data parallel library
-API offered for TensorFlow.
+See `SageMaker distributed data parallel TensorFlow examples
+<https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#tensorflow-distributed>`__
+for additional details on how to implement the data parallel library.
 
 - First import the distributed data parallel library’s TensorFlow client and initialize it:
 
@@ -156,8 +157,10 @@ TensorFlow API
 
 .. rubric:: Supported versions
 
-- TensorFlow 2.x - 2.3.1
-
+TensorFlow is supported in version 1.0.0 of ``smdistributed.dataparallel``.
+Reference the version 1.0.0 `TensorFlow API documentation
+<https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.html#tensorflow-sdp-api>`_
+for supported TensorFlow versions.
 
 .. function:: smdistributed.dataparallel.tensorflow.init()
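
To complement the conversion steps this file describes, below is a minimal sketch of the import-and-initialize step, assuming a SageMaker training container where ``smdistributed.dataparallel`` is installed; the GPU-pinning lines follow the pattern used in the SageMaker examples and are illustrative only.

```python
import tensorflow as tf
import smdistributed.dataparallel.tensorflow as sdp

# Initialize the SageMaker distributed data parallel library.
sdp.init()

# Pin each library process to a single GPU on its node.
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], "GPU")
```
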
doc/api/training/sdp_versions/v1.0.0/smd_data_parallel_pytorch.rst

+6-8
@@ -4,11 +4,10 @@ PyTorch Guide to SageMaker's distributed data parallel library
 
 .. admonition:: Contents
 
-   - :ref:`pytorch-sdp-modify`
-   - :ref:`pytorch-sdp-api`
+   - :ref:`pytorch-sdp-modify-1.0.0`
+   - :ref:`pytorch-sdp-api-1.0.0`
 
-.. _pytorch-sdp-modify:
-   :noindex:
+.. _pytorch-sdp-modify-1.0.0:
 
 Modify a PyTorch training script to use SageMaker data parallel
 ======================================================================
 
@@ -149,15 +148,14 @@ you will have for distributed training with the distributed data parallel library.
     main()
 
 
-.. _pytorch-sdp-api:
-   :noindex:
+.. _pytorch-sdp-api-1.0.0:
 
 PyTorch API
 ===========
 
-**Supported versions:**
+.. rubric:: Supported versions
 
-- PyTorch 1.6.0
+**PyTorch 1.6.0, 1.7.1**
 
 
 .. function:: smdistributed.dataparallel.torch.distributed.is_available()
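
Since this guide covers modifying a PyTorch script to use the library, here is a hedged sketch of the initialization pattern from the SageMaker documentation; the single ``torch.nn.Linear`` layer is an illustrative stand-in for a real model.

```python
import torch
import smdistributed.dataparallel.torch.distributed as dist
from smdistributed.dataparallel.torch.parallel.distributed import (
    DistributedDataParallel as DDP,
)

# Initialize the library's process group (one process per GPU).
dist.init_process_group()

# Pin this process to its GPU on the node.
local_rank = dist.get_local_rank()
torch.cuda.set_device(local_rank)

# Wrap the model with the library's DistributedDataParallel implementation.
model = DDP(torch.nn.Linear(10, 1).to(local_rank))
```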

doc/api/training/sdp_versions/v1.0.0/smd_data_parallel_tensorflow.rst

+5-7
@@ -4,11 +4,10 @@ TensorFlow Guide to SageMaker's distributed data parallel library
 
 .. admonition:: Contents
 
-   - :ref:`tensorflow-sdp-modify`
-   - :ref:`tensorflow-sdp-api`
+   - :ref:`tensorflow-sdp-modify-1.0.0`
+   - :ref:`tensorflow-sdp-api-1.0.0`
 
-.. _tensorflow-sdp-modify:
-   :noindex:
+.. _tensorflow-sdp-modify-1.0.0:
 
 Modify a TensorFlow 2.x training script to use SageMaker data parallel
 ======================================================================
 
@@ -150,15 +149,14 @@ script you will have for distributed training with the library.
     checkpoint.save(checkpoint_dir)
 
 
-.. _tensorflow-sdp-api:
-   :noindex:
+.. _tensorflow-sdp-api-1.0.0:
 
 TensorFlow API
 ==============
 
 .. rubric:: Supported versions
 
-- TensorFlow 2.x - 2.3.1
+**TensorFlow 2.3.x - 2.4.1**
 
 
 .. function:: smdistributed.dataparallel.tensorflow.init()

doc/api/training/sdp_versions/v1_1_0.rst

-9
This file was deleted.

doc/api/training/smd_data_parallel.rst

+1-1
@@ -84,7 +84,7 @@ Select a version to see the API documentation for version.
 .. toctree::
    :maxdepth: 1
 
-   sdp_versions/v1_1_0.rst
+   sdp_versions/latest.rst
    sdp_versions/v1_0_0.rst
 
 .. important::

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.md

+35-4
@@ -1,23 +1,54 @@
+# Sagemaker Distributed Data Parallel 1.1.2 Release Notes
+
+* Bug Fixes
+* Known Issues
+
+*Bug Fixes:*
+
+* Fixed a bug that caused some TensorFlow operations to not work with certain data types. Operations forwarded from C++ have been extended to support every dtype supported by NCCL.
+
+*Known Issues:*
+
+* SageMaker distributed data parallel has slower throughput than NCCL when run using a single node. For the best performance, use multi-node distributed training with smdistributed.dataparallel. Use a single node only for experimental runs while preparing your training pipeline.
+
+# Sagemaker Distributed Data Parallel 1.1.1 Release Notes
+
+* New Features
+* Bug Fixes
+* Known Issues
+
+*New Features:*
+
+* Adds support for PyTorch 1.8.1
+
+*Bug Fixes:*
+
+* Fixes a bug that was causing gradients from one of the worker nodes to be added twice, resulting in incorrect `all_reduce` results under some conditions.
+
+*Known Issues:*
+
+* SageMaker distributed data parallel is still not efficient when run using a single node. For the best performance, use multi-node distributed training with `smdistributed.dataparallel`. Use a single node only for experimental runs while preparing your training pipeline.
+
 # Sagemaker Distributed Data Parallel 1.1.0 Release Notes
 
 * New Features
 * Bug Fixes
 * Improvements
 * Known Issues
 
-New Features:
+*New Features:*
 
 * Adds support for PyTorch 1.8.0 with CUDA 11.1 and CUDNN 8
 
-Bug Fixes:
+*Bug Fixes:*
 
 * Fixes crash issue when importing `smdataparallel` before PyTorch
 
-Improvements:
+*Improvements:*
 
 * Update `smdataparallel` name in python packages, descriptions, and log outputs
 
-Known Issues:
+*Known Issues:*
 
 * SageMaker DataParallel is not efficient when run using a single node. For the best performance, use multi-node distributed training with `smdataparallel`. Use a single node only for experimental runs while preparing your training pipeline.

doc/api/training/smd_model_parallel.rst

+1-1
@@ -34,7 +34,7 @@ Select a version to see the API documentation for version. To use the library, r
 .. toctree::
    :maxdepth: 1
 
-   smp_versions/v1_3_0.rst
+   smp_versions/latest.rst
    smp_versions/v1_2_0.rst
    smp_versions/v1_1_0.rst

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md

+30
@@ -1,3 +1,33 @@
+# Sagemaker Distributed Model Parallel 1.3.1 Release Notes
+
+- New Features
+- Bug Fixes
+- Known Issues
+
+## New Features
+
+### TensorFlow
+
+- Exposes a new decorator ``register_post_partition_hook``. This allows invoking the decorated methods just after model partition but before executing the first step, for example to load a checkpoint. Refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_tensorflow.html) for more information.
+
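To make the item above concrete, here is a hypothetical sketch of registering such a hook; the ``smp`` alias for ``smdistributed.modelparallel.tensorflow`` and the decorator placement are assumptions, so consult the linked API documentation for the exact signature.

```python
import tensorflow as tf
import smdistributed.modelparallel.tensorflow as smp

smp.init()

# Assumed usage: the decorated function runs once, right after the model is
# partitioned and before the first training step, a natural place to restore
# a checkpoint.
@smp.register_post_partition_hook
def restore_checkpoint():
    tf.print("Partitioning finished; restoring checkpoint before the first step")
```
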
+## Bug Fixes
+
+### PyTorch
+
+- Improved memory efficiency when using active microbatches by clearing activations at the end of each microbatch.
+
+### TensorFlow
+
+- Fixed an issue that caused hangs when training some models with XLA enabled.
+
+## Known Issues
+
+### PyTorch
+
+- A crash was observed when ``optimizer.step()`` was called for certain optimizers such as AdaDelta, when the partition on which this method was called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which [has since been fixed](https://github.com/pytorch/pytorch/pull/52944). Until that fix makes its way into the next release of PyTorch, only call ``optimizer.step()`` on processes which have at least one local parameter. This can be checked like this: ``len(list(model.local_parameters())) > 0``.
+
+- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be the slowdown in performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6. See the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8.
+
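A condensed, hedged sketch of the workaround in the first known issue above; it assumes the usual ``smdistributed.modelparallel.torch`` setup (``smp.DistributedModel`` / ``smp.DistributedOptimizer``), and a real training script would also define the ``@smp.step``-decorated forward/backward function.

```python
import torch
import smdistributed.modelparallel.torch as smp

smp.init()

# Toy stand-ins; in practice the model and optimizer come from your script.
model = smp.DistributedModel(torch.nn.Linear(8, 1))
optimizer = smp.DistributedOptimizer(torch.optim.Adadelta(model.parameters()))

# Workaround: only call optimizer.step() on ranks whose partition received
# at least one local parameter after partitioning.
if len(list(model.local_parameters())) > 0:
    optimizer.step()
```
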
 # Sagemaker Distributed Model Parallel 1.3.0 Release Notes
 
 - New Features

doc/api/training/smp_versions/v1_3_0.rst renamed to doc/api/training/smp_versions/latest.rst

+3-3
@@ -7,6 +7,6 @@ To use the library, reference the Common API documentation alongside the framewo
 .. toctree::
    :maxdepth: 1
 
-   v1.3.0/smd_model_parallel_common_api
-   v1.3.0/smd_model_parallel_pytorch
-   v1.3.0/smd_model_parallel_tensorflow
+   latest/smd_model_parallel_common_api
+   latest/smd_model_parallel_pytorch
+   latest/smd_model_parallel_tensorflow

doc/api/training/smp_versions/v1.3.0/smd_model_parallel_pytorch.rst renamed to doc/api/training/smp_versions/latest/smd_model_parallel_pytorch.rst

+1-1
@@ -6,7 +6,7 @@
 PyTorch API
 ===========
 
-**Supported versions: 1.7.1, 1.8.0**
+**Supported versions: 1.6.0, 1.7.1, 1.8.0**
 
 This API document assumes you use the following import statements in your training scripts.
