Commit 551566b

Merge branch 'master' into fix-cloudwatch-logs-local
2 parents d1efc85 + 7fec6c1 commit 551566b

18 files changed with 232 additions and 568 deletions.

CHANGELOG.md

+16
@@ -1,5 +1,21 @@
 # Changelog

+## v2.33.0 (2021-04-05)
+
+### Features
+
+* Add environment variable support for SageMaker training jobs
+
+### Bug Fixes and Other Changes
+
+* add version length mismatch validation for HuggingFace
+* Disable debugger when checkpointing is enabled with distributed training
+* map user context in list associations response
+
+### Testing and Release Infrastructure
+
+* disable_profiler on mx-horovod test
+
 ## v2.32.1 (2021-04-01)

 ### Bug Fixes and Other Changes

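For the environment-variable feature listed in v2.33.0 above, a minimal usage sketch, assuming the feature is exposed as an `environment` argument on the estimator; the image URI, role, and variable values are placeholders:

    from sagemaker.estimator import Estimator

    # Sketch: forward environment variables to the training container
    # (per the v2.33.0 feature above; all concrete values are placeholders).
    estimator = Estimator(
        image_uri="<training-image-uri>",
        role="<execution-role-arn>",
        instance_count=1,
        instance_type="ml.m5.xlarge",
        environment={"MY_ENV_VAR": "my-value"},  # assumed parameter name
    )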
VERSION

+1-1
@@ -1 +1 @@
-2.32.2.dev0
+2.33.1.dev0

doc/api/training/sdp_versions/latest.rst

+1-1
@@ -1,5 +1,5 @@
 
-Version 1.1.0 (Latest)
+Version 1.1.1 (Latest)
 ======================

 .. toctree::

doc/api/training/sdp_versions/latest/smd_data_parallel_pytorch.rst

+2-2
@@ -153,9 +153,9 @@ you will have for distributed training with the distributed data parallel library
 PyTorch API
 ===========

-**Supported versions:**
+.. rubric:: Supported versions

-- PyTorch 1.6.0, 1.8.0
+**PyTorch 1.7.1, 1.8.0**


 .. function:: smdistributed.dataparallel.torch.distributed.is_available()

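For context on the API referenced in the hunk above, a minimal sketch of how `is_available()` is typically paired with initialization in an SDP PyTorch script; `init_process_group()` and `get_local_rank()` are assumed companions from the same module, not part of this diff:

    import smdistributed.dataparallel.torch.distributed as dist

    if dist.is_available():           # function documented in the hunk above
        dist.init_process_group()     # assumed initializer from the same module
        rank = dist.get_local_rank()  # assumed helper, often used to pin the GPU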
doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst

+7-4
@@ -16,8 +16,9 @@ The following steps show you how to convert a TensorFlow 2.x training
 script to utilize the distributed data parallel library.

 The distributed data parallel library APIs are designed to be close to Horovod APIs.
-See `SageMaker distributed data parallel TensorFlow examples <https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#tensorflow-distributed>`__ for additional details on how to implement the data parallel library
-API offered for TensorFlow.
+See `SageMaker distributed data parallel TensorFlow examples
+<https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#tensorflow-distributed>`__
+for additional details on how to implement the data parallel library.

 - First import the distributed data parallel library’s TensorFlow client and initialize it:

@@ -156,8 +157,10 @@ TensorFlow API

 .. rubric:: Supported versions

-- TensorFlow 2.x - 2.3.1
-
+TensorFlow is supported in version 1.0.0 of ``smdistributed.dataparallel``.
+Reference version 1.0.0 `TensorFlow API documentation
+<https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.html#tensorflow-sdp-api>`_
+for supported TensorFlow versions.

 .. function:: smdistributed.dataparallel.tensorflow.init()

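The hunk above points TensorFlow users at the SDP v1.0.0 API; as a sketch, the import-and-initialize step it describes looks like the following (only `init()` is named in the diff; the module alias is illustrative):

    import smdistributed.dataparallel.tensorflow as sdp  # alias is illustrative

    # Entry point documented above: smdistributed.dataparallel.tensorflow.init()
    sdp.init()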
doc/api/training/sdp_versions/v1.0.0/smd_data_parallel_pytorch.rst

+6-8
@@ -4,11 +4,10 @@ PyTorch Guide to SageMaker's distributed data parallel library
 
 .. admonition:: Contents

-- :ref:`pytorch-sdp-modify`
-- :ref:`pytorch-sdp-api`
+- :ref:`pytorch-sdp-modify-1.0.0`
+- :ref:`pytorch-sdp-api-1.0.0`

-.. _pytorch-sdp-modify:
-   :noindex:
+.. _pytorch-sdp-modify-1.0.0:

 Modify a PyTorch training script to use SageMaker data parallel
 ======================================================================
@@ -149,15 +148,14 @@ you will have for distributed training with the distributed data parallel library
     main()


-.. _pytorch-sdp-api:
-   :noindex:
+.. _pytorch-sdp-api-1.0.0:

 PyTorch API
 ===========

-**Supported versions:**
+.. rubric:: Supported versions

-- PyTorch 1.6.0
+**PyTorch 1.6.0, 1.7.1**


 .. function:: smdistributed.dataparallel.torch.distributed.is_available()

doc/api/training/sdp_versions/v1.0.0/smd_data_parallel_tensorflow.rst

+5-7
@@ -4,11 +4,10 @@ TensorFlow Guide to SageMaker's distributed data parallel library
 
 .. admonition:: Contents

-- :ref:`tensorflow-sdp-modify`
-- :ref:`tensorflow-sdp-api`
+- :ref:`tensorflow-sdp-modify-1.0.0`
+- :ref:`tensorflow-sdp-api-1.0.0`

-.. _tensorflow-sdp-modify:
-   :noindex:
+.. _tensorflow-sdp-modify-1.0.0:

 Modify a TensorFlow 2.x training script to use SageMaker data parallel
 ======================================================================
@@ -150,15 +149,14 @@ script you will have for distributed training with the library.
     checkpoint.save(checkpoint_dir)


-.. _tensorflow-sdp-api:
-   :noindex:
+.. _tensorflow-sdp-api-1.0.0:

 TensorFlow API
 ==============

 .. rubric:: Supported versions

-- TensorFlow 2.x - 2.3.1
+**TensorFlow 2.3.x - 2.4.1**


 .. function:: smdistributed.dataparallel.tensorflow.init()

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.md

+22-4
@@ -1,23 +1,41 @@
+# Sagemaker Distributed Data Parallel 1.1.1 Release Notes
+
+* New Features
+* Bug Fixes
+* Known Issues
+
+*New Features:*
+
+* Adds support for PyTorch 1.8.1
+
+*Bug Fixes:*
+
+* Fixes a bug that was causing gradients from one of the worker nodes to be added twice, resulting in incorrect `all_reduce` results under some conditions.
+
+*Known Issues:*
+
+* SageMaker distributed data parallel still is not efficient when run using a single node. For the best performance, use multi-node distributed training with `smdistributed.dataparallel`. Use a single node only for experimental runs while preparing your training pipeline.
+
 # Sagemaker Distributed Data Parallel 1.1.0 Release Notes

 * New Features
 * Bug Fixes
 * Improvements
 * Known Issues

-New Features:
+*New Features:*

 * Adds support for PyTorch 1.8.0 with CUDA 11.1 and CUDNN 8

-Bug Fixes:
+*Bug Fixes:*

 * Fixes crash issue when importing `smdataparallel` before PyTorch

-Improvements:
+*Improvements:*

 * Update `smdataparallel` name in python packages, descriptions, and log outputs

-Known Issues:
+*Known Issues:*

 * SageMaker DataParallel is not efficient when run using a single node. For the best performance, use multi-node distributed training with `smdataparallel`. Use a single node only for experimental runs while preparing your training pipeline.

doc/api/training/smp_versions/latest/smd_model_parallel_pytorch.rst

+1-1
@@ -6,7 +6,7 @@
 PyTorch API
 ===========

-**Supported versions: 1.7.1, 1.8.0**
+**Supported versions: 1.6.0, 1.7.1, 1.8.0**

 This API document assumes you use the following import statements in your training scripts.

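The import statement that the last context line refers to is not shown in this hunk; as a sketch, the model parallel library's PyTorch docs of this era use the following pattern (both the import alias and the `smp.init()` call are assumptions here, not part of the diff):

    import smdistributed.modelparallel.torch as smp  # import the API doc assumes

    smp.init()  # assumed initialization before using the model parallel API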
doc/frameworks/huggingface/index.rst

+1
@@ -9,3 +9,4 @@ For general information about using the SageMaker Python SDK, see :ref:`overview
    :maxdepth: 2

    sagemaker.huggingface
+   Use Hugging Face with the SageMaker Python SDK <https://huggingface.co/transformers/sagemaker.html>

doc/requirements.txt

+1
@@ -1,2 +1,3 @@
 sphinx==3.1.1
 sphinx-rtd-theme==0.5.0
+docutils==0.15.2

src/sagemaker/clarify.py

+6-3
@@ -123,6 +123,7 @@ def __init__(
         content_type=None,
         content_template=None,
         custom_attributes=None,
+        accelerator_type=None,
     ):
         """Initializes a configuration of a model and the endpoint to be created for it.

@@ -151,6 +152,9 @@
                 Section 3.3.6. Field Value Components (
                 https://tools.ietf.org/html/rfc7230#section-3.2.6) of the Hypertext Transfer
                 Protocol (HTTP/1.1).
+            accelerator_type (str): The Elastic Inference accelerator type to deploy to the model
+                endpoint instance for making inferences to the model, see
+                https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html.
         """
         self.predictor_config = {
             "model_name": model_name,
@@ -178,9 +182,8 @@
                 f" Please include a placeholder $features."
             )
             self.predictor_config["content_template"] = content_template
-
-        if custom_attributes is not None:
-            self.predictor_config["custom_attributes"] = custom_attributes
+        _set(custom_attributes, "custom_attributes", self.predictor_config)
+        _set(accelerator_type, "accelerator_type", self.predictor_config)

     def get_predictor_config(self):
         """Returns part of the predictor dictionary of the analysis config."""

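A sketch of how the new `accelerator_type` argument would be passed, assuming this `__init__` belongs to `sagemaker.clarify.ModelConfig` (the parameter list shown matches that class); all concrete values are placeholders:

    from sagemaker.clarify import ModelConfig

    # Sketch: the new accelerator_type flows into predictor_config via _set().
    model_config = ModelConfig(
        model_name="my-model",              # placeholder
        instance_type="ml.c5.xlarge",       # placeholder
        instance_count=1,
        accelerator_type="ml.eia2.medium",  # parameter added in this commit
    )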
tests/conftest.py

+12-7
@@ -190,7 +190,7 @@ def pytorch_inference_py_version(pytorch_inference_version, request):
     return "py3"


-def _huggingface_pytorch_version(huggingface_vesion):
+def _huggingface_base_fm_version(huggingface_vesion, base_fw):
     config = image_uris.config_for_framework("huggingface")
     training_config = config.get("training")
     original_version = huggingface_vesion
@@ -200,21 +200,26 @@
     )
     version_config = training_config.get("versions").get(huggingface_vesion)
     for key in list(version_config.keys()):
-        if key.startswith("pytorch"):
-            pt_version = key[7:]
+        if key.startswith(base_fw):
+            base_fw_version = key[len(base_fw) :]
             if len(original_version.split(".")) == 2:
-                pt_version = ".".join(pt_version.split(".")[:-1])
-            return pt_version
+                base_fw_version = ".".join(base_fw_version.split(".")[:-1])
+            return base_fw_version


 @pytest.fixture(scope="module")
 def huggingface_pytorch_version(huggingface_training_version):
-    return _huggingface_pytorch_version(huggingface_training_version)
+    return _huggingface_base_fm_version(huggingface_training_version, "pytorch")


 @pytest.fixture(scope="module")
 def huggingface_pytorch_latest_version(huggingface_training_latest_version):
-    return _huggingface_pytorch_version(huggingface_training_latest_version)
+    return _huggingface_base_fm_version(huggingface_training_latest_version, "pytorch")
+
+
+@pytest.fixture(scope="module")
+def huggingface_tensorflow_latest_version(huggingface_training_latest_version):
+    return _huggingface_base_fm_version(huggingface_training_latest_version, "tensorflow")


 @pytest.fixture(scope="module")

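To make the renamed helper's behavior concrete: it strips a framework-name prefix from image-config keys rather than hard-coding the length of "pytorch". A standalone sketch with illustrative keys, not the repo's actual config:

    # Illustrative sketch of the prefix stripping in _huggingface_base_fm_version.
    def strip_base_fw(key, base_fw):
        # e.g. "pytorch1.7.1" with base_fw="pytorch" -> "1.7.1" (hypothetical values)
        return key[len(base_fw):] if key.startswith(base_fw) else None

    print(strip_base_fw("pytorch1.7.1", "pytorch"))        # -> 1.7.1
    print(strip_base_fw("tensorflow2.4.1", "tensorflow"))  # -> 2.4.1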