Commit 3bf949e

Merge branch 'master' into PT113Release
2 parents: 87cbed1 + 0eade55

File tree

9 files changed (+134, -21 lines)

CHANGELOG.md

Lines changed: 19 additions & 0 deletions
@@ -1,5 +1,24 @@
 # Changelog
 
+## v2.151.0 (2023-04-27)
+
+### Features
+
+* Update Transformers 4.26 - TensorFlow 2.11.0 Image URI
+* Add Extra Parameters to Lambda Function Wrapper
+
+### Bug Fixes and Other Changes
+
+* Add kms key support for Model registration
+* Enable inference recommender slow tests
+* Pass sagemaker session to downstream s3 calls
+* Add ap-south-1 to no p3 regions
+* skip test for p2 instance for TF2.12 and above
+
+### Documentation Changes
+
+* Fix minor misses from the remote function doc release
+
 ## v2.150.0 (2023-04-26)
 
 ### Features

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.150.1.dev0
+2.151.1.dev0

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst

Lines changed: 78 additions & 6 deletions
@@ -3,12 +3,88 @@ Release Notes
 #############
 
 New features, bug fixes, and improvements are regularly made to the SageMaker
-distributed model parallel library.
+model parallelism library.
 
 
-SageMaker Distributed Model Parallel 1.14.0 Release Notes
+SageMaker Distributed Model Parallel 1.15.0 Release Notes
 =========================================================
 
+*Date: Apr. 27. 2023*
+
+**Currency Updates**
+
+* Added support for PyTorch v2.0.0.
+  Note that the library does not support ``torch.compile`` in this release.
+
+**New Features**
+
+* Using sharded data parallelism with tensor parallelism together is now
+  available for PyTorch 1.13.1. It allows you to train with smaller global batch
+  sizes while scaling up to large clusters. For more information, see `Sharded
+  data parallelism with tensor parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism>`_
+  in the *Amazon SageMaker Developer Guide*.
+* Added support for saving and loading full model checkpoints when using sharded
+  data parallelism. This is enabled by using the standard checkpointing API,
+  ``smp.save_checkpoint`` with ``partial=False``.
+  Previously, full checkpoints had to be created by merging partial checkpoint
+  files after training finished.
+* `DistributedTransformer <https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.html#smdistributed.modelparallel.torch.nn.DistributedTransformerLayer>`_
+  now supports the ALiBi position embeddings.
+  When using DistributedTransformer, you can set the ``use_alibi`` parameter
+  to ``True`` to use the Triton-based flash attention kernels. This helps
+  evaluate sequences longer than those used for training.
+
+**Bug Fixes**
+
+* When using tensor parallelism, parameters were initialized multiple times
+  unnecessarily. This release fixes the multiple initialization of parameters
+  so that each parameter is initialized exactly once.
+  This not only saves time, but also ensures that the random generator behavior
+  matches the non-tensor-parallelism case.
+
+**Known Issues**
+
+* Model initialization might take longer with PyTorch 2.0 than with PyTorch 1.13.
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
+
+- SageMaker training container for PyTorch v2.0.0
+
+  .. code::
+
+    763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker
+
+- SageMaker training container for PyTorch v1.13.1
+
+  .. code::
+
+    763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
+
+Binary file of this version of the library for `custom container
+<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_ users:
+
+- For PyTorch v2.0.0
+
+  .. code::
+
+    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-2.0.0/build-artifacts/2023-04-14-20-14/smdistributed_modelparallel-1.15.0-cp310-cp310-linux_x86_64.whl
+
+- For PyTorch v1.13.1
+
+  .. code::
+
+    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-04-17-15-49/smdistributed_modelparallel-1.15.0-cp39-cp39-linux_x86_64.whl
+
+----
+
+Release History
+===============
+
+SageMaker Distributed Model Parallel 1.14.0 Release Notes
+---------------------------------------------------------
 
 *Date: Jan. 30. 2023*
 
 **Currency Updates**
@@ -39,10 +115,6 @@ Binary file of this version of the library for `custom container
 
 https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-01-19-18-35/smdistributed_modelparallel-1.14.0-cp39-cp39-linux_x86_64.whl
 
-----
-
-Release History
-===============
 
 SageMaker Distributed Model Parallel 1.13.0 Release Notes
 ---------------------------------------------------------
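
For readers of the checkpointing change above: a minimal sketch of saving a full (non-partial) checkpoint under sharded data parallelism. Only ``smp.save_checkpoint`` and ``partial=False`` come from these release notes; the keyword names follow the library's documented checkpointing API, and the path, tag, and helper name are illustrative assumptions.

import smdistributed.modelparallel.torch as smp

def save_full_checkpoint(model, optimizer, step):
    # Hypothetical helper. With partial=False the library writes one full
    # checkpoint instead of per-rank partial files, so no post-training
    # merge step is needed (new in v1.15.0).
    smp.save_checkpoint(
        path="/opt/ml/checkpoints",  # assumed output location
        tag=f"step-{step}",          # assumed checkpoint tag
        partial=False,
        model=model,
        optimizer=optimizer,
    )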

doc/api/training/smp_versions/latest.rst

Lines changed: 2 additions & 2 deletions
@@ -10,8 +10,8 @@ depending on which version of the library you need to use.
 To use the library, reference the
 **Common API** documentation alongside the framework specific API documentation.
 
-Version 1.11.0, 1.13.0, 1.14.0 (Latest)
-=======================================
+Version 1.11.0, 1.13.0, 1.14.0, 1.15.0 (Latest)
+===============================================
 
 To use the library, reference the Common API documentation alongside the framework specific API documentation.

doc/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.rst

Lines changed: 14 additions & 0 deletions
@@ -302,6 +302,20 @@ Tensor Parallelism Module APIs
 - ``post_layernorm``: If ``True``, inserts layer normalization at
   the output. At least one of ``pre_layernorm`` and
   ``post_layernorm`` must be ``True``.
+- ``use_alibi`` (bool, default False): Activates Attention with
+  Linear Biases (ALiBi) for attention computation.
+  ALiBi facilitates efficient extrapolation on input sequences
+  and thus improves training efficiency.
+  The library enables ALiBi by using the `Triton
+  flash attention kernel
+  <https://github.com/HazyResearch/flash-attention>`_.
+  Refer to https://arxiv.org/abs/2108.12409 for more
+  details on the technique.
+  (Available from
+  the SageMaker model parallelism library v1.15.0.)
+- ``alibi_bias_max`` (int, default 8): Defines the ALiBi base
+  value for mask generation. (Available from
+  the SageMaker model parallelism library v1.15.0.)
 
 - **Methods:**
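
A hypothetical sketch of the new parameters in use: ``use_alibi`` and ``alibi_bias_max`` come from the doc addition above, while the layer's other constructor arguments (hidden size, head count, head size) are assumed values for illustration and may not match your configuration.

import smdistributed.modelparallel.torch as smp

# Enable ALiBi position embeddings on a tensor-parallel transformer layer
# (SageMaker model parallelism library v1.15.0+).
layer = smp.nn.DistributedTransformerLayer(
    hidden_size=1024,        # assumed model width
    num_attention_heads=16,  # assumed head count
    attention_head_size=64,  # assumed per-head dimension
    use_alibi=True,          # route attention through the Triton-based
                             # flash attention kernels with ALiBi biases
    alibi_bias_max=8,        # ALiBi base value for mask generation (default)
)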

src/sagemaker/clarify.py

Lines changed: 4 additions & 2 deletions
@@ -94,6 +94,8 @@
                     {object: object},
                 )
             ],
+            # Arbitrary JSON object as baseline
+            {object: object},
         ),
         SchemaOptional("num_clusters"): int,
         SchemaOptional("use_logit"): bool,
@@ -1211,7 +1213,7 @@ class SHAPConfig(ExplainabilityConfig):
 
     def __init__(
        self,
-        baseline: Optional[Union[str, List]] = None,
+        baseline: Optional[Union[str, List, Dict]] = None,
         num_samples: Optional[int] = None,
         agg_method: Optional[str] = None,
         use_logit: bool = False,
@@ -1224,7 +1226,7 @@ def __init__(
         """Initializes config for SHAP analysis.
 
         Args:
-            baseline (None or str or list): `Baseline dataset <https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-feature-attribute-shap-baselines.html>`_
+            baseline (None or str or list or dict): `Baseline dataset <https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-feature-attribute-shap-baselines.html>`_
                 for the Kernel SHAP algorithm, accepted in the form of:
                 S3 object URI, a list of rows (with at least one element),
                 or None (for no input baseline). The baseline dataset must have the same format
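
With ``Dict`` added to the accepted types, a ``SHAPConfig`` can now be built with a JSON-object baseline directly; the dict below mirrors the one exercised by the updated unit test at the end of this diff, and the other arguments echo that test's values.

from sagemaker.clarify import SHAPConfig

# Baseline passed as an arbitrary JSON object rather than an S3 URI or
# a list of rows; enabled by the Union[str, List, Dict] widening above.
shap_config = SHAPConfig(
    baseline={
        "instances": [
            {"features": [0.26124998927116394, 0.2824999988079071, 0.06875000149011612]}
        ]
    },
    num_samples=100,
    agg_method="mean_sq",
    use_logit=True,
)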

src/sagemaker/image_uri_config/djl-deepspeed.json

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@
             "us-west-2": "763104351884"
         },
         "repository": "djl-inference",
-        "tag_prefix": "0.21.0-deepspeed0.8.0-cu117"
+        "tag_prefix": "0.21.0-deepspeed0.8.3-cu117"
     },
     "0.20.0": {
         "registries": {

tests/unit/sagemaker/image_uris/test_djl.py

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@
     "0.19.0": {"djl-deepspeed": "deepspeed0.7.3-cu113"},
     "0.20.0": {"djl-deepspeed": "deepspeed0.7.5-cu116"},
     "0.21.0": {
-        "djl-deepspeed": "deepspeed0.8.0-cu117",
+        "djl-deepspeed": "deepspeed0.8.3-cu117",
        "djl-fastertransformer": "fastertransformer5.3.0-cu117",
     },
 }
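
For context, the ``tag_prefix`` updated above is what the SDK's image URI lookup resolves; a sketch of retrieving the bumped DJL DeepSpeed image, assuming the standard ``image_uris.retrieve`` entry point (the region is illustrative):

from sagemaker import image_uris

# After this change, version 0.21.0 resolves to a tag starting with
# "0.21.0-deepspeed0.8.3-cu117" (previously deepspeed0.8.0).
uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region="us-west-2",
    version="0.21.0",
)
print(uri)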

tests/unit/test_clarify.py

Lines changed: 14 additions & 8 deletions
@@ -563,14 +563,20 @@ def test_invalid_model_predicted_label_config():
     )
 
 
-def test_shap_config():
-    baseline = [
-        [
-            0.26124998927116394,
-            0.2824999988079071,
-            0.06875000149011612,
-        ]
-    ]
+@pytest.mark.parametrize(
+    "baseline",
+    [
+        ([[0.26124998927116394, 0.2824999988079071, 0.06875000149011612]]),
+        (
+            {
+                "instances": [
+                    {"features": [0.26124998927116394, 0.2824999988079071, 0.06875000149011612]}
+                ]
+            }
+        ),
+    ],
+)
+def test_valid_shap_config(baseline):
     num_samples = 100
     agg_method = "mean_sq"
     use_logit = True

Comments (0)