
Commit 3eeccfb

Merge branch 'master' into master
2 parents: 7b55268 + 38a3a2d

104 files changed: +9734 −386 lines changed

.flake8 (+1)

@@ -3,3 +3,4 @@ application_import_names = sagemaker, tests
 import-order-style = google
 per-file-ignores =
     tests/unit/test_tuner.py: F405
+    src/sagemaker/config/config_schema.py: E501

CHANGELOG.md (+61)

@@ -1,5 +1,66 @@
 # Changelog
 
+## v2.151.0 (2023-04-27)
+
+### Features
+
+* Update Transformers 4.26 - TensorFlow 2.11.0 Image URI
+* Add Extra Parameters to Lambda Function Wrapper
+
+### Bug Fixes and Other Changes
+
+* Add kms key support for Model registration
+* Enable inference recommender slow tests
+* Pass sagemaker session to downstream s3 calls
+* Add ap-south-1 to no p3 regions
+* skip test for p2 instance for TF2.12 and above
+
+### Documentation Changes
+
+* Fix minor misses from the remote function doc release
+
+## v2.150.0 (2023-04-26)
+
+### Features
+
+* Introduce TensorBoard app class
+
+### Bug Fixes and Other Changes
+
+* Update data wrangler images
+
+## v2.149.0 (2023-04-25)
+
+### Features
+
+* Support TF2.12 SageMaker DLC
+
+### Bug Fixes and Other Changes
+
+* update the doc for Join function
+* change s3UploadMode of sagemaker clarify processing output for computer vision jobs.
+
+### Documentation Changes
+
+* Add Remote Function updates
+
+## v2.148.0 (2023-04-20)
+
+### Features
+
+* [huggingface] Add `torch.distributed` support for Trainium and `torchrun`
+* Add PyTorch 2.0 to SDK
+
+### Bug Fixes and Other Changes
+
+* updating batch transform job in monitoring schedule
+
+## v2.147.0 (2023-04-18)
+
+### Features
+
+* support different types of deletion mode
+
 ## v2.146.1 (2023-04-17)
 
 ### Bug Fixes and Other Changes

README.rst (+2)

@@ -133,6 +133,8 @@ To run the integration tests, the following prerequisites must be met
 1. AWS account credentials are available in the environment for the boto3 client to use.
 2. The AWS account has an IAM role named :code:`SageMakerRole`.
    It should have the AmazonSageMakerFullAccess policy attached as well as a policy with `the necessary permissions to use Elastic Inference <https://docs.aws.amazon.com/sagemaker/latest/dg/ei-setup.html>`__.
+3. To run remote_function tests, a dummy ECR repository must be created. It can be created by running
+   :code:`aws ecr create-repository --repository-name remote-function-dummy-container`
 
 We recommend selectively running just those integration tests you'd like to run. You can filter by individual test function names with:
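
For those who prefer to script this prerequisite, a minimal boto3 sketch of the same step follows. The repository name comes from the CLI command above; tolerating an already-existing repository is an added convenience, not part of the diff:

.. code:: python

    # Sketch: create the dummy ECR repository required by the remote_function
    # integration tests. Equivalent to the `aws ecr create-repository` CLI
    # call above; skipping an existing repository is a convenience assumption.
    import boto3
    from botocore.exceptions import ClientError

    ecr = boto3.client("ecr")
    try:
        ecr.create_repository(repositoryName="remote-function-dummy-container")
    except ClientError as err:
        if err.response["Error"]["Code"] != "RepositoryAlreadyExistsException":
            raise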

VERSION (+1 −1)

@@ -1 +1 @@
-2.146.2.dev0
+2.151.1.dev0

doc/api/training/sdp_versions/latest.rst (+1 −1)

@@ -26,7 +26,7 @@ depending on the version of the library you use.
 <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`_
 for more information.
 
-For versions between 1.4.0 and 1.7.0 (Latest)
+For versions between 1.4.0 and 1.8.0 (Latest)
 =============================================
 
 .. toctree::

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst (+32 −7)

@@ -5,39 +5,64 @@ Release Notes
 #############
 
 New features, bug fixes, and improvements are regularly made to the SageMaker
-distributed data parallel library.
+data parallelism library.
 
-SageMaker Distributed Data Parallel 1.7.0 Release Notes
+SageMaker Distributed Data Parallel 1.8.0 Release Notes
 =======================================================
 
-*Date: Feb. 10. 2023*
+*Date: Apr. 17. 2023*
 
 **Currency Updates**
 
-* Added support for PyTorch 1.13.1.
+* Added support for PyTorch 2.0.0.
 
 **Migration to AWS Deep Learning Containers**
 
 This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
 
-- PyTorch 1.13.1 DLC
+- PyTorch 2.0.0 DLC
 
 .. code::
 
-  763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
+  763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker
 
 Binary file of this version of the library for custom container users:
 
 .. code::
 
-  https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.13.1/cu117/2023-01-09/smdistributed_dataparallel-1.7.0-cp39-cp39-linux_x86_64.whl
+  https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.0/cu118/2023-03-20/smdistributed_dataparallel-1.8.0-cp310-cp310-linux_x86_64.whl
 
 ----
 
 Release History
 ===============
 
+SageMaker Distributed Data Parallel 1.7.0 Release Notes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+*Date: Feb. 10. 2023*
+
+**Currency Updates**
+
+* Added support for PyTorch 1.13.1.
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
+
+- PyTorch 1.13.1 DLC
+
+.. code::
+
+  763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
+
+Binary file of this version of the library for custom container users:
+
+.. code::
+
+  https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.13.1/cu117/2023-01-09/smdistributed_dataparallel-1.7.0-cp39-cp39-linux_x86_64.whl
+
 SageMaker Distributed Data Parallel 1.6.0 Release Notes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
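
As a usage note for the 1.8.0 DLC above, here is a minimal sketch of launching a training job with the data parallelism library enabled through the SageMaker Python SDK. The entry point, role, and instance settings are illustrative assumptions, not part of the diff:

.. code:: python

    # Sketch: PyTorch 2.0.0 training job with SMDDP enabled via the SDK's
    # `distribution` argument. Script, role, and instances are placeholders.
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",           # placeholder training script
        role="SageMakerRole",             # placeholder IAM role
        framework_version="2.0.0",
        py_version="py310",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",  # SMDDP supports select GPU instances
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )
    estimator.fit("s3://my_bucket/my_training_data/")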

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst (+78 −6)

@@ -3,12 +3,88 @@ Release Notes
 #############
 
 New features, bug fixes, and improvements are regularly made to the SageMaker
-distributed model parallel library.
+model parallelism library.
 
 
-SageMaker Distributed Model Parallel 1.14.0 Release Notes
+SageMaker Distributed Model Parallel 1.15.0 Release Notes
 =========================================================
 
+*Date: Apr. 27. 2023*
+
+**Currency Updates**
+
+* Added support for PyTorch v2.0.0.
+  Note that the library does not support ``torch.compile`` in this release.
+
+**New Features**
+
+* Using sharded data parallelism with tensor parallelism together is now
+  available for PyTorch 1.13.1. It allows you to train with smaller global batch
+  sizes while scaling up to large clusters. For more information, see `Sharded
+  data parallelism with tensor parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism>`_
+  in the *Amazon SageMaker Developer Guide*.
+* Added support for saving and loading full model checkpoints when using sharded
+  data parallelism. This is enabled by using the standard checkpointing API,
+  ``smp.save_checkpoint`` with ``partial=False``.
+  Previously, full checkpoints had to be created by merging partial checkpoint
+  files after training finished.
+* `DistributedTransformer <https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.html#smdistributed.modelparallel.torch.nn.DistributedTransformerLayer>`_
+  now supports ALiBi position embeddings.
+  When using DistributedTransformer, you can set the ``use_alibi`` parameter
+  to ``True`` to use the Triton-based flash attention kernels. This helps
+  evaluate sequences longer than those used for training.
+
+**Bug Fixes**
+
+* When using tensor parallelism, parameters were initialized multiple times
+  unnecessarily. This release fixed the multiple initialization of parameters
+  so that each parameter is initialized exactly once.
+  This not only saves time but also ensures that the random generator behavior
+  is similar to the non-tensor-parallelism case.
+
+**Known Issues**
+
+* Model initialization might take longer with PyTorch 2.0 than with PyTorch 1.13.
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
+
+- SageMaker training container for PyTorch v2.0.0
+
+  .. code::
+
+    763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker
+
+- SageMaker training container for PyTorch v1.13.1
+
+  .. code::
+
+    763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
+
+Binary file of this version of the library for `custom container
+<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_ users:
+
+- For PyTorch v2.0.0
+
+  .. code::
+
+    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-2.0.0/build-artifacts/2023-04-14-20-14/smdistributed_modelparallel-1.15.0-cp310-cp310-linux_x86_64.whl
+
+- For PyTorch v1.13.1
+
+  .. code::
+
+    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-04-17-15-49/smdistributed_modelparallel-1.15.0-cp39-cp39-linux_x86_64.whl
+
+----
+
+Release History
+===============
+
+SageMaker Distributed Model Parallel 1.14.0 Release Notes
+---------------------------------------------------------
+
 *Date: Jan. 30. 2023*
 
 **Currency Updates**
 
@@ -39,10 +115,6 @@ Binary file of this version of the library for `custom container
 
   https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-01-19-18-35/smdistributed_modelparallel-1.14.0-cp39-cp39-linux_x86_64.whl
 
-----
-
-Release History
-===============
 
 SageMaker Distributed Model Parallel 1.13.0 Release Notes
 ---------------------------------------------------------
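
The full-checkpoint feature described in the 1.15.0 notes can be sketched as follows, assuming a training script already running under the model parallelism library; the model, optimizer, path, and tag below are minimal placeholders, not the library's canonical example:

.. code:: python

    # Sketch: save a full (non-partial) checkpoint with the SMP checkpoint
    # API, per the v1.15.0 release notes. A real script would train first.
    import torch
    import smdistributed.modelparallel.torch as smp

    smp.init()
    model = smp.DistributedModel(torch.nn.Linear(8, 2))
    optimizer = smp.DistributedOptimizer(
        torch.optim.SGD(model.parameters(), lr=0.1)
    )

    # ... training steps ...

    smp.save_checkpoint(
        path="/opt/ml/checkpoints",  # checkpoint directory (assumption)
        tag="fullmodel.pt",
        partial=False,  # one full checkpoint; no shard merging afterwards
        model=model,
        optimizer=optimizer,
    )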

doc/api/training/smp_versions/latest.rst (+2 −2)

@@ -10,8 +10,8 @@ depending on which version of the library you need to use.
 To use the library, reference the
 **Common API** documentation alongside the framework specific API documentation.
 
-Version 1.11.0, 1.13.0, 1.14.0 (Latest)
-=======================================
+Version 1.11.0, 1.13.0, 1.14.0, 1.15.0 (Latest)
+===============================================
 
 To use the library, reference the Common API documentation alongside the framework specific API documentation.

doc/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.rst (+14)

@@ -302,6 +302,20 @@ Tensor Parallelism Module APIs
     - ``post_layernorm``: If ``True``, inserts layer normalization at
       the output. At least one of ``pre_layernorm`` and
       ``post_layernorm`` must be ``True``.
+    - ``use_alibi`` (bool, default False): Activates Attention with
+      Linear Biases (ALiBi) for attention computation.
+      ALiBi facilitates efficient extrapolation on input sequences
+      and thus improves training efficiency.
+      The library enables ALiBi by using the `Triton
+      flash attention kernel
+      <https://github.com/HazyResearch/flash-attention>`_.
+      Refer to https://arxiv.org/abs/2108.12409 for more
+      details on the technique.
+      (Available from
+      the SageMaker model parallelism library v1.15.0.)
+    - ``alibi_bias_max`` (int, default 8): Defines the ALiBi base
+      value for mask generation. (Available from
+      the SageMaker model parallelism library v1.15.0.)
 
   - **Methods:**
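
To make the two new parameters concrete, here is a hedged construction sketch for ``DistributedTransformerLayer`` with ALiBi enabled (SMP v1.15.0+); the layer sizes are illustrative placeholders, and the exact constructor arguments should be checked against the API reference above:

.. code:: python

    # Sketch: tensor-parallel transformer layer with ALiBi attention biases.
    # Sizes are illustrative; requires a job running under SMP v1.15.0+.
    import smdistributed.modelparallel.torch as smp

    smp.init()
    layer = smp.nn.DistributedTransformerLayer(
        hidden_size=1024,
        num_attention_heads=16,
        attention_head_size=64,
        intermediate_size=4096,
        use_alibi=True,    # Triton flash-attention kernel with ALiBi biases
        alibi_bias_max=8,  # ALiBi base value for mask generation (default)
    )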

doc/frameworks/djl/using_djl.rst (+5 −5)

@@ -31,7 +31,7 @@ You can either deploy your model using DeepSpeed or HuggingFace Accelerate, or l
     djl_model = DJLModel(
         "s3://my_bucket/my_saved_model_artifacts/", # This can also be a HuggingFace Hub model id
         "my_sagemaker_role",
-        data_type="fp16",
+        dtype="fp16",
         task="text-generation",
         number_of_partitions=2 # number of gpus to partition the model across
     )
@@ -48,7 +48,7 @@ If you want to use a specific backend, then you can create an instance of the co
     deepspeed_model = DeepSpeedModel(
         "s3://my_bucket/my_saved_model_artifacts/", # This can also be a HuggingFace Hub model id
         "my_sagemaker_role",
-        data_type="bf16",
+        dtype="bf16",
         task="text-generation",
         tensor_parallel_degree=2, # number of gpus to partition the model across using tensor parallelism
     )
@@ -58,7 +58,7 @@ If you want to use a specific backend, then you can create an instance of the co
     hf_accelerate_model = HuggingFaceAccelerateModel(
         "s3://my_bucket/my_saved_model_artifacts/", # This can also be a HuggingFace Hub model id
         "my_sagemaker_role",
-        data_type="fp16",
+        dtype="fp16",
         task="text-generation",
         number_of_partitions=2, # number of gpus to partition the model across
     )
@@ -109,7 +109,7 @@ For example, you can deploy the EleutherAI gpt-j-6B model like this:
     model = DJLModel(
         "EleutherAI/gpt-j-6B",
         "my_sagemaker_role",
-        data_type="fp16",
+        dtype="fp16",
         number_of_partitions=2
     )
@@ -142,7 +142,7 @@ You would then pass "s3://my_bucket/gpt-j-6B" as ``model_id`` to the ``DJLModel`
     model = DJLModel(
         "s3://my_bucket/gpt-j-6B",
         "my_sagemaker_role",
-        data_type="fp16",
+        dtype="fp16",
         number_of_partitions=2
     )
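
Since every example in this file moves from ``data_type`` to ``dtype``, a short end-to-end sketch with the renamed keyword may help; the instance type and prediction call are illustrative assumptions, not part of the diff:

.. code:: python

    # Sketch: create and deploy a DJLModel using the renamed `dtype` keyword.
    # Bucket, role, and instance type are placeholders.
    from sagemaker.djl_inference import DJLModel

    model = DJLModel(
        "s3://my_bucket/my_saved_model_artifacts/",
        "my_sagemaker_role",
        dtype="fp16",
        task="text-generation",
        number_of_partitions=2,
    )
    predictor = model.deploy("ml.g5.12xlarge")
    print(predictor.predict({"inputs": "Hello, my name is"}))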

doc/frameworks/pytorch/using_pytorch.rst (+1 −1)

@@ -892,7 +892,7 @@ see `For versions 1.1 and lower <#for-versions-1.1-and-lower>`_.
     | |--inference.py
     | |--requirements.txt
 
-Where ``requirments.txt`` is an optional file that specifies dependencies on third-party libraries.
+Where ``requirements.txt`` is an optional file that specifies dependencies on third-party libraries.
 
 Create a ``PyTorchModel`` object
 --------------------------------
