
Commit 559db04

Merge remote-tracking branch 'origin' into feat/jumpstart-model-estimator-classes
2 parents: 738df16 + af26174


42 files changed: +1193 -155 lines

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ env/
 .vscode/
 **/tmp
 .python-version
+*.html
 **/_repack_script_launcher.sh
 tests/data/**/_repack_model.py
 tests/data/experiment/sagemaker-dev-1.0.tar.gz

CHANGELOG.md

Lines changed: 25 additions & 0 deletions
@@ -1,5 +1,30 @@
 # Changelog
 
+## v2.152.0 (2023-05-04)
+
+### Features
+
+* add support for lineage visualization using pyvis
+* Expose Experiment class publicly
+* PyTorch 1.13 release
+
+### Bug Fixes and Other Changes
+
+* Change data_type argument to dtype to keep consistent with D…
+* Skip edge test
+* make RemoteExecutor context manager non-blocking on pending futures
+* Add inferentia2 DLC images for djl framework
+* Fix typo in using_pytorch.rst
+* Unable to attach estimator to training job when KeepAlivePeriodInSeconds specified
+* update LMI container image
+* Update Clarify SHAPConfig baseline to allow JSON structures
+
+### Documentation Changes
+
+* Fix broken link in DJL SageMaker docs
+* currency update for the SageMaker data parallelism lib
+* SM model parallel library v1.15.0 release note
+
 ## v2.151.0 (2023-04-27)
 
 ### Features

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.151.1.dev0
+2.152.1.dev0

doc/api/training/sdp_versions/latest.rst

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ depending on the version of the library you use.
    <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`_
    for more information.
 
-For versions between 1.4.0 and 1.7.0 (Latest)
+For versions between 1.4.0 and 1.8.0 (Latest)
 =============================================
 
 .. toctree::

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst

Lines changed: 32 additions & 7 deletions
@@ -5,39 +5,64 @@ Release Notes
 #############
 
 New features, bug fixes, and improvements are regularly made to the SageMaker
-distributed data parallel library.
+data parallelism library.
 
-SageMaker Distributed Data Parallel 1.7.0 Release Notes
+SageMaker Distributed Data Parallel 1.8.0 Release Notes
 =======================================================
 
-*Date: Feb. 10. 2023*
+*Date: Apr. 17. 2023*
 
 **Currency Updates**
 
-* Added support for PyTorch 1.13.1.
+* Added support for PyTorch 2.0.0.
 
 **Migration to AWS Deep Learning Containers**
 
 This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
 
-- PyTorch 1.13.1 DLC
+- PyTorch 2.0.0 DLC
 
 .. code::
 
-   763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
+   763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker
 
 Binary file of this version of the library for custom container users:
 
 .. code::
 
-   https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.13.1/cu117/2023-01-09/smdistributed_dataparallel-1.7.0-cp39-cp39-linux_x86_64.whl
+   https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.0/cu118/2023-03-20/smdistributed_dataparallel-1.8.0-cp310-cp310-linux_x86_64.whl
 
 
 ----
 
 Release History
 ===============
 
+SageMaker Distributed Data Parallel 1.7.0 Release Notes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+*Date: Feb. 10. 2023*
+
+**Currency Updates**
+
+* Added support for PyTorch 1.13.1.
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
+
+- PyTorch 1.13.1 DLC
+
+.. code::
+
+   763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
+
+Binary file of this version of the library for custom container users:
+
+.. code::
+
+   https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.13.1/cu117/2023-01-09/smdistributed_dataparallel-1.7.0-cp39-cp39-linux_x86_64.whl
+
 SageMaker Distributed Data Parallel 1.6.0 Release Notes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
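The 1.8.0 DLC listed above is consumed through the SageMaker Python SDK by enabling the data parallelism library on an estimator. A minimal sketch, assuming a hypothetical ``train.py`` entry point, IAM role, and S3 prefix:

    # Sketch: launch a training job on the PyTorch 2.0.0 DLC with the
    # SageMaker data parallelism library enabled.
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",           # hypothetical training script
        role="my_sagemaker_role",         # hypothetical IAM role
        framework_version="2.0.0",
        py_version="py310",
        instance_type="ml.p4d.24xlarge",  # a multi-GPU instance type supported by the library
        instance_count=2,
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )
    estimator.fit("s3://my_bucket/training-data")  # hypothetical S3 prefix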

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst

Lines changed: 78 additions & 6 deletions
@@ -3,12 +3,88 @@ Release Notes
 #############
 
 New features, bug fixes, and improvements are regularly made to the SageMaker
-distributed model parallel library.
+model parallelism library.
 
 
-SageMaker Distributed Model Parallel 1.14.0 Release Notes
+SageMaker Distributed Model Parallel 1.15.0 Release Notes
 =========================================================
 
+*Date: Apr. 27. 2023*
+
+**Currency Updates**
+
+* Added support for PyTorch v2.0.0.
+  Note that the library does not support ``torch.compile`` in this release.
+
+**New Features**
+
+* Using sharded data parallelism with tensor parallelism together is now
+  available for PyTorch 1.13.1. It allows you to train with smaller global batch
+  sizes while scaling up to large clusters. For more information, see `Sharded
+  data parallelism with tensor parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism>`_
+  in the *Amazon SageMaker Developer Guide*.
+* Added support for saving and loading full model checkpoints when using sharded
+  data parallelism. This is enabled by using the standard checkpointing API,
+  ``smp.save_checkpoint`` with ``partial=False``.
+  Before, full checkpoints needed to be created by merging partial checkpoint
+  files after training finishes.
+* `DistributedTransformer <https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.html#smdistributed.modelparallel.torch.nn.DistributedTransformerLayer>`_
+  now supports the ALiBi position embeddings.
+  When using DistributedTransformer, you can set the ``use_alibi`` parameter
+  to ``True`` to use the Triton-based flash attention kernels. This helps
+  evaluate sequences longer than those used for training.
+
+**Bug Fixes**
+
+* When using tensor parallelism, parameters were initialized multiple times
+  unnecessarily. This release fixed the multiple initialization of parameters
+  so that each parameter is initialized exactly once.
+  It not only saves time, but also ensures that the random generator behavior
+  is similar to the non-tensor parallelism case.
+
+**Known issues**
+
+* Model initialization might take longer with PyTorch 2.0 than with PyTorch 1.13.
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
+
+- SageMaker training container for PyTorch v2.0.0
+
+  .. code::
+
+    763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker
+
+- SageMaker training container for PyTorch v1.13.1
+
+  .. code::
+
+    763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
+
+Binary file of this version of the library for `custom container
+<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_ users:
+
+- For PyTorch v2.0.0
+
+  .. code::
+
+    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-2.0.0/build-artifacts/2023-04-14-20-14/smdistributed_modelparallel-1.15.0-cp310-cp310-linux_x86_64.whl
+
+- For PyTorch v1.13.1
+
+  .. code::
+
+    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-04-17-15-49/smdistributed_modelparallel-1.15.0-cp39-cp39-linux_x86_64.whl
+
+----
+
+Release History
+===============
+
+SageMaker Distributed Model Parallel 1.14.0 Release Notes
+---------------------------------------------------------
+
 *Date: Jan. 30. 2023*
 
 **Currency Updates**

@@ -39,10 +115,6 @@ Binary file of this version of the library for `custom container
 
   https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-01-19-18-35/smdistributed_modelparallel-1.14.0-cp39-cp39-linux_x86_64.whl
 
-----
-
-Release History
-===============
 
 SageMaker Distributed Model Parallel 1.13.0 Release Notes
 ---------------------------------------------------------
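The full-checkpoint support called out in the 1.15.0 notes goes through the standard ``smp.save_checkpoint`` API. A minimal sketch, assuming a training script already configured for sharded data parallelism with ``model`` as an ``smp.DistributedModel`` in scope; the path and tag are hypothetical:

    import smdistributed.modelparallel.torch as smp

    # Save one merged, full checkpoint rather than per-rank partial files.
    smp.save_checkpoint(
        path="/opt/ml/checkpoints",  # hypothetical checkpoint directory
        tag="fullmodel.pt",
        partial=False,               # full checkpoint; no post-training merge step needed
        model=model,
    )

    # Later, resume from the full checkpoint the same way.
    smp.resume_from_checkpoint("/opt/ml/checkpoints", tag="fullmodel.pt", partial=False)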

doc/api/training/smp_versions/latest.rst

Lines changed: 2 additions & 2 deletions
@@ -10,8 +10,8 @@ depending on which version of the library you need to use.
 To use the library, reference the
 **Common API** documentation alongside the framework specific API documentation.
 
-Version 1.11.0, 1.13.0, 1.14.0 (Latest)
-=======================================
+Version 1.11.0, 1.13.0, 1.14.0, 1.15.0 (Latest)
+===============================================
 
 To use the library, reference the Common API documentation alongside the framework specific API documentation.
doc/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.rst

Lines changed: 14 additions & 0 deletions
@@ -302,6 +302,20 @@ Tensor Parallelism Module APIs
   - ``post_layernorm``: If ``True``, inserts layer normalization at
     the output. At least one of ``pre_layernorm`` and
     ``post_layernorm`` must be ``True``.
+  - ``use_alibi`` (bool, default False): Activates Attention with
+    Linear Biases (ALiBi) for attention computation.
+    ALiBi facilitates efficient extrapolation on input sequences
+    and thus improves training efficiency.
+    The library enables ALiBi by using the `Triton
+    flash attention kernel
+    <https://github.com/HazyResearch/flash-attention>`_.
+    Refer to https://arxiv.org/abs/2108.12409 for more
+    details on the technique.
+    (Available from
+    the SageMaker model parallelism library v1.15.0.)
+  - ``alibi_bias_max`` (int, default 8): Defines the ALiBi base
+    value for mask generation. (Available from
+    the SageMaker model parallelism library v1.15.0.)
 
 - **Methods:**
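For reference, the two new flags attach to the ``DistributedTransformerLayer`` constructor. A minimal sketch; the sizing arguments shown are representative assumptions, not the complete signature:

    import smdistributed.modelparallel.torch as smp

    layer = smp.nn.DistributedTransformerLayer(
        hidden_size=4096,        # representative model dimensions
        num_attention_heads=32,
        use_alibi=True,          # v1.15.0: Triton flash-attention ALiBi kernels
        alibi_bias_max=8,        # v1.15.0: base value for ALiBi mask generation
    )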

doc/experiments/sagemaker.experiments.rst

Lines changed: 9 additions & 0 deletions
@@ -11,6 +11,15 @@ Run
 
 .. automethod:: sagemaker.experiments.list_runs
 
+Experiment
+-------------
+
+.. autoclass:: sagemaker.experiments.Experiment
+    :members:
+
+Other
+-------------
+
 .. autoclass:: sagemaker.experiments.SortByType
     :members:
     :undoc-members:
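With this change, ``Experiment`` is documented alongside the existing ``Run`` and ``list_runs`` APIs and is importable directly from ``sagemaker.experiments``. A minimal sketch with hypothetical experiment and run names:

    from sagemaker.experiments import Experiment, Run, list_runs

    # Log a parameter under a run; the experiment is created if it does not exist.
    with Run(experiment_name="my-experiment", run_name="my-run") as run:
        run.log_parameter("learning_rate", 0.01)

    # Enumerate the runs recorded under the experiment.
    for tracked_run in list_runs(experiment_name="my-experiment"):
        print(tracked_run.run_name)  # run_name property assumed from the Run API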

doc/frameworks/djl/using_djl.rst

Lines changed: 6 additions & 6 deletions
@@ -31,7 +31,7 @@ You can either deploy your model using DeepSpeed or HuggingFace Accelerate, or l
     djl_model = DJLModel(
         "s3://my_bucket/my_saved_model_artifacts/", # This can also be a HuggingFace Hub model id
         "my_sagemaker_role",
-        data_type="fp16",
+        dtype="fp16",
         task="text-generation",
         number_of_partitions=2 # number of gpus to partition the model across
     )

@@ -48,7 +48,7 @@ If you want to use a specific backend, then you can create an instance of the co
     deepspeed_model = DeepSpeedModel(
         "s3://my_bucket/my_saved_model_artifacts/", # This can also be a HuggingFace Hub model id
         "my_sagemaker_role",
-        data_type="bf16",
+        dtype="bf16",
         task="text-generation",
         tensor_parallel_degree=2, # number of gpus to partition the model across using tensor parallelism
     )

@@ -58,7 +58,7 @@ If you want to use a specific backend, then you can create an instance of the co
     hf_accelerate_model = HuggingFaceAccelerateModel(
         "s3://my_bucket/my_saved_model_artifacts/", # This can also be a HuggingFace Hub model id
         "my_sagemaker_role",
-        data_type="fp16",
+        dtype="fp16",
         task="text-generation",
         number_of_partitions=2, # number of gpus to partition the model across
     )

@@ -109,7 +109,7 @@ For example, you can deploy the EleutherAI gpt-j-6B model like this:
     model = DJLModel(
         "EleutherAI/gpt-j-6B",
         "my_sagemaker_role",
-        data_type="fp16",
+        dtype="fp16",
         number_of_partitions=2
     )

@@ -142,7 +142,7 @@ You would then pass "s3://my_bucket/gpt-j-6B" as ``model_id`` to the ``DJLModel`
     model = DJLModel(
         "s3://my_bucket/gpt-j-6B",
         "my_sagemaker_role",
-        data_type="fp16",
+        dtype="fp16",
         number_of_partitions=2
     )

@@ -213,7 +213,7 @@ For more information about DJL Serving, see the `DJL Serving documentation. <htt
 SageMaker DJL Classes
 ***********************
 
-For information about the different DJL Serving related classes in the SageMaker Python SDK, see https://sagemaker.readthedocs.io/en/stable/sagemaker.djl_inference.html.
+For information about the different DJL Serving related classes in the SageMaker Python SDK, see https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html.
 
 ********************************
 SageMaker DJL Serving Containers

doc/frameworks/pytorch/using_pytorch.rst

Lines changed: 1 addition & 1 deletion
@@ -892,7 +892,7 @@ see `For versions 1.1 and lower <#for-versions-1.1-and-lower>`_.
     | |--inference.py
     | |--requirements.txt
 
-Where ``requirments.txt`` is an optional file that specifies dependencies on third-party libraries.
+Where ``requirements.txt`` is an optional file that specifies dependencies on third-party libraries.
 
 Create a ``PyTorchModel`` object
 --------------------------------

requirements/extras/test_requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ fabric==2.6.0
 requests==2.27.1
 sagemaker-experiments==0.1.35
 Jinja2==3.0.3
+pyvis==0.2.1
 pandas>=1.3.5,<1.5
 scikit-learn==1.0.2
 cloudpickle==2.2.1

src/sagemaker/clarify.py

Lines changed: 4 additions & 2 deletions
@@ -94,6 +94,8 @@
                     {object: object},
                 )
             ],
+            # Arbitrary JSON object as baseline
+            {object: object},
         ),
         SchemaOptional("num_clusters"): int,
         SchemaOptional("use_logit"): bool,

@@ -1211,7 +1213,7 @@ class SHAPConfig(ExplainabilityConfig):
 
     def __init__(
         self,
-        baseline: Optional[Union[str, List]] = None,
+        baseline: Optional[Union[str, List, Dict]] = None,
        num_samples: Optional[int] = None,
         agg_method: Optional[str] = None,
         use_logit: bool = False,

@@ -1224,7 +1226,7 @@ def __init__(
         """Initializes config for SHAP analysis.
 
         Args:
-            baseline (None or str or list): `Baseline dataset <https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-feature-attribute-shap-baselines.html>`_
+            baseline (None or str or list or dict): `Baseline dataset <https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-feature-attribute-shap-baselines.html>`_
                 for the Kernel SHAP algorithm, accepted in the form of:
                 S3 object URI, a list of rows (with at least one element),
                 or None (for no input baseline). The baseline dataset must have the same format
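The widened ``baseline`` type above lets a JSON structure be passed directly to ``SHAPConfig``. A minimal sketch with a hypothetical single-row baseline layout:

    from sagemaker.clarify import SHAPConfig

    # baseline may now be a dict (JSON structure), in addition to an S3 URI
    # or a list of rows; the feature layout here is hypothetical.
    shap_config = SHAPConfig(
        baseline={"features": [["no", 0, 1]]},
        num_samples=100,
        agg_method="mean_abs",
    )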
