
Commit 973810a

Merge branch 'master' into 3826-bugfix-lambda-step
2 parents 9f093f4 + af26174 commit 973810a

File tree

31 files changed: +1013 −119 lines changed


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@ env/
 .vscode/
 **/tmp
 .python-version
+*.html
 **/_repack_script_launcher.sh
 tests/data/**/_repack_model.py
 tests/data/experiment/sagemaker-dev-1.0.tar.gz

CHANGELOG.md

Lines changed: 25 additions & 0 deletions
@@ -1,5 +1,30 @@
 # Changelog
 
+## v2.152.0 (2023-05-04)
+
+### Features
+
+ * add support for lineage visualization using pyvis
+ * Expose Experiment class publicly
+ * PyTorch 1.13 release
+
+### Bug Fixes and Other Changes
+
+ * Change data_type argument to dtype to keep consistent with D…
+ * Skip edge test
+ * make RemoteExecutor context manager non-blocking on pending futures
+ * Add inferentia2 DLC images for djl framework
+ * Fix typo in using_pytorch.rst
+ * Unable to attach estimator to training job when KeepAlivePeriodInSeconds specified
+ * update LMI container image
+ * Update Clarify SHAPConfig baseline to allow JSON structures
+
+### Documentation Changes
+
+ * Fix broken link in DJL SageMaker docs
+ * currency update for the SageMaker data parallelism lib
+ * SM model parallel library v1.15.0 release note
+
 ## v2.151.0 (2023-04-27)
 
 ### Features

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.151.1.dev0
+2.152.1.dev0

doc/api/training/sdp_versions/latest.rst

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ depending on the version of the library you use.
 <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`_
 for more information.
 
-For versions between 1.4.0 and 1.7.0 (Latest)
+For versions between 1.4.0 and 1.8.0 (Latest)
 =============================================
 
 .. toctree::

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst

Lines changed: 32 additions & 7 deletions
@@ -5,39 +5,64 @@ Release Notes
 #############
 
 New features, bug fixes, and improvements are regularly made to the SageMaker
-distributed data parallel library.
+data parallelism library.
 
-SageMaker Distributed Data Parallel 1.7.0 Release Notes
+SageMaker Distributed Data Parallel 1.8.0 Release Notes
 =======================================================
 
-*Date: Feb. 10. 2023*
+*Date: Apr. 17. 2023*
 
 **Currency Updates**
 
-* Added support for PyTorch 1.13.1.
+* Added support for PyTorch 2.0.0.
 
 **Migration to AWS Deep Learning Containers**
 
 This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
 
-- PyTorch 1.13.1 DLC
+- PyTorch 2.0.0 DLC
 
 .. code::
 
-   763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
+   763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker
 
 Binary file of this version of the library for custom container users:
 
 .. code::
 
-   https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.13.1/cu117/2023-01-09/smdistributed_dataparallel-1.7.0-cp39-cp39-linux_x86_64.whl
+   https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.0/cu118/2023-03-20/smdistributed_dataparallel-1.8.0-cp310-cp310-linux_x86_64.whl
 
 ----
 
 Release History
 ===============
 
+SageMaker Distributed Data Parallel 1.7.0 Release Notes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+*Date: Feb. 10. 2023*
+
+**Currency Updates**
+
+* Added support for PyTorch 1.13.1.
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
+
+- PyTorch 1.13.1 DLC
+
+.. code::
+
+   763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
+
+Binary file of this version of the library for custom container users:
+
+.. code::
+
+   https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.13.1/cu117/2023-01-09/smdistributed_dataparallel-1.7.0-cp39-cp39-linux_x86_64.whl
+
 SageMaker Distributed Data Parallel 1.6.0 Release Notes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
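For custom container users, the notes above list both a DLC image tag and a standalone binary wheel. A hypothetical Dockerfile sketch of installing the 1.8.0 wheel into a custom image (the wheel URL is from the notes above; the base image name is illustrative and assumes a CUDA 11.8 / Python 3.10 / PyTorch 2.0.0 environment):

```dockerfile
# Hypothetical custom base image; must match the wheel's cp310 / cu118 build
FROM my-registry/pytorch:2.0.0-cu118-py310

# Install the smdistributed.dataparallel 1.8.0 binary for custom containers
RUN pip install --no-cache-dir \
    https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.0/cu118/2023-03-20/smdistributed_dataparallel-1.8.0-cp310-cp310-linux_x86_64.whl
```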

doc/experiments/sagemaker.experiments.rst

Lines changed: 9 additions & 0 deletions
@@ -11,6 +11,15 @@ Run
 
 .. automethod:: sagemaker.experiments.list_runs
 
+Experiment
+-------------
+
+.. autoclass:: sagemaker.experiments.Experiment
+    :members:
+
+Other
+-------------
+
 .. autoclass:: sagemaker.experiments.SortByType
     :members:
     :undoc-members:

doc/frameworks/djl/using_djl.rst

Lines changed: 6 additions & 6 deletions
@@ -31,7 +31,7 @@ You can either deploy your model using DeepSpeed or HuggingFace Accelerate, or l
     djl_model = DJLModel(
         "s3://my_bucket/my_saved_model_artifacts/", # This can also be a HuggingFace Hub model id
         "my_sagemaker_role",
-        data_type="fp16",
+        dtype="fp16",
         task="text-generation",
         number_of_partitions=2 # number of gpus to partition the model across
     )
@@ -48,7 +48,7 @@ If you want to use a specific backend, then you can create an instance of the co
     deepspeed_model = DeepSpeedModel(
         "s3://my_bucket/my_saved_model_artifacts/", # This can also be a HuggingFace Hub model id
         "my_sagemaker_role",
-        data_type="bf16",
+        dtype="bf16",
         task="text-generation",
         tensor_parallel_degree=2, # number of gpus to partition the model across using tensor parallelism
     )
@@ -58,7 +58,7 @@ If you want to use a specific backend, then you can create an instance of the co
     hf_accelerate_model = HuggingFaceAccelerateModel(
         "s3://my_bucket/my_saved_model_artifacts/", # This can also be a HuggingFace Hub model id
         "my_sagemaker_role",
-        data_type="fp16",
+        dtype="fp16",
         task="text-generation",
         number_of_partitions=2, # number of gpus to partition the model across
     )
@@ -109,7 +109,7 @@ For example, you can deploy the EleutherAI gpt-j-6B model like this:
     model = DJLModel(
         "EleutherAI/gpt-j-6B",
         "my_sagemaker_role",
-        data_type="fp16",
+        dtype="fp16",
         number_of_partitions=2
     )
 
@@ -142,7 +142,7 @@ You would then pass "s3://my_bucket/gpt-j-6B" as ``model_id`` to the ``DJLModel`
     model = DJLModel(
         "s3://my_bucket/gpt-j-6B",
         "my_sagemaker_role",
-        data_type="fp16",
+        dtype="fp16",
         number_of_partitions=2
     )
 
@@ -213,7 +213,7 @@ For more information about DJL Serving, see the `DJL Serving documentation. <htt
 SageMaker DJL Classes
 ***********************
 
-For information about the different DJL Serving related classes in the SageMaker Python SDK, see https://sagemaker.readthedocs.io/en/stable/sagemaker.djl_inference.html.
+For information about the different DJL Serving related classes in the SageMaker Python SDK, see https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html.
 
 ********************************
 SageMaker DJL Serving Containers
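The renamed ``dtype`` argument ultimately flows into DJL Serving's ``serving.properties`` as ``option.dtype``. A simplified standalone sketch of that mapping, following the ``generate_serving_properties`` logic in this commit's ``model.py`` changes (``build_serving_properties`` is a hypothetical helper, not the SDK's actual function):

```python
def build_serving_properties(dtype=None, task=None, min_workers=None):
    """Map DJLModel-style settings onto DJL serving.properties keys."""
    props = {}
    if task:
        props["option.task"] = task
    if dtype:
        props["option.dtype"] = dtype
    if min_workers:
        props["minWorkers"] = min_workers
    return props

print(build_serving_properties(dtype="fp16", task="text-generation"))
# {'option.task': 'text-generation', 'option.dtype': 'fp16'}
```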

requirements/extras/test_requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ fabric==2.6.0
 requests==2.27.1
 sagemaker-experiments==0.1.35
 Jinja2==3.0.3
+pyvis==0.2.1
 pandas>=1.3.5,<1.5
 scikit-learn==1.0.2
 cloudpickle==2.2.1

src/sagemaker/djl_inference/model.py

Lines changed: 19 additions & 12 deletions
@@ -233,7 +233,7 @@ def __init__(
         role: str,
         djl_version: Optional[str] = None,
         task: Optional[str] = None,
-        data_type: str = "fp32",
+        dtype: str = "fp32",
         number_of_partitions: Optional[int] = None,
         min_workers: Optional[int] = None,
         max_workers: Optional[int] = None,
@@ -264,7 +264,7 @@ def __init__(
             task (str): The HuggingFace/NLP task you want to launch this model for. Defaults to
                 None.
                 If not provided, the task will be inferred from the model architecture by DJL.
-            data_type (str): The data type to use for loading your model. Accepted values are
+            dtype (str): The data type to use for loading your model. Accepted values are
                 "fp32", "fp16", "bf16", "int8". Defaults to "fp32".
             number_of_partitions (int): The number of GPUs to partition the model across. The
                 partitioning strategy is determined by the selected backend. If DeepSpeed is
@@ -322,13 +322,20 @@ def __init__(
                 "You only need to set model_id and ensure it points to uncompressed model "
                 "artifacts in s3, or a valid HuggingFace Hub model_id."
             )
+        data_type = kwargs.pop("data_type", None)
+        if data_type:
+            logger.warning(
+                "data_type is being deprecated in favor of dtype. Please migrate use of data_type"
+                " to dtype. Support for data_type will be removed in a future release"
+            )
+            dtype = dtype or data_type
         super(DJLModel, self).__init__(
             None, image_uri, role, entry_point, predictor_cls=predictor_cls, **kwargs
         )
         self.model_id = model_id
         self.djl_version = djl_version
         self.task = task
-        self.data_type = data_type
+        self.dtype = dtype
         self.number_of_partitions = number_of_partitions
         self.min_workers = min_workers
         self.max_workers = max_workers
@@ -372,7 +379,7 @@ def transformer(self, **_):
             "DJLModels do not currently support Batch Transform inference jobs"
         )
 
-    def right_size(self, checkpoint_data_type: str):
+    def right_size(self, **_):
         """Not implemented.
 
         DJLModels do not support SageMaker Inference Recommendation Jobs.
@@ -573,8 +580,8 @@ def generate_serving_properties(self, serving_properties=None) -> Dict[str, str]
             serving_properties["option.entryPoint"] = self.entry_point
         if self.task:
             serving_properties["option.task"] = self.task
-        if self.data_type:
-            serving_properties["option.dtype"] = self.data_type
+        if self.dtype:
+            serving_properties["option.dtype"] = self.dtype
         if self.min_workers:
             serving_properties["minWorkers"] = self.min_workers
         if self.max_workers:
@@ -779,7 +786,7 @@ def __init__(
                 None.
             load_in_8bit (bool): Whether to load the model in int8 precision using bits and bytes
                 quantization. This is only supported for select model architectures.
-                Defaults to False. If ``data_type`` is int8, then this is set to True.
+                Defaults to False. If ``dtype`` is int8, then this is set to True.
             low_cpu_mem_usage (bool): Whether to limit CPU memory usage to 1x model size during
                 model loading. This is an experimental feature in HuggingFace. This is useful when
                 loading multiple instances of your model in parallel. Defaults to False.
@@ -832,19 +839,19 @@ def generate_serving_properties(self, serving_properties=None) -> Dict[str, str]
         if self.device_map:
             serving_properties["option.device_map"] = self.device_map
         if self.load_in_8bit:
-            if self.data_type != "int8":
-                raise ValueError("Set data_type='int8' to use load_in_8bit")
+            if self.dtype != "int8":
+                raise ValueError("Set dtype='int8' to use load_in_8bit")
             serving_properties["option.load_in_8bit"] = self.load_in_8bit
-        if self.data_type == "int8":
+        if self.dtype == "int8":
             serving_properties["option.load_in_8bit"] = True
         if self.low_cpu_mem_usage:
             serving_properties["option.low_cpu_mem_usage"] = self.low_cpu_mem_usage
         # This is a workaround due to a bug in our built in handler for huggingface
         # TODO: This needs to be fixed when new dlc is published
         if (
             serving_properties["option.entryPoint"] == "djl_python.huggingface"
-            and self.data_type
-            and self.data_type != "auto"
+            and self.dtype
+            and self.dtype != "auto"
         ):
             serving_properties["option.dtype"] = "auto"
             serving_properties.pop("option.load_in_8bit", None)
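The backward-compatibility shim in this file (accepting the deprecated ``data_type`` kwarg while warning callers toward ``dtype``) is a reusable pattern for renaming keyword arguments. A minimal standalone sketch, not the SDK's exact code (``resolve_dtype`` is a hypothetical helper name):

```python
import warnings

def resolve_dtype(dtype=None, **kwargs):
    """Prefer the new `dtype` kwarg; accept deprecated `data_type` with a warning."""
    data_type = kwargs.pop("data_type", None)
    if data_type is not None:
        warnings.warn(
            "data_type is being deprecated in favor of dtype. "
            "Support for data_type will be removed in a future release.",
            DeprecationWarning,
        )
    # Fall back to the deprecated value only when dtype was not given
    return dtype if dtype is not None else (data_type or "fp32")

print(resolve_dtype(data_type="bf16"))  # bf16
print(resolve_dtype(dtype="fp16"))      # fp16
print(resolve_dtype())                  # fp32
```

Popping ``data_type`` out of ``**kwargs`` before delegating upward keeps the unknown kwarg from reaching the parent constructor, which is the same reason the shim in the diff runs before ``super().__init__``.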

src/sagemaker/experiments/__init__.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,7 @@
1414
from __future__ import absolute_import
1515

1616
from sagemaker.experiments.run import Run # noqa: F401
17+
from sagemaker.experiments.experiment import Experiment # noqa: F401
1718
from sagemaker.experiments.run import load_run # noqa: F401
1819
from sagemaker.experiments.run import list_runs # noqa: F401
1920
from sagemaker.experiments.run import SortOrderType # noqa: F401

src/sagemaker/experiments/experiment.py

Lines changed: 10 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -22,11 +22,11 @@
2222
from sagemaker.experiments.trial_component import _TrialComponent
2323

2424

25-
class _Experiment(_base_types.Record):
25+
class Experiment(_base_types.Record):
2626
"""An Amazon SageMaker experiment, which is a collection of related trials.
2727
28-
New experiments are created by calling `experiments.experiment._Experiment.create`.
29-
Existing experiments can be reloaded by calling `experiments.experiment._Experiment.load`.
28+
New experiments are created by calling `experiments.experiment.Experiment.create`.
29+
Existing experiments can be reloaded by calling `experiments.experiment.Experiment.load`.
3030
3131
Attributes:
3232
experiment_name (str): The name of the experiment. The name must be unique
@@ -73,7 +73,7 @@ def delete(self):
7373

7474
@classmethod
7575
def load(cls, experiment_name, sagemaker_session=None):
76-
"""Load an existing experiment and return an `_Experiment` object representing it.
76+
"""Load an existing experiment and return an `Experiment` object representing it.
7777
7878
Args:
7979
experiment_name: (str): Name of the experiment
@@ -83,7 +83,7 @@ def load(cls, experiment_name, sagemaker_session=None):
8383
default AWS configuration chain.
8484
8585
Returns:
86-
experiments.experiment._Experiment: A SageMaker `_Experiment` object
86+
experiments.experiment.Experiment: A SageMaker `Experiment` object
8787
"""
8888
return cls._construct(
8989
cls._boto_load_method,
@@ -100,7 +100,7 @@ def create(
100100
tags=None,
101101
sagemaker_session=None,
102102
):
103-
"""Create a new experiment in SageMaker and return an `_Experiment` object.
103+
"""Create a new experiment in SageMaker and return an `Experiment` object.
104104
105105
Args:
106106
experiment_name: (str): Name of the experiment. Must be unique. Required.
@@ -115,7 +115,7 @@ def create(
115115
(default: None).
116116
117117
Returns:
118-
experiments.experiment._Experiment: A SageMaker `_Experiment` object
118+
experiments.experiment.Experiment: A SageMaker `Experiment` object
119119
"""
120120
return cls._construct(
121121
cls._boto_create_method,
@@ -154,10 +154,10 @@ def _load_or_create(
154154
exist and a new experiment has to be created.
155155
156156
Returns:
157-
experiments.experiment._Experiment: A SageMaker `_Experiment` object
157+
experiments.experiment.Experiment: A SageMaker `Experiment` object
158158
"""
159159
try:
160-
experiment = _Experiment.create(
160+
experiment = Experiment.create(
161161
experiment_name=experiment_name,
162162
display_name=display_name,
163163
description=description,
@@ -170,7 +170,7 @@ def _load_or_create(
170170
if not (error_code == "ValidationException" and "already exists" in error_message):
171171
raise ce
172172
# already exists
173-
experiment = _Experiment.load(experiment_name, sagemaker_session)
173+
experiment = Experiment.load(experiment_name, sagemaker_session)
174174
return experiment
175175

176176
def list_trials(self, created_before=None, created_after=None, sort_by=None, sort_order=None):
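``_load_or_create`` in this file follows a common idempotent create-or-load pattern: attempt the create, and when the backend reports the resource already exists, fall back to loading it. A toy in-memory sketch of the same control flow (all names here are illustrative, not SageMaker APIs):

```python
class AlreadyExistsError(Exception):
    """Raised when a resource with the given name already exists."""

_REGISTRY = {}  # stand-in for the remote backend

def create(name):
    if name in _REGISTRY:
        raise AlreadyExistsError(f"{name} already exists")
    _REGISTRY[name] = {"experiment_name": name}
    return _REGISTRY[name]

def load(name):
    return _REGISTRY[name]

def load_or_create(name):
    # Try to create first; if it already exists, load it instead.
    try:
        return create(name)
    except AlreadyExistsError:
        return load(name)

first = load_or_create("my-experiment")
second = load_or_create("my-experiment")  # hits the load branch
print(first is second)  # True
```

Trying the create first (rather than checking existence up front) avoids a time-of-check/time-of-use race when two callers create the same experiment concurrently, which is why the SDK catches the "already exists" ValidationException instead of pre-checking.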

src/sagemaker/experiments/run.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -32,7 +32,7 @@
3232
)
3333
from sagemaker.experiments._environment import _RunEnvironment
3434
from sagemaker.experiments._run_context import _RunContext
35-
from sagemaker.experiments.experiment import _Experiment
35+
from sagemaker.experiments.experiment import Experiment
3636
from sagemaker.experiments._metrics import _MetricsManager
3737
from sagemaker.experiments.trial import _Trial
3838
from sagemaker.experiments.trial_component import _TrialComponent
@@ -166,7 +166,7 @@ def __init__(
166166
)
167167
self.run_group_name = Run._generate_trial_name(self.experiment_name)
168168

169-
self._experiment = _Experiment._load_or_create(
169+
self._experiment = Experiment._load_or_create(
170170
experiment_name=self.experiment_name,
171171
display_name=experiment_display_name,
172172
tags=tags,
