
Commit 8df2e4b

Merge branch 'master' into tf-2.10
2 parents 3a8fcf6 + 9607e06 commit 8df2e4b


17 files changed: +1040 -514 lines changed


CHANGELOG.md

Lines changed: 21 additions & 0 deletions
@@ -1,5 +1,26 @@
 # Changelog

+## v2.114.0 (2022-10-26)
+
+### Features
+
+* Graviton support for XGB and SKLearn frameworks
+* Graviton support for PyTorch and Tensorflow frameworks
+* do not expand estimator role when it is pipeline parameter
+* added support for batch transform with model monitoring
+
+### Bug Fixes and Other Changes
+
+* regex in tuning integs
+* remove debugger environment var set up
+* adjacent slash in s3 key
+* Fix Repack step auto install behavior
+* Add retry for airflow ParsingError
+
+### Documentation Changes
+
+* doc fix
+
 ## v2.113.0 (2022-10-21)

 ### Features
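The two Graviton entries above extend ARM-based (Graviton) instance support beyond TensorFlow and PyTorch to the XGBoost and SKLearn frameworks. As a rough illustration only (the model artifact path, role, entry point, and framework version below are placeholders, not values from this commit), deploying an SKLearn model to a Graviton instance family might look like:

    from sagemaker.sklearn.model import SKLearnModel

    # All values below are hypothetical; substitute your own artifact, role, and script.
    model = SKLearnModel(
        model_data="s3://my-bucket/model.tar.gz",
        role="SageMakerRole",
        entry_point="inference.py",
        framework_version="1.0-1",  # assumed to be a Graviton-capable version
    )

    # ml.c6g.* is one of the instance families this commit allows for Graviton
    # (c6g, t4g, r6g, m6g in GRAVITON_ALLOWED_TARGET_INSTANCE_FAMILY).
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.c6g.xlarge",
    )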

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.113.1.dev0
+2.114.1.dev0

doc/frameworks/pytorch/using_pytorch.rst

Lines changed: 0 additions & 101 deletions
@@ -293,107 +293,6 @@ using two ``ml.p4d.24xlarge`` instances:

     pt_estimator.fit("s3://bucket/path/to/training/data")

-.. _distributed-pytorch-training-on-trainium:
-
-Distributed PyTorch Training on Trainium
-========================================
-
-SageMaker Training on Trainium instances now supports the ``xla``
-package through ``torchrun``. With this, you do not need to manually pass RANK,
-WORLD_SIZE, MASTER_ADDR, and MASTER_PORT. You can launch the training job using the
-:class:`sagemaker.pytorch.estimator.PyTorch` estimator class
-with the ``torch_distributed`` option as the distribution strategy.
-
-.. note::
-
-  This ``torch_distributed`` support is available
-  in the SageMaker Trainium (trn1) PyTorch Deep Learning Containers starting v1.11.0.
-  To find a complete list of supported versions of PyTorch Neuron, see `Neuron Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers>`_ in the *AWS Deep Learning Containers GitHub repository*.
-
-  SageMaker Debugger and Profiler are currently not supported with Trainium instances.
-
-Adapt Your Training Script to Initialize with the XLA backend
--------------------------------------------------------------
-
-To initialize distributed training in your script, call
-`torch.distributed.init_process_group
-<https://pytorch.org/docs/master/distributed.html#torch.distributed.init_process_group>`_
-with the ``xla`` backend as shown below.
-
-.. code:: python
-
-    import torch.distributed as dist
-
-    dist.init_process_group('xla')
-
-SageMaker takes care of ``'MASTER_ADDR'`` and ``'MASTER_PORT'`` for you via ``torchrun``
-
-For detailed documentation about modifying your training script for Trainium, see `Multi-worker data-parallel MLP training using torchrun <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html?highlight=torchrun#multi-worker-data-parallel-mlp-training-using-torchrun>`_ in the *AWS Neuron Documentation*.
-
-**Currently Supported backends:**
-
-- ``xla`` for Trainium (Trn1) instances
-
-For up-to-date information on supported backends for Trainium instances, see `AWS Neuron Documentation <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html>`_.
-
-Launching a Distributed Training Job on Trainium
-------------------------------------------------
-
-You can run multi-node distributed PyTorch training jobs on Trainium instances using the
-:class:`sagemaker.pytorch.estimator.PyTorch` estimator class.
-With ``instance_count=1``, the estimator submits a
-single-node training job to SageMaker; with ``instance_count`` greater
-than one, a multi-node training job is launched.
-
-With the ``torch_distributed`` option, the SageMaker PyTorch estimator runs a SageMaker
-training container for PyTorch Neuron, sets up the environment, and launches
-the training job using the ``torchrun`` command on each worker with the given information.
-
-**Examples**
-
-The following examples show how to run a PyTorch training using ``torch_distributed`` in SageMaker
-on one ``ml.trn1.2xlarge`` instance and two ``ml.trn1.32xlarge`` instances:
-
-.. code:: python
-
-    from sagemaker.pytorch import PyTorch
-
-    pt_estimator = PyTorch(
-        entry_point="train_ptddp.py",
-        role="SageMakerRole",
-        framework_version="1.11.0",
-        py_version="py38",
-        instance_count=1,
-        instance_type="ml.trn1.2xlarge",
-        distribution={
-            "torch_distributed": {
-                "enabled": True
-            }
-        }
-    )
-
-    pt_estimator.fit("s3://bucket/path/to/training/data")
-
-.. code:: python
-
-    from sagemaker.pytorch import PyTorch
-
-    pt_estimator = PyTorch(
-        entry_point="train_ptddp.py",
-        role="SageMakerRole",
-        framework_version="1.11.0",
-        py_version="py38",
-        instance_count=2,
-        instance_type="ml.trn1.32xlarge",
-        distribution={
-            "torch_distributed": {
-                "enabled": True
-            }
-        }
-    )
-
-    pt_estimator.fit("s3://bucket/path/to/training/data")
-
 *********************
 Deploy PyTorch Models
 *********************

src/sagemaker/estimator.py

Lines changed: 0 additions & 4 deletions
@@ -865,10 +865,6 @@ def _prepare_debugger_for_training(self):
         self.debugger_rule_configs = self._prepare_debugger_rules()
         self._prepare_collection_configs()
         self._validate_and_set_debugger_configs()
-        if not self.debugger_hook_config:
-            if self.environment is None:
-                self.environment = {}
-            self.environment[DEBUGGER_FLAG] = "0"

     def _validate_and_set_debugger_configs(self):
         """Set defaults for debugging."""

src/sagemaker/fw_utils.py

Lines changed: 6 additions & 6 deletions
@@ -146,6 +146,12 @@
 SMDISTRIBUTED_SUPPORTED_STRATEGIES = ["dataparallel", "modelparallel"]


+GRAVITON_ALLOWED_TARGET_INSTANCE_FAMILY = ["c6g", "t4g", "r6g", "m6g"]
+
+
+GRAVITON_ALLOWED_FRAMEWORKS = set(["tensorflow", "pytorch", "xgboost", "sklearn"])
+
+
 def validate_source_dir(script, directory):
     """Validate that the source directory exists and it contains the user script.

@@ -165,12 +171,6 @@ def validate_source_dir(script, directory):
     return True


-GRAVITON_ALLOWED_TARGET_INSTANCE_FAMILY = ["c6g", "t4g", "r6g", "m6g"]
-
-
-GRAVITON_ALLOWED_FRAMEWORKS = set(["tensorflow", "pytorch"])
-
-
 def validate_source_code_input_against_pipeline_variables(
     entry_point: Optional[Union[str, PipelineVariable]] = None,
     source_dir: Optional[Union[str, PipelineVariable]] = None,
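Besides moving the two Graviton constants above `validate_source_dir`, this hunk widens `GRAVITON_ALLOWED_FRAMEWORKS` from `tensorflow`/`pytorch` to also include `xgboost` and `sklearn`. A hypothetical sketch of how such constants can gate a request; the helper name and error messages are illustrative, not the actual fw_utils API:

    GRAVITON_ALLOWED_TARGET_INSTANCE_FAMILY = ["c6g", "t4g", "r6g", "m6g"]
    GRAVITON_ALLOWED_FRAMEWORKS = set(["tensorflow", "pytorch", "xgboost", "sklearn"])


    def _validate_graviton_request(framework, instance_type):
        """Illustrative check: reject unsupported framework/instance-family pairs."""
        # "ml.c6g.xlarge" -> "c6g"
        instance_family = instance_type.split(".")[1]
        if instance_family not in GRAVITON_ALLOWED_TARGET_INSTANCE_FAMILY:
            raise ValueError(f"{instance_family} is not a Graviton instance family.")
        if framework not in GRAVITON_ALLOWED_FRAMEWORKS:
            raise ValueError(f"{framework} is not supported on Graviton instances.")


    # Passes after this commit; "sklearn" would have been rejected before it.
    _validate_graviton_request("sklearn", "ml.c6g.xlarge")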
