
Commit 7b8c470

Merge branch 'dev' into dev
2 parents: 64b5f51 + 169dffd

File tree: 5 files changed (+97, -39 lines)


doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst (+37, -18)

@@ -243,16 +243,25 @@ TensorFlow API

 .. function:: smdistributed.dataparallel.tensorflow.allreduce(tensor, param_index, num_params, compression=Compression.none, op=ReduceOp.AVERAGE)

-   Performs an all-reduce operation on a tensor (``tf.Tensor``).
+   Performs an ``allreduce`` operation on a tensor (``tf.Tensor``).
+
+   The ``smdistributed.dataparallel`` package provides the AllReduce API for
+   TensorFlow to allreduce gradient tensors. By default,
+   ``smdistributed.dataparallel`` allreduce averages the gradient tensors
+   across the participating workers.
+
+   .. note::
+
+      :class:`smdistributed.dataparallel.tensorflow.allreduce()` should
+      only be used to allreduce gradient tensors.
+      For other (non-gradient) tensors, you must use
+      :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`.
+      If you use :class:`smdistributed.dataparallel.tensorflow.allreduce()`
+      for non-gradient tensors,
+      the distributed training job might stall or stop.

-   ``smdistributed.dataparallel`` AllReduce API can be used for all
-   reducing gradient tensors or any other tensors. By
-   default, ``smdistributed.dataparallel`` AllReduce averages the
-   tensors across the participating workers.

   **Inputs:**

-   - ``tensor (tf.Tensor)(required)``: The tensor to be all-reduced. The shape of the input must be identical across all ranks.
+   - ``tensor (tf.Tensor)(required)``: The tensor to be allreduced. The shape of the input must be identical across all ranks.
   - ``param_index (int)(required)``: 0 if you are reducing a single tensor. Index of the tensor if you are reducing a list of tensors.
   - ``num_params (int)(required)``: len(tensor).
   - ``compression (smdistributed.dataparallel.tensorflow.Compression)(optional)``: Compression algorithm used to reduce the amount of data sent and received by each worker node. Defaults to not using compression.

@@ -306,9 +315,9 @@ TensorFlow API

 .. function:: smdistributed.dataparallel.tensorflow.oob_allreduce(tensor, compression=Compression.none, op=ReduceOp.AVERAGE)

-   OutOfBand (oob) AllReduce is simplified AllReduce function for use cases
+   Out-of-band (oob) AllReduce is a simplified AllReduce function for use cases
    such as calculating total loss across all the GPUs in the training.
-   oob_allreduce average the tensors, as reduction operation, across the
+   ``oob_allreduce`` averages the tensors, as the reduction operation, across the
    worker nodes.

   **Inputs:**

@@ -326,15 +335,25 @@ TensorFlow API

   - ``None``

-   .. rubric:: Notes
-
-   ``smdistributed.dataparallel.tensorflow.oob_allreduce``, in most
-   cases, is ~2x slower
-   than ``smdistributed.dataparallel.tensorflow.allreduce`` so it is not
-   recommended to be used for performing gradient reduction during the
-   training
-   process. ``smdistributed.dataparallel.tensorflow.oob_allreduce`` internally
-   uses NCCL AllReduce with ``ncclSum`` as the reduction operation.
+   .. note::
+
+      In most cases, the :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`
+      function is ~2x slower
+      than :class:`smdistributed.dataparallel.tensorflow.allreduce()`. It is not
+      recommended to use the :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`
+      function to perform gradient
+      reduction during the training process.
+      ``smdistributed.dataparallel.tensorflow.oob_allreduce`` internally
+      uses NCCL AllReduce with ``ncclSum`` as the reduction operation.
+
+   .. note::
+
+      :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()` should
+      only be used to allreduce non-gradient tensors.
+      If you use :class:`smdistributed.dataparallel.tensorflow.allreduce()`
+      for non-gradient tensors,
+      the distributed training job might stall or stop.
+      To allreduce gradients, use :class:`smdistributed.dataparallel.tensorflow.allreduce()`.


 .. function:: smdistributed.dataparallel.tensorflow.overlap(tensor)
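
The gradient/non-gradient split that the new notes draw is easier to see in code. The following is a minimal sketch, not part of the commit: it assumes a SageMaker data-parallel training script, and the loss and gradient values are placeholders standing in for a real training step. The calls follow the signatures documented above:

    import tensorflow as tf
    import smdistributed.dataparallel.tensorflow as sdp

    sdp.init()  # set up the data-parallel workers

    # Placeholder values standing in for a real training step.
    grads = [tf.ones([2, 2]), tf.ones([3])]   # gradient tensors
    step_loss = tf.constant(0.42)             # a non-gradient scalar

    # Gradient tensors go through allreduce(), following the documented
    # signature allreduce(tensor, param_index, num_params, ...).
    averaged_grads = [
        sdp.allreduce(g, param_index=i, num_params=len(grads))
        for i, g in enumerate(grads)
    ]

    # Non-gradient tensors (for example, a loss averaged for logging) go
    # through oob_allreduce(); per the note above, using allreduce() here
    # could stall or stop the training job.
    mean_loss = sdp.oob_allreduce(step_loss)

This is a sketch of the calling convention only, not a full training loop; it needs multiple workers in the SageMaker data parallel environment to do anything useful.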

src/sagemaker/huggingface/estimator.py (+7, -6)

@@ -50,14 +50,15 @@ def __init__(
         compiler_config=None,
         **kwargs,
     ):
-        """This ``Estimator`` executes a HuggingFace script in a managed execution environment.
+        """This estimator runs a Hugging Face training script in a SageMaker training environment.

-        The managed HuggingFace environment is an Amazon-built Docker container that executes
-        functions defined in the supplied ``entry_point`` Python script within a SageMaker
-        Training Job.
+        The estimator initiates the SageMaker-managed Hugging Face environment
+        by using the pre-built Hugging Face Docker container and runs
+        the Hugging Face training script that the user provides through
+        the ``entry_point`` argument.

-        Training is started by calling
-        :meth:`~sagemaker.amazon.estimator.Framework.fit` on this Estimator.
+        After configuring the estimator class, use the class method
+        :meth:`~sagemaker.amazon.estimator.Framework.fit()` to start a training job.

         Args:
             py_version (str): Python version you want to use for executing your model training
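
Since the revised docstring describes the ``entry_point``/``fit()`` flow, a short usage sketch may help. The script name, instance settings, role ARN, and framework versions below are illustrative assumptions, not values from this commit:

    from sagemaker.huggingface import HuggingFace

    huggingface_estimator = HuggingFace(
        entry_point="train.py",          # the user-provided training script
        instance_type="ml.p3.2xlarge",
        instance_count=1,
        role="arn:aws:iam::111122223333:role/SageMakerRole",
        transformers_version="4.6",
        pytorch_version="1.7",
        py_version="py36",
    )

    # As the revised docstring says, fit() starts the training job in the
    # pre-built Hugging Face container.
    huggingface_estimator.fit({"train": "s3://my-bucket/train"})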

src/sagemaker/model.py (+6, -3)

@@ -467,7 +467,7 @@ def _upload_code(self, key_prefix: str, repack: bool = False) -> None:
         )

     def _script_mode_env_vars(self):
-        """Placeholder docstring"""
+        """Returns a mapping of environment variables for script mode execution"""
         script_name = None
         dir_name = None
         if self.uploaded_code:

@@ -479,8 +479,11 @@ def _script_mode_env_vars(self):
         elif self.entry_point is not None:
             script_name = self.entry_point
             if self.source_dir is not None:
-                dir_name = "file://" + self.source_dir
-
+                dir_name = (
+                    self.source_dir
+                    if self.source_dir.startswith("s3://")
+                    else "file://" + self.source_dir
+                )
         return {
             SCRIPT_PARAM_NAME.upper(): script_name or str(),
             DIR_PARAM_NAME.upper(): dir_name or str(),
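
The effect of this change is that an ``s3://`` source directory is now passed through to the submit-directory environment variable unchanged, while local paths still receive the ``file://`` scheme. A standalone sketch of the new branch (the helper name ``resolve_dir_name`` is hypothetical, for illustration only):

    def resolve_dir_name(source_dir: str) -> str:
        """Mirror the new branch: keep S3 URIs as-is, prefix local paths with file://."""
        return source_dir if source_dir.startswith("s3://") else "file://" + source_dir

    # An S3 source dir now passes through unchanged ...
    assert resolve_dir_name("s3://codebucket/someprefix/sourcedir.tar.gz") == (
        "s3://codebucket/someprefix/sourcedir.tar.gz"
    )
    # ... while a local path still gets the file:// scheme, as before.
    assert resolve_dir_name("/opt/ml/code") == "file:///opt/ml/code"

The new unit test at the bottom of this commit asserts exactly this pass-through behavior via ``SAGEMAKER_SUBMIT_DIRECTORY``.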

src/sagemaker/training_compiler/config.py (+28, -11)

@@ -18,11 +18,7 @@


 class TrainingCompilerConfig(object):
-    """The configuration class for accelerating SageMaker training jobs through compilation.
-
-    SageMaker Training Compiler speeds up training by optimizing the model execution graph.
-
-    """
+    """The SageMaker Training Compiler configuration class."""

     DEBUG_PATH = "/opt/ml/output/data/compiler/"
     SUPPORTED_INSTANCE_CLASS_PREFIXES = ["p3", "g4dn", "p4"]

@@ -37,9 +33,15 @@ def __init__(
     ):
         """This class initializes a ``TrainingCompilerConfig`` instance.

-        Pass the output of it to the ``compiler_config``
+        `Amazon SageMaker Training Compiler
+        <https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html>`_
+        is a feature of SageMaker Training
+        that speeds up training jobs by optimizing model execution graphs.
+
+        You can compile Hugging Face models
+        by passing an object of this configuration class to the ``compiler_config``
         parameter of the :class:`~sagemaker.huggingface.HuggingFace`
-        class.
+        estimator.

         Args:
             enabled (bool): Optional. Switch to enable SageMaker Training Compiler.

@@ -48,13 +50,28 @@
             This comes with a potential performance slowdown.
             The default is ``False``.

-        **Example**: The following example shows the basic ``compiler_config``
-        parameter configuration, enabling compilation with default parameter values.
+        **Example**: The following code shows the basic usage of the
+        :class:`sagemaker.huggingface.TrainingCompilerConfig()` class
+        to run a HuggingFace training job with the compiler.

         .. code-block:: python

-            from sagemaker.huggingface import TrainingCompilerConfig
-            compiler_config = TrainingCompilerConfig()
+            from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig
+
+            huggingface_estimator = HuggingFace(
+                ...
+                compiler_config=TrainingCompilerConfig()
+            )
+
+        .. seealso::
+
+           For more information about how to enable SageMaker Training Compiler
+           for various training settings such as using TensorFlow-based models,
+           PyTorch-based models, and distributed training,
+           see `Enable SageMaker Training Compiler
+           <https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-enable.html>`_
+           in the `Amazon SageMaker Training Compiler developer guide
+           <https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html>`_.

         """
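
The docstring example elides the other estimator arguments with ``...``; a fuller sketch of the same pattern follows, with the documented ``enabled`` switch spelled out. The script name, role ARN, instance type, and framework versions are illustrative assumptions (the class itself only checks the instance class prefixes ``p3``, ``g4dn``, and ``p4`` listed above):

    from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

    # The Args section above documents `enabled` as an optional switch;
    # it is passed explicitly here for clarity.
    compiler_config = TrainingCompilerConfig(enabled=True)

    huggingface_estimator = HuggingFace(
        entry_point="train.py",          # illustrative script name
        instance_type="ml.p3.2xlarge",   # p3 is one of the supported instance class prefixes
        instance_count=1,
        role="arn:aws:iam::111122223333:role/SageMakerRole",  # illustrative role ARN
        transformers_version="4.11",     # illustrative versions; see the linked
        pytorch_version="1.9",           # developer guide for supported combinations
        py_version="py38",
        compiler_config=compiler_config,
    )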

tests/unit/sagemaker/model/test_model.py (+19, -1)

@@ -26,6 +26,8 @@
 from sagemaker.sklearn.model import SKLearnModel
 from sagemaker.tensorflow.model import TensorFlowModel
 from sagemaker.xgboost.model import XGBoostModel
+from sagemaker.workflow.properties import Properties
+

 MODEL_DATA = "s3://bucket/model.tar.gz"
 MODEL_IMAGE = "mi"

@@ -42,7 +44,6 @@
 BRANCH = "test-branch-git-config"
 COMMIT = "ae15c9d7d5b97ea95ea451e4662ee43da3401d73"
 ENTRY_POINT_INFERENCE = "inference.py"
-
 SCRIPT_URI = "s3://codebucket/someprefix/sourcedir.tar.gz"
 IMAGE_URI = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:1.9.0-gpu-py38"

@@ -71,6 +72,23 @@ def sagemaker_session():
     return sms


+@patch("shutil.rmtree", MagicMock())
+@patch("tarfile.open", MagicMock())
+@patch("os.listdir", MagicMock(return_value=[ENTRY_POINT_INFERENCE]))
+def test_prepare_container_def_with_model_src_s3_returns_correct_url(sagemaker_session):
+    model = Model(
+        entry_point=ENTRY_POINT_INFERENCE,
+        role=ROLE,
+        sagemaker_session=sagemaker_session,
+        source_dir=SCRIPT_URI,
+        image_uri=MODEL_IMAGE,
+        model_data=Properties("Steps.MyStep"),
+    )
+    container_def = model.prepare_container_def(INSTANCE_TYPE, "ml.eia.medium")
+
+    assert container_def["Environment"]["SAGEMAKER_SUBMIT_DIRECTORY"] == SCRIPT_URI
+
+
 def test_prepare_container_def_with_model_data():
     model = Model(MODEL_IMAGE)
     container_def = model.prepare_container_def(INSTANCE_TYPE, "ml.eia.medium")
