
Commit 95e2ca9

Merge branch 'dev' into feat/custom-job-name-for-jumpstart
2 parents: b47b1d5 + 4325fcd

33 files changed: +3565 additions, -3106 deletions

.readthedocs.yml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 version: 2

 python:
-  version: 3.6
+  version: 3.9
   install:
     - method: pip
       path: .

doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst

Lines changed: 37 additions & 18 deletions
@@ -243,16 +243,25 @@ TensorFlow API


 .. function:: smdistributed.dataparallel.tensorflow.allreduce(tensor, param_index, num_params, compression=Compression.none, op=ReduceOp.AVERAGE)

-   Performs an all-reduce operation on a tensor (``tf.Tensor``).
+   Performs an ``allreduce`` operation on a tensor (``tf.Tensor``).
+
+   The ``smdistributed.dataparallel`` package's AllReduce API for TensorFlow is used to
+   allreduce gradient tensors. By default, ``smdistributed.dataparallel`` allreduce averages
+   the gradient tensors across participating workers.
+
+   .. note::
+
+      :class:`smdistributed.dataparallel.tensorflow.allreduce()` should
+      only be used to allreduce gradient tensors.
+      For other (non-gradient) tensors, you must use
+      :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`.
+      If you use :class:`smdistributed.dataparallel.tensorflow.allreduce()`
+      for non-gradient tensors,
+      the distributed training job might stall or stop.

-   ``smdistributed.dataparallel`` AllReduce API can be used for all
-   reducing gradient tensors or any other tensors. By
-   default, ``smdistributed.dataparallel`` AllReduce averages the
-   tensors across the participating workers.
-
    **Inputs:**

-   - ``tensor (tf.Tensor)(required)``: The tensor to be all-reduced. The shape of the input must be identical across all ranks.
+   - ``tensor (tf.Tensor)(required)``: The tensor to be allreduced. The shape of the input must be identical across all ranks.
    - ``param_index (int)(required):`` 0 if you are reducing a single tensor. Index of the tensor if you are reducing a list of tensors.
    - ``num_params (int)(required):`` len(tensor).
    - ``compression (smdistributed.dataparallel.tensorflow.Compression)(optional)``: Compression algorithm used to reduce the amount of data sent and received by each worker node. Defaults to not using compression.

@@ -306,9 +315,9 @@ TensorFlow API


 .. function:: smdistributed.dataparallel.tensorflow.oob_allreduce(tensor, compression=Compression.none, op=ReduceOp.AVERAGE)

-   OutOfBand (oob) AllReduce is simplified AllReduce function for use cases
+   Out-of-band (oob) AllReduce is a simplified AllReduce function for use cases
    such as calculating total loss across all the GPUs in the training.
-   oob_allreduce average the tensors, as reduction operation, across the
+   ``oob_allreduce`` averages the tensors, as a reduction operation, across the
    worker nodes.

    **Inputs:**

@@ -326,15 +335,25 @@ TensorFlow API

    - ``None``

-   .. rubric:: Notes
-
-   ``smdistributed.dataparallel.tensorflow.oob_allreduce``, in most
-   cases, is ~2x slower
-   than ``smdistributed.dataparallel.tensorflow.allreduce``  so it is not
-   recommended to be used for performing gradient reduction during the
-   training
-   process. ``smdistributed.dataparallel.tensorflow.oob_allreduce`` internally
-   uses NCCL AllReduce with ``ncclSum`` as the reduction operation.
+   .. note::
+
+      In most cases, the :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`
+      function is ~2x slower
+      than :class:`smdistributed.dataparallel.tensorflow.allreduce()`. It is not
+      recommended to use the :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`
+      function for performing gradient
+      reduction during the training process.
+      ``smdistributed.dataparallel.tensorflow.oob_allreduce`` internally
+      uses NCCL AllReduce with ``ncclSum`` as the reduction operation.
+
+   .. note::
+
+      :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()` should
+      only be used to allreduce non-gradient tensors.
+      If you use :class:`smdistributed.dataparallel.tensorflow.allreduce()`
+      for non-gradient tensors,
+      the distributed training job might stall or stop.
+      To allreduce gradients, use :class:`smdistributed.dataparallel.tensorflow.allreduce()`.


 .. function:: smdistributed.dataparallel.tensorflow.overlap(tensor)
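The averaging semantics of the ``allreduce`` API above can be illustrated without a GPU cluster. The following is a minimal single-process sketch in plain Python (not the ``smdistributed`` API, which runs across distributed workers) of what an averaging allreduce computes:

```python
# Single-process sketch of averaging allreduce semantics.
# Illustrative only: the real smdistributed.dataparallel.tensorflow.allreduce
# averages tf.Tensor gradients across workers over the network.

def allreduce_average(per_worker_tensors):
    """Element-wise average of one tensor replicated across workers."""
    num_workers = len(per_worker_tensors)
    length = len(per_worker_tensors[0])
    return [
        sum(worker[i] for worker in per_worker_tensors) / num_workers
        for i in range(length)
    ]

# Each worker holds its own gradient for the same parameter;
# the shapes must be identical across ranks, as the API requires.
grads = [
    [1.0, 2.0, 3.0],  # worker 0
    [3.0, 4.0, 5.0],  # worker 1
]
print(allreduce_average(grads))  # [2.0, 3.0, 4.0]
```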

doc/conf.py

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@
 # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
 # ANY KIND, either express or implied. See the License for the specific
 # language governing permissions and limitations under the License.
-"""Placeholder docstring"""
+"""Configuration for generating readthedocs docstrings."""
 from __future__ import absolute_import

 import pkg_resources

src/sagemaker/huggingface/estimator.py

Lines changed: 7 additions & 6 deletions
@@ -50,14 +50,15 @@ def __init__(
         compiler_config=None,
         **kwargs,
     ):
-        """This ``Estimator`` executes a HuggingFace script in a managed execution environment.
+        """This estimator runs a Hugging Face training script in a SageMaker training environment.

-        The managed HuggingFace environment is an Amazon-built Docker container that executes
-        functions defined in the supplied ``entry_point`` Python script within a SageMaker
-        Training Job.
+        The estimator initiates the SageMaker-managed Hugging Face environment
+        by using the pre-built Hugging Face Docker container and runs
+        the Hugging Face training script that the user provides through
+        the ``entry_point`` argument.

-        Training is started by calling
-        :meth:`~sagemaker.amazon.estimator.Framework.fit` on this Estimator.
+        After configuring the estimator class, use the class method
+        :meth:`~sagemaker.amazon.estimator.Framework.fit()` to start a training job.

         Args:
             py_version (str): Python version you want to use for executing your model training

src/sagemaker/image_uri_config/neo-tensorflow.json

Lines changed: 32 additions & 1 deletion
@@ -12,7 +12,8 @@
       "1.11.0": "1.15.3",
       "1.12.0": "1.15.3",
       "1.13.0": "1.15.3",
-      "1.14.0": "1.15.3"
+      "1.14.0": "1.15.3",
+      "2.4.2": "2.4.2"
     },
     "versions": {
       "1.15.3": {
@@ -44,6 +45,36 @@
         "us-west-2": "301217895009"
       },
       "repository": "sagemaker-inference-tensorflow"
+    },
+    "2.4.2": {
+      "py_versions": ["py3"],
+      "registries": {
+        "af-south-1": "774647643957",
+        "ap-east-1": "110948597952",
+        "ap-northeast-1": "941853720454",
+        "ap-northeast-2": "151534178276",
+        "ap-northeast-3": "925152966179",
+        "ap-south-1": "763008648453",
+        "ap-southeast-1": "324986816169",
+        "ap-southeast-2": "355873309152",
+        "ca-central-1": "464438896020",
+        "cn-north-1": "472730292857",
+        "cn-northwest-1": "474822919863",
+        "eu-central-1": "746233611703",
+        "eu-north-1": "601324751636",
+        "eu-south-1": "966458181534",
+        "eu-west-1": "802834080501",
+        "eu-west-2": "205493899709",
+        "eu-west-3": "254080097072",
+        "me-south-1": "836785723513",
+        "sa-east-1": "756306329178",
+        "us-east-1": "785573368785",
+        "us-east-2": "007439368137",
+        "us-gov-west-1": "263933020539",
+        "us-west-1": "710691900526",
+        "us-west-2": "301217895009"
+      },
+      "repository": "sagemaker-inference-tensorflow"
     }
   }
 }
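To show how a config file like ``neo-tensorflow.json`` is consumed, here is a hypothetical sketch of the version-alias and region-registry lookup. The ``resolve_image`` helper and the exact URI format are illustrative assumptions; the SDK's real resolution logic lives in ``sagemaker.image_uris`` and is more involved:

```python
# Hypothetical sketch (not the SDK's actual code) of resolving an image URI
# from a neo-tensorflow.json-style config: alias -> version -> registry.

config = {
    "version_aliases": {"1.14.0": "1.15.3", "2.4.2": "2.4.2"},
    "versions": {
        "1.15.3": {"registries": {"us-west-2": "301217895009"},
                   "repository": "sagemaker-inference-tensorflow"},
        "2.4.2": {"registries": {"us-west-2": "301217895009"},
                  "repository": "sagemaker-inference-tensorflow"},
    },
}

def resolve_image(config, framework_version, region):
    # Follow the alias table first; fall back to the version itself.
    version = config["version_aliases"].get(framework_version, framework_version)
    entry = config["versions"][version]
    account = entry["registries"][region]
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{entry['repository']}:{version}"

print(resolve_image(config, "2.4.2", "us-west-2"))
# 301217895009.dkr.ecr.us-west-2.amazonaws.com/sagemaker-inference-tensorflow:2.4.2
```

Note how the new ``"2.4.2": "2.4.2"`` alias added by this commit makes the 2.4.2 entry reachable.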

src/sagemaker/model.py

Lines changed: 6 additions & 3 deletions
@@ -466,7 +466,7 @@ def _upload_code(self, key_prefix: str, repack: bool = False) -> None:
         )

     def _script_mode_env_vars(self):
-        """Placeholder docstring"""
+        """Returns a mapping of environment variables for script mode execution."""
         script_name = None
         dir_name = None
         if self.uploaded_code:
@@ -478,8 +478,11 @@ def _script_mode_env_vars(self):
         elif self.entry_point is not None:
             script_name = self.entry_point
             if self.source_dir is not None:
-                dir_name = "file://" + self.source_dir
-
+                dir_name = (
+                    self.source_dir
+                    if self.source_dir.startswith("s3://")
+                    else "file://" + self.source_dir
+                )
         return {
             SCRIPT_PARAM_NAME.upper(): script_name or str(),
             DIR_PARAM_NAME.upper(): dir_name or str(),
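The behavior of the new ``dir_name`` expression can be checked in isolation. A stand-alone sketch of just that branch (``resolve_dir_name`` is a hypothetical helper, not part of the SDK):

```python
# Stand-alone sketch of the changed dir_name logic: an S3 source_dir is
# passed through unchanged, while a local path gets a file:// prefix.

def resolve_dir_name(source_dir):
    if source_dir is None:
        return None
    return source_dir if source_dir.startswith("s3://") else "file://" + source_dir

assert resolve_dir_name("s3://bucket/prefix/code") == "s3://bucket/prefix/code"
assert resolve_dir_name("/opt/ml/code") == "file:///opt/ml/code"
```

Before this change, an S3 ``source_dir`` would have been mangled into ``file://s3://...``.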

src/sagemaker/serializers.py

Lines changed: 34 additions & 1 deletion
@@ -18,7 +18,6 @@
 import csv
 import io
 import json
-
 import numpy as np
 from six import with_metaclass

@@ -357,3 +356,37 @@ def serialize(self, data):
             return data.read()

         raise ValueError("Unable to handle input format: %s" % type(data))
+
+
+class DataSerializer(SimpleBaseSerializer):
+    """Serialize data in any file by extracting raw bytes from the file."""
+
+    def __init__(self, content_type="file-path/raw-bytes"):
+        """Initialize a ``DataSerializer`` instance.
+
+        Args:
+            content_type (str): The MIME type to signal to the inference endpoint when sending
+                request data (default: "file-path/raw-bytes").
+        """
+        super(DataSerializer, self).__init__(content_type=content_type)
+
+    def serialize(self, data):
+        """Serialize file data to raw bytes.
+
+        Args:
+            data (object): Data to be serialized. The data can be a string
+                representing a file path or the raw bytes from a file.
+        Returns:
+            raw-bytes: The data serialized as raw bytes from the input.
+        """
+        if isinstance(data, str):
+            try:
+                with open(data, "rb") as data_file:
+                    data_file_info = data_file.read()
+                    return data_file_info
+            except Exception as e:
+                raise ValueError(f"Could not open/read file: {data}. {e}")
+        if isinstance(data, bytes):
+            return data
+
+        raise ValueError(f"Object of type {type(data)} is not Data serializable.")
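The new serializer's behavior can be exercised without the SageMaker SDK installed; the following sketch mirrors the ``serialize`` logic added above as a free function and runs it against a temporary file:

```python
# Minimal stand-in mirroring DataSerializer.serialize above, so the
# file-path-or-bytes behavior can be tested without the SageMaker SDK.
import tempfile

def serialize(data):
    if isinstance(data, str):
        try:
            with open(data, "rb") as data_file:
                return data_file.read()
        except Exception as e:
            raise ValueError(f"Could not open/read file: {data}. {e}")
    if isinstance(data, bytes):
        return data
    raise ValueError(f"Object of type {type(data)} is not Data serializable.")

# A string is treated as a file path and read as raw bytes...
with tempfile.NamedTemporaryFile(suffix=".raw", delete=False) as f:
    f.write(b"\x00\x01payload")
    path = f.name
assert serialize(path) == b"\x00\x01payload"

# ...while bytes pass through unchanged.
assert serialize(b"abc") == b"abc"
```

This matches the test fixture added in this commit (``tests/data/cuteCat.raw``), which feeds a raw binary file to the serializer.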

src/sagemaker/training_compiler/config.py

Lines changed: 28 additions & 11 deletions
@@ -18,11 +18,7 @@


 class TrainingCompilerConfig(object):
-    """The configuration class for accelerating SageMaker training jobs through compilation.
-
-    SageMaker Training Compiler speeds up training by optimizing the model execution graph.
-
-    """
+    """The SageMaker Training Compiler configuration class."""

     DEBUG_PATH = "/opt/ml/output/data/compiler/"
     SUPPORTED_INSTANCE_CLASS_PREFIXES = ["p3", "g4dn", "p4"]

@@ -37,9 +33,15 @@ def __init__(
     ):
         """This class initializes a ``TrainingCompilerConfig`` instance.

-        Pass the output of it to the ``compiler_config``
+        `Amazon SageMaker Training Compiler
+        <https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html>`_
+        is a feature of SageMaker Training
+        and speeds up training jobs by optimizing model execution graphs.
+
+        You can compile Hugging Face models
+        by passing the object of this configuration class to the ``compiler_config``
         parameter of the :class:`~sagemaker.huggingface.HuggingFace`
-        class.
+        estimator.

         Args:
             enabled (bool): Optional. Switch to enable SageMaker Training Compiler.

@@ -48,13 +50,28 @@ def __init__(
                 This comes with a potential performance slowdown.
                 The default is ``False``.

-        **Example**: The following example shows the basic ``compiler_config``
-        parameter configuration, enabling compilation with default parameter values.
+        **Example**: The following code shows the basic usage of the
+        :class:`sagemaker.huggingface.TrainingCompilerConfig()` class
+        to run a HuggingFace training job with the compiler.

         .. code-block:: python

-            from sagemaker.huggingface import TrainingCompilerConfig
-            compiler_config = TrainingCompilerConfig()
+            from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig
+
+            huggingface_estimator=HuggingFace(
+                ...
+                compiler_config=TrainingCompilerConfig()
+            )
+
+        .. seealso::
+
+            For more information about how to enable SageMaker Training Compiler
+            for various training settings such as using TensorFlow-based models,
+            PyTorch-based models, and distributed training,
+            see `Enable SageMaker Training Compiler
+            <https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-enable.html>`_
+            in the `Amazon SageMaker Training Compiler developer guide
+            <https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html>`_.

         """
src/sagemaker/workflow/steps.py

Lines changed: 2 additions & 0 deletions
@@ -301,6 +301,8 @@ def arguments(self) -> RequestType:
         )
         request_dict = self.estimator.sagemaker_session._get_train_request(**train_args)
         request_dict.pop("TrainingJobName")
+        if "HyperParameters" in request_dict:
+            request_dict["HyperParameters"].pop("sagemaker_job_name", None)

         return request_dict
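The effect of the two added lines in ``arguments`` can be sketched on a plain dictionary (the sample ``request_dict`` contents here are illustrative, not a real training request):

```python
# Sketch of the change above: the SDK-injected sagemaker_job_name
# hyperparameter is stripped from the request, along with TrainingJobName,
# so pipeline step arguments do not pin a job name at definition time.

request_dict = {
    "TrainingJobName": "my-custom-job",
    "HyperParameters": {"sagemaker_job_name": '"my-custom-job"', "epochs": "10"},
    "AlgorithmSpecification": {"TrainingInputMode": "File"},
}

request_dict.pop("TrainingJobName", None)
if "HyperParameters" in request_dict:
    request_dict["HyperParameters"].pop("sagemaker_job_name", None)

# Only the user's real hyperparameters remain.
print(request_dict["HyperParameters"])  # {'epochs': '10'}
```

Using ``pop(..., None)`` keeps the step working even when no ``sagemaker_job_name`` hyperparameter was injected.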

tests/data/cuteCat.raw

6.43 KB
Binary file not shown.

tests/integ/__init__.py

Lines changed: 0 additions & 6 deletions
@@ -148,12 +148,6 @@
     "eu-west-2",
     "us-east-1",
 ]
-NO_SM_PIPELINE_MM_CLARIFY_CHECK_STEP_REGIONS = [
-    "ap-northeast-3",
-    "ap-south-1",
-    "eu-north-1",
-    "sa-east-1",
-]
 EDGE_PACKAGING_SUPPORTED_REGIONS = [
     "us-east-2",
     "us-west-2",

tests/integ/sagemaker/lineage/conftest.py

Lines changed: 3 additions & 1 deletion
@@ -26,7 +26,9 @@
     artifact,
 )
 from sagemaker.model import ModelPackage
-from tests.integ.test_workflow import test_end_to_end_pipeline_successful_execution
+from tests.integ.sagemaker.workflow.test_workflow import (
+    test_end_to_end_pipeline_successful_execution,
+)
 from sagemaker.workflow.pipeline import _PipelineExecution
 from sagemaker.session import get_execution_role
 from smexperiments import trial_component, trial, experiment

tests/integ/sagemaker/workflow/__init__.py

Whitespace-only changes.
