documentation: add doc for SMDDP collectives #3

Status: Open. Wants to merge 41 commits into base branch ``smddp-coll``.

Commits (41):
c97c467
fix: type hint of PySparkProcessor __init__ (#3297)
NivekNey Dec 2, 2022
de58941
fix: fix PySparkProcessor __init__ params type (#3354)
andre-marcos-perez Dec 2, 2022
41dd330
fix: Allow Py 3.7 for MMS Test Docker env (#3080)
shreyapandit Dec 2, 2022
1e23a3f
refactoring : using with statement (#3286)
maldil Dec 2, 2022
19efadf
Update local_requirements.txt PyYAML version (#3095)
shreyapandit Dec 2, 2022
76f7782
feature: Update TF 2.9 and TF 2.10 inference DLCs (#3465)
arjkesh Dec 2, 2022
fde0738
feature: Added transform with monitoring pipeline step in transformer…
keshav-chandak Dec 2, 2022
7f9f3b0
fix: Fix bug forcing uploaded tar to be named sourcedir (#3412)
claytonparnell Dec 2, 2022
5d59767
feature: Add Code Owners file (#3503)
navinsoni Dec 2, 2022
0f5cf18
prepare release v2.119.0
Dec 3, 2022
f1f0013
update development version to v2.119.1.dev0
Dec 3, 2022
bb4b689
feature: Add DXB region to frameworks by DLC (#3387)
RadhikaB-97 Dec 5, 2022
b68bcd9
fix: support idempotency for framework and spark processors (#3460)
brockwade633 Dec 5, 2022
32969da
feature: Update registries with new region account number mappings. (…
kenny-ezirim Dec 6, 2022
767da0a
feature: Adding support for SageMaker Training Compiler in PyTorch es…
Lokiiiiii Dec 7, 2022
d779d1b
feature: Add Neo image uri config for Pytorch 1.12 (#3507)
HappyAmazonian Dec 7, 2022
83327fb
prepare release v2.120.0
Dec 7, 2022
5bffb04
update development version to v2.120.1.dev0
Dec 7, 2022
b828396
feature: Algorithms Region Expansion OSU/DXB (#3508)
malav-shastri Dec 7, 2022
357f732
fix: Add constraints file for apache-airflow (#3510)
navinsoni Dec 7, 2022
a28d1dd
fix: FrameworkProcessor S3 uploads (#3493)
brockwade633 Dec 8, 2022
11d2475
prepare release v2.121.0
Dec 8, 2022
24171b5
update development version to v2.121.1.dev0
Dec 8, 2022
d5847d5
Fix: Differentiate SageMaker Training Compiler's PT DLCs from base PT…
Lokiiiiii Dec 8, 2022
3f6ea88
fix: Fix failing jumpstart cache unit tests (#3514)
evakravi Dec 8, 2022
4570aa6
fix: Pop out ModelPackageName from pipeline definition (#3472)
qidewenwhen Dec 9, 2022
959ea1a
prepare release v2.121.1
Dec 9, 2022
b2e8b66
update development version to v2.121.2.dev0
Dec 9, 2022
355975d
fix: Skip Bad Transform Test (#3521)
amzn-choeric Dec 9, 2022
fadc817
fix: Revert "fix: type hint of PySparkProcessor __init__" (#3524)
mufaddal-rohawala Dec 9, 2022
c5fc93f
change: Update for Tensorflow Serving 2.11 inference DLCs (#3509)
hballuru Dec 9, 2022
ec8da98
prepare release v2.121.2
Dec 12, 2022
0352122
update development version to v2.121.3.dev0
Dec 12, 2022
aabe852
Add use_accl option to pytorchddp distribution
vishwakaria Sep 7, 2022
cf54133
Update API spec to use communication_options
vishwakaria Nov 24, 2022
35e72a8
Remove extra whitespace in logs
vishwakaria Nov 28, 2022
6b52088
Fix unit tests
vishwakaria Nov 28, 2022
690a1d3
Update comments
vishwakaria Nov 28, 2022
c7e7f0d
Update logs
vishwakaria Nov 29, 2022
558e3f8
Update warning logs
vishwakaria Dec 5, 2022
83d754e
documentation: add doc for smddp collectives
mchoi8739 Dec 12, 2022
67 changes: 67 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,72 @@
# Changelog

## v2.121.2 (2022-12-12)

### Bug Fixes and Other Changes

* Update for Tensorflow Serving 2.11 inference DLCs
* Revert "fix: type hint of PySparkProcessor __init__"
* Skip Bad Transform Test

## v2.121.1 (2022-12-09)

### Bug Fixes and Other Changes

* Pop out ModelPackageName from pipeline definition
* Fix failing jumpstart cache unit tests

## v2.121.0 (2022-12-08)

### Features

* Algorithms Region Expansion OSU/DXB

### Bug Fixes and Other Changes

* FrameworkProcessor S3 uploads
* Add constraints file for apache-airflow

## v2.120.0 (2022-12-07)

### Features

* Add Neo image uri config for Pytorch 1.12
* Adding support for SageMaker Training Compiler in PyTorch estimator starting 1.12
* Update registries with new region account number mappings.
* Add DXB region to frameworks by DLC

### Bug Fixes and Other Changes

* support idempotency for framework and spark processors

## v2.119.0 (2022-12-03)

### Features

* Add Code Owners file
* Added transform with monitoring pipeline step in transformer
* Update TF 2.9 and TF 2.10 inference DLCs
* make estimator accept json file as modelparallel config
* SageMaker Training Compiler does not support p4de instances
* Add support for SparkML v3.3

### Bug Fixes and Other Changes

* Fix bug forcing uploaded tar to be named sourcedir
* Update local_requirements.txt PyYAML version
* refactoring : using with statement
* Allow Py 3.7 for MMS Test Docker env
* fix PySparkProcessor __init__ params type
* type hint of PySparkProcessor __init__
* Return ARM XGB/SKLearn tags if `image_scope` is `inference_graviton`
* Update scipy to 1.7.3 to support M1 development envs
* Fixing type hints for Spark processor that has instance type/count params in reverse order
* Add DeepAR ap-northeast-3 repository.
* Fix AsyncInferenceConfig documentation typo
* fix ml_inf to ml_inf1 in Neo multi-version support
* Fix type annotations
* add neo mvp region accounts

## v2.118.0 (2022-12-01)

### Features
1 change: 1 addition & 0 deletions CODEOWNERS
@@ -0,0 +1 @@
* @aws/sagemaker-ml-frameworks
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
2.118.1.dev0
2.121.3.dev0
70 changes: 66 additions & 4 deletions doc/frameworks/pytorch/using_pytorch.rst
@@ -140,6 +140,8 @@ The function of installing packages using ``requirements.txt`` is supported for

A ``requirements.txt`` file is a text file that contains a list of items that are installed by using ``pip install``. You can also specify the version of an item to install. For information about the format of a ``requirements.txt`` file, see `Requirements Files <https://pip.pypa.io/en/stable/user_guide/#requirements-files>`__ in the pip documentation.
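For example, a minimal ``requirements.txt`` might look like the following (the package names and version pins here are illustrative, not taken from this repository):

```text
numpy>=1.21
pandas==1.3.5
scipy
```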

-----

Create an Estimator
===================

@@ -161,7 +163,7 @@ directories ('train' and 'test').
'test': 's3://my-data-bucket/path/to/my/test/data'})



-----

Call the fit Method
===================
@@ -196,6 +198,7 @@ fit Optional Arguments
- ``logs``: Defaults to True, whether to show logs produced by training
job in the Python session. Only meaningful when wait is True.

-----

Distributed PyTorch Training
============================
@@ -212,7 +215,7 @@ with the ``pytorchddp`` option as the distribution strategy.
.. note::

This PyTorch DDP support is available
in the SageMaker PyTorch Deep Learning Containers v1.12 and later.
in the SageMaker PyTorch Deep Learning Containers v1.11 and later.

Adapt Your Training Script
--------------------------
@@ -238,7 +241,6 @@ but you can also overwrite them.

**Supported backends:**

- ``gloo`` and ``tcp`` for CPU instances
- ``gloo`` and ``nccl`` for GPU instances

Launching a Distributed Training Job
@@ -268,7 +270,7 @@ during the PyTorch DDP initialization.
For more information about setting up PyTorch DDP in your training script,
see `Getting Started with Distributed Data Parallel
<https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_ in the
PyTorch documentation.
*PyTorch documentation*.

The following example shows how to run a PyTorch DDP training in SageMaker
using two ``ml.p4d.24xlarge`` instances:
@@ -293,6 +295,64 @@ using two ``ml.p4d.24xlarge`` instances:

pt_estimator.fit("s3://bucket/path/to/training/data")

SMDDP collectives for PyTorch DDP
---------------------------------

Starting with PyTorch 1.12.1, when you run PyTorch DDP training jobs on SageMaker,
you can benefit from performance gains by using *SMDDP collectives*
with *no code changes* in your PyTorch training script.
*SMDDP collectives* are the
collective communication primitives offered by the `SageMaker data parallelism library
<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html>`_.
The SMDDP collectives provide a GPU-based
AllReduce operation optimized for the AWS network infrastructure and the Amazon EC2
instance topology. These collectives are designed to address suboptimal performance
in PyTorch DDP distributed training when using NCCL, the general-purpose collective communications library
from NVIDIA, on AWS infrastructure.

The SMDDP collectives are enabled by default for the ``pytorchddp`` distribution option
if the job configuration is supported. To achieve optimal training performance with
faster communication, SageMaker compares the performance of the SMDDP and NCCL
collectives at runtime and automatically selects the backend that performs better.

This means that by default the backend parameter of the ``communication_options`` for
``pytorchddp`` distribution is set to ``auto``. The goal of the automated backend
selection is to abstract away the overhead of deciding the right communication backend
for your training job, while ensuring that you get the best possible training performance
at a lower cost.

The following code shows an example of setting up a PyTorch estimator
to run a PyTorch DDP training job on two ``ml.p4d.24xlarge`` instances
with the communication backend set to ``auto``.

.. code:: python

    from sagemaker.pytorch import PyTorch

    pt_estimator = PyTorch(
        entry_point="train_ptddp.py",
        role="SageMakerRole",
        framework_version="1.12.1",
        py_version="py38",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        distribution={
            "pytorchddp": {
                "enabled": True,
                "communication_options": {  # Optional. This is set to auto by default.
                    "backend": "auto"
                }
            }
        }
    )

    pt_estimator.fit("s3://bucket/path/to/training/data")

Review comment on lines +342 to +344

Owner: Can we make these lines bold?

Collaborator (author): Unfortunately, Sphinx code blocks don't support adding styles.

Collaborator (author, mchoi8739, Dec 12, 2022): In a code block, every character is rendered as literal code, so the bold markup itself would be shown instead of bold styling being applied to those lines.

If you want to turn off the automatic backend selection and force the use of NCCL,
you can explicitly set ``"backend": "nccl"`` as the communication backend.
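For instance, only the ``distribution`` dictionary would change relative to the estimator example above; a hypothetical sketch of that dictionary on its own:

```python
# Sketch: force NCCL instead of the default automatic backend selection.
# Only this dict differs from the ``auto`` example above; all other
# estimator arguments stay the same.
distribution = {
    "pytorchddp": {
        "enabled": True,
        "communication_options": {
            "backend": "nccl"  # "auto" (the default) lets SageMaker pick SMDDP or NCCL
        },
    }
}
```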

-----

.. _distributed-pytorch-training-on-trainium:

Distributed Training with PyTorch Neuron on Trn1 instances
@@ -408,6 +468,8 @@ on one ``ml.trn1.2xlarge`` instance and two ``ml.trn1.32xlarge`` instances:

pt_estimator.fit("s3://bucket/path/to/training/data")

-----

*********************
Deploy PyTorch Models
*********************
1 change: 1 addition & 0 deletions requirements/extras/test_requirements.txt
@@ -11,6 +11,7 @@ contextlib2==21.6.0
awslogs==0.14.0
black==22.3.0
stopit==1.1.2
# Update tox.ini to have correct version of airflow constraints file
apache-airflow==2.4.1
apache-airflow-providers-amazon==4.0.0
attrs==22.1.0
2 changes: 1 addition & 1 deletion setup.py
@@ -55,7 +55,7 @@ def read_requirements(filename):
"protobuf3-to-dict>=0.1.5,<1.0",
"smdebug_rulesconfig==1.0.1",
"importlib-metadata>=1.4.0,<5.0",
"packaging>=20.0",
"packaging==20.9",
"pandas",
"pathos",
"schema",
64 changes: 59 additions & 5 deletions src/sagemaker/fw_utils.py
@@ -129,9 +129,6 @@
}

PYTORCHDDP_SUPPORTED_FRAMEWORK_VERSIONS = [
"1.10",
"1.10.0",
"1.10.2",
"1.11",
"1.11.0",
"1.12",
@@ -145,6 +142,8 @@

TRAINIUM_SUPPORTED_DISTRIBUTION_STRATEGIES = ["torch_distributed"]

SMDDP_COLLECTIVES_SUPPORTED_FRAMEWORK_VERSIONS = ("1.12.1",)
SMDDP_COLLECTIVES_SUPPORTED_INSTANCE_TYPES = ("ml.p4d.24xlarge",)

SMDISTRIBUTED_SUPPORTED_STRATEGIES = ["dataparallel", "modelparallel"]

@@ -493,7 +492,7 @@ def framework_name_from_image(image_uri):
# We must support both the legacy and current image name format.
name_pattern = re.compile(
r"""^(?:sagemaker(?:-rl)?-)?
(tensorflow|mxnet|chainer|pytorch|scikit-learn|xgboost
(tensorflow|mxnet|chainer|pytorch|pytorch-trcomp|scikit-learn|xgboost
|huggingface-tensorflow|huggingface-pytorch
|huggingface-tensorflow-trcomp|huggingface-pytorch-trcomp)(?:-)?
(scriptmode|training)?
@@ -1043,7 +1042,6 @@ def validate_torch_distributed_distribution(
if not torch_distributed_enabled:
# Distribution strategy other than torch_distributed is selected
return

err_msg = ""
if not image_uri:
# ignore framework_version and py_version if image_uri is set
@@ -1082,6 +1080,62 @@
raise ValueError(err_msg)


def validate_smddp_collectives_support(
framework_version, py_version, image_uri, instance_type, instance_count
):
"""Check if SMDDP collective backend is supported for current invocation.

Args:
framework_version (str): A string representing the framework version selected.
py_version (str): A string representing the python version selected.
image_uri (str): A string representing a Docker image URI.
instance_type (str): SageMaker instance type.
instance_count (int): Number of training instances to use.

Returns false if:
`instance_type` is not in SMDDP_COLLECTIVES_SUPPORTED_INSTANCE_TYPES or
`py_version` is not python3 or
`framework_version` is not in SMDDP_COLLECTIVES_SUPPORTED_FRAMEWORK_VERSIONS or
`instance_count` is not greater than 1
"""
err_msg = ""
if not image_uri:
# ignore framework_version and py_version if image_uri is set
# in case image_uri is not set, then both are mandatory
if framework_version not in SMDDP_COLLECTIVES_SUPPORTED_FRAMEWORK_VERSIONS:
err_msg += (
f"Provided framework_version {framework_version} is not supported. "
"Please specify one of the supported framework versions:"
f" {SMDDP_COLLECTIVES_SUPPORTED_FRAMEWORK_VERSIONS}.\n"
)
if "py3" not in py_version:
err_msg += (
f"Provided py_version {py_version} is not supported. "
"Please specify py_version>=py3.\n"
)
if instance_type not in SMDDP_COLLECTIVES_SUPPORTED_INSTANCE_TYPES:
err_msg += (
f"Provided instance_type {instance_type} is not supported. "
"Please specify one of the supported instance types:"
f"{SMDDP_COLLECTIVES_SUPPORTED_INSTANCE_TYPES}.\n"
)
if instance_count == 1:
# Communication backend auto is not supported for single-node jobs
err_msg += (
"SMDDP Collective backend is not supported for single-node jobs. "
"Please increase instance_count to be greater than 1.\n"
)
if not err_msg:
return True
logger.warning(
"The system is not compatible or not configured to run SMDDP collectives optimized"
" for AWS infrastructure.\n%s",
err_msg,
)
logger.warning("Continuing model training with default NCCL communication backend.\n")
return False
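The checks above can be sketched as a minimal, standalone function (local stand-in names, not the real ``sagemaker.fw_utils`` module; warning logs are omitted for brevity):

```python
# Standalone sketch of the SMDDP-collectives support check shown in the
# diff above. The constants mirror the values added to fw_utils.
SMDDP_COLLECTIVES_SUPPORTED_FRAMEWORK_VERSIONS = ("1.12.1",)
SMDDP_COLLECTIVES_SUPPORTED_INSTANCE_TYPES = ("ml.p4d.24xlarge",)


def smddp_collectives_supported(
    framework_version, py_version, image_uri, instance_type, instance_count
):
    """Return True only when every SMDDP-collectives requirement is met."""
    if not image_uri:
        # framework_version and py_version only matter when image_uri is unset
        if framework_version not in SMDDP_COLLECTIVES_SUPPORTED_FRAMEWORK_VERSIONS:
            return False
        if "py3" not in py_version:
            return False
    if instance_type not in SMDDP_COLLECTIVES_SUPPORTED_INSTANCE_TYPES:
        return False
    if instance_count == 1:
        # the auto communication backend needs a multi-node job
        return False
    return True
```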


def python_deprecation_warning(framework, latest_supported_version):
"""Placeholder docstring"""
return PYTHON_2_DEPRECATION_WARNING.format(
5 changes: 2 additions & 3 deletions src/sagemaker/git_utils.py
@@ -279,9 +279,8 @@ def _run_clone_command(repo_url, dest_dir):
subprocess.check_call(["git", "clone", repo_url, dest_dir], env=my_env)
elif repo_url.startswith("git@"):
with tempfile.NamedTemporaryFile() as sshnoprompt:
write_pipe = open(sshnoprompt.name, "w")
write_pipe.write("ssh -oBatchMode=yes $@")
write_pipe.close()
with open(sshnoprompt.name, "w") as write_pipe:
write_pipe.write("ssh -oBatchMode=yes $@")
os.chmod(sshnoprompt.name, 0o511)
my_env["GIT_SSH"] = sshnoprompt.name
subprocess.check_call(["git", "clone", repo_url, dest_dir], env=my_env)
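The refactored ``with``-statement pattern from this hunk can be sketched in isolation as follows (the git invocation itself is elided, and the readback at the end exists only to demonstrate the result):

```python
import os
import tempfile

# Nested `with` blocks: the inner one guarantees the wrapper script is
# flushed and closed before anything reads it, and the outer one removes
# the temporary file when the block exits, even on error.
with tempfile.NamedTemporaryFile() as sshnoprompt:
    with open(sshnoprompt.name, "w") as write_pipe:
        write_pipe.write("ssh -oBatchMode=yes $@")
    os.chmod(sshnoprompt.name, 0o511)  # executable wrapper for GIT_SSH
    with open(sshnoprompt.name) as readback:
        wrapper = readback.read()
```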